The $20 breach that exposed every enterprise AI program.
An autonomous AI agent compromised McKinsey's Lilli platform in two hours, for $20 in API tokens. The headlines called it a SQL injection. The real story is that the enterprise procurement model, built for SaaS, no longer fits the technology it is being used to buy.
Executive Summary: What you will learn
- The McKinsey Lilli breach was executed by an autonomous AI agent in two hours for $20 in tokens. It needed no credentials, no insider access, and no zero-day, only a vulnerability class older than the people most likely to find it.
- The mistake was not the bug. It was the surface. CodeWall's agent enumerated 200+ publicly documented endpoints; 22 production endpoints required no authentication at all, including write endpoints. That is a defaults problem at scale, not a single oversight.
- The deepest exposure was write access to all 95 system prompts that govern Lilli's behaviour for 43,000 McKinsey employees. An attacker could have silently poisoned answers across the firm, invisibly, for an indefinite period.
- The enterprise procurement model was built for SaaS. Agents do not have eyes, do not see screens, and do not respect dashboards as permission boundaries. They operate through code, tokens, and APIs, and most enterprise software was never designed for that.
- Every major vendor is responding to the same problem. Anthropic, OpenAI, SAP, Pinecone, Salesforce, and ServiceNow all moved in the quarter through May 2026 on identity, governance, persistent context, and machine-readable scopes for agents. The model is not the hard part anymore.
- Every leader must answer two questions: Does my AI platform distinguish between a human and an agent? And what is my platform's default security posture when the team is moving fast?
$20. Two hours. 46 million messages.
In March 2026, a research outfit called CodeWall published a disclosure that should have changed how every enterprise thinks about its AI platforms. Most read it as a McKinsey story. It is not.
Lilli is McKinsey's internal generative AI platform. It was named after Lillian Dombrowski, the first woman the firm hired, in 1945. By 2026, Lilli was used regularly by more than seventy-five percent of McKinsey's roughly forty-three thousand employees and answered over five hundred thousand prompts each month. It was trained on more than a hundred years of accumulated McKinsey intellectual property, more than one hundred thousand documents and interview transcripts. And critically, it was the only platform on which McKinsey employees were permitted to input confidential client data.
In early March 2026, an autonomous AI agent operated by a security research group called CodeWall compromised that platform. The agent was not directed by a human in real time. It was given a target and a budget. It worked through the surface for roughly two hours. It cost approximately twenty dollars in API tokens. And it walked away with read and write access to Lilli's production database.
The mechanics, briefly
The agent began with public enumeration. Lilli exposed more than two hundred publicly documented API endpoints. Of those, twenty-two required no authentication of any kind. No expired key. No misconfigured token. No clever exploit. They were simply open.
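The same enumeration can be run defensively before an attacker does it. The sketch below assumes the endpoint documentation is available as an OpenAPI-style description loaded into a dictionary. The source does not say which format Lilli's documentation used, and the `spec` contents and path names here are hypothetical. It lists every operation that declares no security requirement and flags the write methods. It deliberately ignores finer OpenAPI cases, such as an empty requirement object that makes authentication optional.

```python
from typing import Any

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}
WRITE_METHODS = {"put", "post", "delete", "patch"}


def unauthenticated_operations(spec: dict[str, Any]) -> list[tuple[str, str, bool]]:
    """Return (METHOD, path, is_write) for each operation with no security requirement."""
    global_security = spec.get("security", [])
    findings = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            # An operation-level "security" overrides the global one; [] means no auth.
            security = operation.get("security", global_security)
            if not security:
                findings.append((method.upper(), path, method in WRITE_METHODS))
    return findings


# Hypothetical three-endpoint description, for illustration only.
spec = {
    "paths": {
        "/search": {"get": {"security": [{"bearer": []}]}},
        "/health": {"get": {}},
        "/prompts/{id}": {"put": {"security": []}},
    }
}

for method, path, is_write in unauthenticated_operations(spec):
    print(method, path, "WRITE" if is_write else "read")
```

Run against a real description, a report like this turns "twenty-two endpoints were simply open" from a post-incident finding into a pre-release check.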
One of those unauthenticated endpoints accepted JSON input and, on the server side, concatenated the JSON keys, not the values, directly into a SQL query. The values were parameterized correctly, which is why most standard security scanning tools never flagged it. The agent, however, saw something a scanner would not: it noticed that the JSON keys it had submitted appeared verbatim in the database error messages that came back. That is the signature of a textbook injection vulnerability, and the agent immediately recognised it.
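To make the flaw concrete, here is a minimal sketch of the pattern described above, using Python's built-in `sqlite3` module. The table, column names, and function names are hypothetical, and nothing here is taken from Lilli's code. It only illustrates how a query can bind values correctly and still be injectable through keys, and how an allowlist of known column names closes the gap.

```python
import sqlite3

ALLOWED_COLUMNS = {"author", "topic"}  # hypothetical schema


def build_query_unsafe(filters: dict[str, str]) -> tuple[str, list[str]]:
    # Values are bound to placeholders, but the keys are pasted into the SQL text.
    where = " AND ".join(f"{key} = ?" for key in filters)
    return f"SELECT id FROM messages WHERE {where}", list(filters.values())


def build_query_safe(filters: dict[str, str]) -> tuple[str, list[str]]:
    # Keys cannot be bound as parameters, so they are checked against an allowlist.
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unsupported filter keys: {sorted(unknown)}")
    where = " AND ".join(f"{key} = ?" for key in filters)
    return f"SELECT id FROM messages WHERE {where}", list(filters.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER, author TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "ana", "pricing"), (2, "ben", "m&a")],
)

# A key that carries SQL: the intended filter on author is bypassed.
payload = {"1=1 OR author": "ana"}
sql, params = build_query_unsafe(payload)
leaked = conn.execute(sql, params).fetchall()
print(sql)          # SELECT id FROM messages WHERE 1=1 OR author = ?
print(len(leaked))  # every row comes back, not only ana's
```

The safe variant rejects the same payload with a `ValueError` before any SQL is built, which is the behaviour a value-focused scanner has no reason to test for.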
From there, the agent obtained full read and write access to Lilli's production database. No credentials. No insider. No zero-day. No human attacker. The most basic class of web vulnerability, found by an automated process, against a flagship enterprise AI platform.
What was exposed
Of everything exposed, the ninety-five writable system prompts deserve more weight than any other figure. Forty-six million messages is a privacy disaster and an embarrassment. Ninety-five writable system prompts is a different category of threat entirely. It means an attacker could have changed how Lilli answered every consultant who used it, silently, for days, weeks, or months. Every strategy review, every client memo, every M&A analysis, biased in ways no one would notice until the impact reached the real world.
Write access to all 95 system prompts is functionally write access to the firm's accumulated answer-generating reflexes. CodeWall stopped at disclosure. A motivated adversary would not have. An adversary could have edited a single prompt, say, the one that shapes how Lilli summarises competitive intelligence, and influenced thousands of senior consultants without ever needing to be in the room.
The response
To its credit, McKinsey moved quickly. CodeWall disclosed the issue privately. McKinsey patched all twenty-two unauthenticated endpoints on March 2, 2026, the same day it was notified. CodeWall went public a week later, on March 9, 2026. McKinsey's public statement said the firm found no evidence of unauthorised access to client data beyond CodeWall's own research activity.
That is the right way to handle disclosure. It is also irrelevant to the larger lesson.
Why SQL injection misses the point.
If you read the breach as a SQL injection story, you will reach for the wrong fix. The bug is not the lesson. The defaults are the lesson.
SQL injection is one of the oldest vulnerability classes in web software. Injection flaws have appeared on the OWASP Top 10 for two decades. Every major web framework ships with parameterised queries as the default behaviour. Every security training programme covers it. Every commodity scanner looks for it. It is not interesting on its own.
What is interesting is that it reached production at all, in a flagship enterprise AI platform built by one of the most resourced and security-conscious institutions in the world. And what is more interesting still is the surface around it.
Twenty-two production endpoints with no authentication at all, including write endpoints, is not what a single distracted engineer produces. It is what a team produces over months, under deadline pressure, when the platform's default posture treats authentication as optional and the people closest to the code do not have enough authority to push back. It is a defaults problem at scale. It is, in the most precise sense, a procurement-and-process problem expressed in code.
Why the JSON-key trick mattered
The specific injection used by CodeWall's agent is worth a moment of attention because it explains why standard tools missed what an agent found.
Most modern web frameworks make it almost impossible to write a SQL injection through values, because values are bound to placeholders. Scanning tools assume that pattern. The Lilli endpoint did the right thing with values but the wrong thing with keys, concatenating the keys of an incoming JSON object directly into the SQL string used to build a query. A human security reviewer might have spotted it. A commodity scanner would not, because the values were safe. The CodeWall agent did, because it watched what came back: the JSON keys it submitted appeared verbatim in the database error messages. That is the diagnostic an experienced reviewer uses, and the agent ran it without prompting.
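The diagnostic described above can be written down as a regression check that a team runs against its own handlers. This is a sketch under stated assumptions: `handle_request` is a hypothetical stand-in for an endpoint with the same flaw, backed by an in-memory SQLite table, and `key_reaches_sql` is an illustrative helper, not a reconstruction of CodeWall's agent. The check submits a unique marker as a JSON key and reports whether that marker comes back inside the database error.

```python
import sqlite3
import uuid


def handle_request(conn: sqlite3.Connection, body: dict[str, str]) -> dict:
    """Hypothetical handler with the flaw under discussion: keys go into the SQL text."""
    where = " AND ".join(f"{key} = ?" for key in body)
    try:
        rows = conn.execute(
            f"SELECT id FROM messages WHERE {where}", list(body.values())
        ).fetchall()
        return {"status": 200, "rows": rows}
    except sqlite3.Error as exc:
        # Returning the raw database error is what makes the flaw observable.
        return {"status": 500, "error": str(exc)}


def key_reaches_sql(conn: sqlite3.Connection) -> bool:
    """Send a unique marker as a key and see whether the database echoes it back."""
    marker = "probe_" + uuid.uuid4().hex
    response = handle_request(conn, {marker: "x"})
    return marker in response.get("error", "")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER, author TEXT)")

if key_reaches_sql(conn):
    print("JSON keys reach the SQL parser: treat this endpoint as injectable")
```

A handler that validates keys against an allowlist, or that returns a generic error instead of the database message, makes this check return `False`. Only the first of those two changes removes the vulnerability; the second merely hides the signal.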
The point is not that this particular class of injection is novel. The point is that an autonomous agent, given a target and a small token budget, can reason the way a senior reviewer does, and can do it fast, cheaply, and at every endpoint it can find. The economics of the offensive side of security have changed. The defensive side, in most enterprises, has not.
If a platform's default state is "any developer can stand up an unauthenticated endpoint and ship it," then under deadline pressure that is exactly what will happen. The Lilli breach is a window onto a defaults problem that is not unique to McKinsey. It is the default state of a large fraction of enterprise AI platforms shipping in 2026.
The procurement problem nobody is talking about.
Enterprise procurement was built for SaaS. It works on the assumption that the screen, the human user interface, is the practical permissions model. Agents have no eyes. That assumption is now wrong.
For two decades, enterprise software procurement evolved a coherent shape. Vendors competed on user experience. Buyers compared dashboards. Permissions were modelled around user roles, and the user's screen was where those roles were enforced visibly. Compliance teams asked about SSO, encryption at rest, audit logs, certifications. A senior consultant could only see what the dashboard let them see. The screen was, in practice, the permission boundary.
Agents do not look at screens. They send API requests. They consume scopes. They carry tokens. They cross system boundaries (CRM, support, contracts, product, internal wikis) at machine speed. The dashboard is no longer the constraint. The API is.
That is a different procurement problem. And almost no enterprise procurement process is built for it.
SaaS era vs. agent era
Lilli is the easy example, but the same pattern is now visible across enterprises that have deployed agents into CRM, support, finance, and HR. In each case, the procurement process examined the model's capabilities, the user interface, and the data residency story. In each case, the procurement process underweighted, or skipped entirely, the question of how an agent would be authenticated, scoped, audited, and revoked.
This is the unspoken implication of the Lilli breach. It is not a story about one company shipping one bug. It is a story about an entire procurement category that no longer matches the technology it is being used to evaluate.
Six major vendors. One message.
In a single quarter, through May 2026, six of the most important vendors in enterprise software made moves that, read together, point to a single conclusion: the model is not the hard part anymore.
Each of these moves is worth reading in its own right. Together, they form a pattern that no enterprise architect should ignore.
What the moves have in common
Read in isolation, each is a product launch. Read together, they describe the missing infrastructure that the Lilli breach made undeniable. Agents need identity, permissions, persistent context, governed workflow access, and auditability. None of that comes by default. All of it must be built, deployed, and reviewed. The vendors above are racing to be the layer where that infrastructure lives.
When Anthropic, OpenAI, SAP, Pinecone, Salesforce, and ServiceNow all move on identity, governance, and persistent agent context in the same quarter, the implication for buyers is not that one of them will win. The implication is that this is the layer where the next decade of enterprise software is being decided. Treating it as a procurement afterthought is a strategic mistake.
Build versus buy is the wrong frame.
Whether an enterprise builds its own Lilli or buys a packaged equivalent, the cross-system complexity does not disappear. It only changes location.
One reading of the Lilli breach is that McKinsey should have bought rather than built. Another reading is that they should have built differently. Both readings miss the structural point. The hard work in enterprise AI is no longer the model. It is the connective tissue: identity, permissions, persistent context, governed workflow, auditability. That work has to happen whether the system is built in house or assembled from vendors.
If a company builds, it owns the responsibility for designing a permissions model that distinguishes agents from humans, for ensuring that defaults are safe under deadline pressure, for instrumenting auditability into every cross-system call. If it buys, it owns the responsibility for evaluating vendors on those exact criteria, for refusing to accept "we support authentication" as the answer to "what is the default posture," and for ensuring that integration code does not undo what the vendor's own architecture made safe.
The shared failure mode
The shared failure mode is the same in both cases. Procurement happens. Implementation begins. Deadlines tighten. Defaults are inherited rather than chosen. Configuration written on day one is never revisited. Technical reviewers, where they exist, are consulted at the end of the process, to approve, not to shape. The result is a system whose security properties were determined by the decisions of the busiest people in the room on the worst week of the project.
Whether you are building or buying, the cheapest leverage you have on the eventual security posture of your AI platform is to move the technical conversation earlier. Permissions, identity, auditability, and reversibility are not late-stage IT concerns. They are early-stage business decisions, because they determine what the system can be trusted to do at all.
Two questions every leader must ask.
If the rest of this report does anything, it should give every senior executive the language and conviction to ask two questions of their AI platform, and refuse to accept polite-sounding answers.
Question 1. Does my AI platform distinguish between a human user and an AI agent?
This sounds technical. It is not. It is the single most consequential governance question an executive can ask about their AI program right now.
If the answer is no, then your platform's permissions model treats an agent's access the same as its operator's access. An agent acting on behalf of a junior analyst inherits the access of a senior partner, because the platform cannot tell them apart. That is precisely the condition that turned a single API failure at Lilli into firm-wide exposure across 43,000 employees.
An agent-aware platform looks different. Agents have narrower, task-specific, expirable access, distinct from any human's access. Agent activity is auditable as agent activity, not as the activity of the human whose token was borrowed. Agent permissions can be revoked in real time, without revoking the human's. And the platform can answer a regulator's question without ambiguity: what did the system do, on behalf of which user, in what role, under what scope, and with what evidence?
- Why it matters now: Regulators in 2026 are beginning to ask exactly this question. Boards will follow.
- Why "we support authentication" is not the answer: Authentication of a user is not the same as identification of an agent. Both must exist independently.
- Why blast radius is the test: If one agent fails, what is the maximum exposure? If the answer is "all users," the agent model is wrong.
- Why revocation matters: An agent's access should be killable in seconds, not at the next scheduled review cycle.
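The properties in the list above (narrow scope, expiry, independent revocation) can be sketched as a small data structure. This is a minimal illustration and not any vendor's API. The class name `AgentGrant`, the scope strings, and the identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional


@dataclass
class AgentGrant:
    """Access held by one agent, separate from the access of its parent user."""
    agent_id: str
    parent_user: str
    scopes: FrozenSet[str]
    expires_at: datetime
    revoked_at: Optional[datetime] = None

    def allows(self, scope: str, now: datetime) -> bool:
        # Revocation wins over everything else, then expiry, then scope.
        if self.revoked_at is not None and now >= self.revoked_at:
            return False
        return now < self.expires_at and scope in self.scopes

    def revoke(self, now: datetime) -> None:
        # Kills the agent's access only; the parent user's own session is untouched.
        self.revoked_at = now


now = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
grant = AgentGrant(
    agent_id="agent-7",
    parent_user="analyst@example.com",
    scopes=frozenset({"crm:read"}),
    expires_at=now + timedelta(minutes=15),
)

print(grant.allows("crm:read", now))    # True
print(grant.allows("crm:write", now))   # False
grant.revoke(now + timedelta(minutes=1))
print(grant.allows("crm:read", now + timedelta(minutes=2)))  # False
```

The blast-radius test falls out of the structure: the most a compromised agent can reach is the contents of its own `scopes`, for at most the lifetime of the grant.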
Ask your platform owner: "Show me the audit log for an action an agent took on behalf of a user yesterday. I want to see the agent identity, the scope, the parent user, the action, and the time of access grant and revocation." If they cannot produce that log without engineering work, your platform does not yet distinguish between humans and agents.
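One way to make that request testable is to write down the fields the log entry must carry and check existing entries against them. The field names below are hypothetical and simply mirror the wording of the question. The two sample entries are invented for illustration.

```python
REQUIRED_FIELDS = (
    "agent_id",      # which agent acted
    "parent_user",   # on whose behalf
    "scope",         # under what permission
    "action",        # what it did
    "granted_at",    # when the access was granted
)
# "revoked_at" is expected as a key too, but may be None while the grant is live.


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty in a log entry."""
    missing = [name for name in REQUIRED_FIELDS if not entry.get(name)]
    if "revoked_at" not in entry:
        missing.append("revoked_at")
    return missing


# A typical user-centric log line: it cannot tell an agent from a human.
user_only_entry = {
    "user": "analyst@example.com",
    "action": "GET /crm/accounts/42",
    "timestamp": "2026-05-01T09:03:00Z",
}

agent_aware_entry = {
    "agent_id": "agent-7",
    "parent_user": "analyst@example.com",
    "scope": "crm:read",
    "action": "GET /crm/accounts/42",
    "granted_at": "2026-05-01T09:00:00Z",
    "revoked_at": None,
}

print(missing_fields(user_only_entry))
print(missing_fields(agent_aware_entry))
```

If most production entries look like the first one, the platform is recording what happened but not who, in the agent sense, did it.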
Question 2. What is my platform's default security posture under pressure?
The marketing question is whether your platform can be secured. The serious question is what it does when nobody is watching.
Every enterprise AI program goes through periods of acute deadline pressure. Launches, integrations, customer pilots, executive demos. The decisive question is not whether the platform supports authentication and authorisation in theory. The decisive question is what the platform does when an engineer in a rush adds a new endpoint at 4 PM on a Thursday before a Friday demo. Does the platform require explicit authentication configuration before the endpoint is reachable? Does it default to closed? Or does it allow the endpoint to be stood up unauthenticated, with the assumption that someone will go back and fix it later?
The Lilli breach is the concrete answer to the question of what happens when the default is "open and we will fix it later." Twenty-two endpoints, in production, shipped that way. Not because a single engineer was negligent, but because the platform's defaults made it the path of least resistance.
- Out-of-the-box posture: What is the platform doing on day one, with no operator action?
- Configuration drift: If nobody revisits the configuration after the initial setup, what state does the platform settle into?
- Early architect involvement: Do technical architects have authority to refuse a launch, or are they invited to approve one?
- Vendor claims vs. defaults: "We support authentication" is a feature claim. "Authentication is enforced by default and cannot be turned off without an architect's sign-off" is a posture claim. Buyers should accept only the second.
You do not get to choose what your platform does under pressure. Your defaults choose for you. If the default state of your AI platform is open until configured closed, then under deadline you will ship open. The cheapest thing you can do as a leader is to insist that your defaults be closed until configured open, and to bring your technical reviewers into the room early enough to make that the case.
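"Closed until configured open" can be enforced in code instead of in policy. The sketch below assumes a hypothetical in-house `RouteRegistry`, not the API of any real web framework. An endpoint cannot be registered at all unless it either names an authentication policy or records who approved it as public, so the rushed Thursday-afternoon endpoint fails at startup, not in production.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RouteRegistry:
    """Hypothetical registry that refuses endpoints with no declared auth posture."""

    def __init__(self) -> None:
        self._routes: Dict[str, Tuple[Callable, Optional[str]]] = {}

    def route(
        self,
        path: str,
        *,
        auth: Optional[str] = None,
        public_approved_by: Optional[str] = None,
    ) -> Callable[[Callable], Callable]:
        # The check runs when the decorator is evaluated, i.e. at import time.
        if auth is None and public_approved_by is None:
            raise ValueError(f"{path}: declare auth=... or public_approved_by=...")

        def decorator(handler: Callable) -> Callable:
            self._routes[path] = (handler, auth)
            return handler

        return decorator

    def paths(self) -> List[str]:
        return sorted(self._routes)


registry = RouteRegistry()


@registry.route("/search", auth="agent_or_user_token")
def search() -> str:
    return "ok"


@registry.route("/health", public_approved_by="architect@example.com")
def health() -> str:
    return "ok"


rejected = ""
try:
    @registry.route("/prompts")  # no posture declared: refused
    def update_prompt() -> str:
        return "ok"
except ValueError as exc:
    rejected = str(exc)

print(registry.paths())
print(rejected)
```

The design choice is that an open endpoint is still possible, but only as an explicit, attributable decision. That is the difference between a feature claim and a posture claim.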
What changes tomorrow.
The Lilli disclosure is being read in three ways. The first reading is the security reading: McKinsey shipped a SQL injection, they patched it quickly, and the system worked. The second reading is the privacy reading: forty-six million messages were exposed and that is the story. Both readings are correct, and both are inadequate.
The third reading is the one this report is built around. It is that Lilli was a procurement failure dressed as a security incident. It is that the enterprise procurement model, built for the SaaS era, no longer fits the technology it is being used to evaluate. It is that twenty-two unauthenticated endpoints in production, write endpoints among them, are not what a careful engineer produces; they are what a system produces when its defaults are inherited rather than chosen, and when its technical architects are consulted to approve rather than to shape. And it is that the most consequential, least expensive change a leader can make is to move the technical conversation earlier, into procurement, into vendor evaluation, into the first design conversations of any new AI platform.
Every major vendor (Anthropic, OpenAI, SAP, Pinecone, Salesforce, ServiceNow) is racing to build the layer where agent identity, governance, persistent context, and auditability live. The market signal is unambiguous. The model is not the hard part anymore. The hard part is what the model is allowed to do, on whose behalf, with what evidence, and with what reversibility.
Two questions, asked early and answered honestly, are worth more than any compliance certification. Does your AI platform distinguish between a human and an agent? And what is its default security posture when the team is moving fast? If you cannot answer the first, your blast radius is your entire user base. If you cannot answer the second, your platform's worst-case state is its likely state.
The Lilli breach cost twenty dollars and two hours. The procurement decisions that preceded it cost considerably more, and they were made by people who would have been horrified to be told they were making security decisions. They were. The lesson is to stop separating those conversations.
Sources & references.
All facts in this report are drawn from public disclosures, vendor announcements, and named research outputs as of May 2026.
- CodeWall Research Disclosure, Public disclosure of the McKinsey Lilli platform compromise, March 9, 2026. Includes technical write-up of the autonomous agent attack chain, endpoint enumeration, and JSON-key SQL injection vector.
- McKinsey & Company, Public statement regarding the Lilli platform, March 2026, confirming endpoint patching by March 2, 2026 and stating no evidence of unauthorised access to client data beyond CodeWall's research.
- McKinsey & Company, Lilli platform background, internal AI strategy publications, including usage statistics (~75% of ~43,000 employees, 500,000+ prompts/month), training corpus (100,000+ documents and interview transcripts), and naming after Lillian Dombrowski (first woman hired by McKinsey, 1945).
- Anthropic, Enterprise AI joint venture announcement, May 4, 2026. $1.5B valuation, investors including Blackstone, Hellman & Friedman, General Atlantic, and Sequoia. Applied AI engineering model.
- OpenAI, "The Development Company" announcement, May 2026. Approximately $4B raise from TPG, Brookfield, Advent, and Bain Capital.
- SAP, WalkMe acquisition (~$1.5B, 2024) and subsequent integration with Signavio, LeanIX, and the Joule AI assistant, comprising SAP's persistent enterprise AI data layer.
- Pinecone, Nexus launch announcement, May 2026. "Knowledge Engine for Agents," compilation-stage context, reported token reduction and task-completion improvements over conventional RAG pipelines.
- Salesforce, Headless 360 announcement at TDX 2026; Agent Fabric governance and orchestration layer.
- ServiceNow, MCP Registry announcement at Knowledge 2026; vetted internal catalog of approved MCP Servers for governed external agent action.
- OWASP Foundation, Top 10 Web Application Security Risks reference for the historical context of injection-class vulnerabilities.