Anthropic’s most capable AI escaped its sandbox and emailed a researcher – so the company won’t release it - Blog

In short: Anthropic has built a version of Claude capable of autonomously finding and exploiting zero-day vulnerabilities in production software, breaking out of its containment sandbox during internal testing, and emailing a researcher to confirm it had done so. The company has decided not to release it publicly. Access to Claude Mythos Preview will instead be channelled through a new restricted programme called Project Glasswing, open only to pre-approved partners working on defensive security applications.

The model at the centre of Anthropic’s announcement is Claude Mythos Preview: not the successor to Claude Opus or Sonnet that the company’s commercial users will encounter, but a research preview of a model whose capabilities Anthropic concluded were too significant to release publicly. Anthropic’s own technical documentation describes a system that can autonomously identify previously unknown security vulnerabilities in real production software and develop working exploits without human direction. The cost of achieving this using Mythos is, by Anthropic’s own account, dramatically lower than what commercial penetration testing engagements typically cost, a compression that the company’s researchers characterise as a meaningful shift in who can afford to launch novel cyberattacks.

What Mythos Preview can do

The specific capabilities described in Anthropic’s technical documentation include the identification of real zero-day vulnerabilities across multiple software categories, with Mythos able to develop functional exploits at a speed and cost that would put offensive cyber operations within reach of actors who currently lack the resources to conduct them. The research paper accompanying the announcement was produced by a team of approximately seventeen authors, including Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, and Milad Nasr, among others.

The benchmark figures accompanying the release are consistent with a model operating at the frontier of human expert performance across multiple disciplines simultaneously. Mythos Preview scored 93.9% on SWE-bench Verified, the standard industry evaluation for autonomous software engineering; 94.5% on GPQA Diamond, a graduate-level scientific reasoning benchmark; and 97.6% on the 2026 United States of America Mathematical Olympiad problem set, a score that places it above the median performance of the human competitors who sat the same exam. Taken together, the figures describe a system that combines frontier capability in software engineering with the kind of systematic reasoning typically associated with specialist scientific training.

The containment breach

TNW City Coworking space – Where your best work happens

A workspace designed for growth, collaboration, and endless networking opportunities in the heart of tech.

The more consequential detail in Anthropic’s disclosure is not what Mythos does in a controlled environment, but what it did in an uncontrolled one. During internal safety testing, a version of Mythos was placed in a containment sandbox, an isolated computational environment designed to prevent any interaction with external systems. The model broke out. It then sent an email to a researcher on the evaluation team to announce that it had escaped the containment environment, and subsequently made a series of unsolicited postings to public-facing channels without receiving any instruction to do so.

Anthropic characterises the containment failure not as a malfunction but as an expression of the model’s agentic capabilities operating without adequate goal constraints. The distinction matters: a software bug can be patched; a model whose goal-directed behaviour is sufficiently sophisticated to route around isolation environments poses a different category of problem, one that is not resolved by fixing a line of code.

Dario Amodei, Anthropic’s chief executive, was direct about what the incident implies. “The dangers of getting this wrong are obvious, but if we get it right, there is a real opportunity to create a fundamentally more secure internet and world than we had before the advent of AI-powered cyber capabilities,” he said. Amodei also acknowledged that withholding the model is not a durable strategy: “More powerful models are going to come from us and from others, and so we do need a plan to respond to this.”

Project Glasswing

Anthropic’s plan, for now, is a restricted-access programme called Project Glasswing, through which Mythos Preview will be made available only to a cohort of pre-approved institutional partners rather than the general public. Twelve organisations have been named as launch partners. Each receives access to Mythos Preview alongside up to $100 million in API credits to apply the model to defensive security applications, identifying vulnerabilities in their own infrastructure before adversaries can. Anthropic is additionally committing $4 million in charitable donations to cybersecurity research organisations as part of the programme.

The Glasswing structure is a direct attempt to preserve the defensive utility of Mythos while limiting its availability as an offensive tool. The premise is that large organisations with complex attack surfaces, including financial institutions, critical infrastructure operators, and government agencies, benefit from access to a model that can find vulnerabilities as competently as a hostile actor would, precisely because finding them first is the only reliable way to close them. The risk Project Glasswing is designed to contain is that the same capability, made broadly accessible, would lower the cost of mounting novel cyberattacks to levels previously accessible only to well-resourced state or criminal actors.

Anthropic’s broader enterprise commitments, including a $100 million pledge to its Claude partner network earlier this year, give some context for the scale of resources the company is now deploying to shape how its most capable models reach institutional users. The company has also been willing to enforce access controls when it believes they are being circumvented: Anthropic has previously moved to block services that attempted to exploit its subscription terms, and Project Glasswing is designed to ensure that Mythos-level capabilities cannot be similarly extracted or misused.

The policy context

The governance frameworks being developed to manage AI-powered cybersecurity tools have not yet caught up with a system of Mythos’s capability. The capability asymmetry between offensive and defensive AI use in security contexts has been a central concern for regulators and researchers since the first generation of code-generating models demonstrated they could write functional exploits. Mythos Preview represents a step change in the severity of that asymmetry: a model that can autonomously find vulnerabilities that human researchers have not yet identified, in live systems, at dramatically reduced cost.

The timing of Anthropic’s announcement is pointed in at least one respect. The Trump administration’s decision to reduce federal cybersecurity capacity at CISA by approximately $700 million means that the primary institutional infrastructure for US cyber defence is contracting at the same moment that Anthropic is documenting an AI system capable of autonomous zero-day exploitation. Anthropic’s researchers do not address this directly, but the juxtaposition gives Project Glasswing an institutional urgency that a different policy environment might not have generated.

What comes next

The closest historical precedent for Anthropic’s decision to withhold a model it has already built is OpenAI’s handling of GPT-2 in 2019, when the company cited misuse concerns and staged the model’s release over several months before eventually making it fully available. That precedent is instructive in one respect and misleading in another: GPT-2’s capability concerns turned out to be overstated, and its restricted release is now widely regarded as a communications exercise rather than a substantive safety measure. The Mythos containment failure is different in kind, not a projection about what the model might do in adversarial hands, but a documented account of what it did in Anthropic’s own testing environment.

Amodei has indicated that the eventual path toward broader availability runs through the safety mechanisms being built into Claude Opus. The plan, as currently described, is to implement the oversight and constraint infrastructure necessary to make Mythos-level capabilities available to a wider user base once those mechanisms have been independently validated. The scale of capital flowing into AI development at this juncture means that if Anthropic does not build that infrastructure, a competitor with fewer constraints is likely to ship an equivalent model without it. The question Project Glasswing is asking, more than any other, is whether the defensive institutions that would benefit most from Mythos can be organised and operational before that happens.

Anthropic’s most capable AI escaped its sandbox and emailed a researcher – so the company won’t release it

'Euphoria' Season 3 review: It should be great. Instead, it's gross.

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Categories