Anthropic released Claude Mythos Preview on April 7, 2026 - and immediately announced it won’t be publicly available.
The reason? It’s too dangerous.
Despite being the most powerful AI model yet, scoring double-digit improvements over competitors across nearly every benchmark, Mythos is restricted to just 12 major tech companies for defensive cybersecurity work through Project Glasswing. But before diving into why, let’s look at what makes Mythos so exceptionally capable.
Record-Breaking Benchmarks
Claude Mythos demolished previous performance records across coding, mathematics, and reasoning tasks:
Coding: The Benchmark It Dominated
- SWE-bench Verified: 93.9% (13.1 points ahead of Opus 4.6’s 80.8%)
- SWE-bench Pro: 77.8% (20.1 points ahead of GPT-5.4)
- SWE-bench Multimodal: 59.0% (more than double Opus 4.6’s 27.1%)
- Terminal-Bench 2.0: 82.0% standard, 92.1% with extended timeout
These aren’t marginal improvements - they’re dramatic leaps that establish Mythos as the clear leader for software engineering tasks.
Mathematics: Where It Truly Excels
- USAMO 2026: 97.6% (a staggering 55.3-point jump over Opus 4.6’s 42.3%)
- GPQA Diamond: 94.5%
The USAMO improvement is particularly striking - a 55-point gap suggests Mythos has fundamentally different reasoning capabilities than current models.
Reasoning and Agentic Tasks
- HLE with tools: 64.7% (12.6 points above GPT-5.4)
- GraphWalks BFS (million-token contexts): 80.0%
- CharXiv Reasoning: 93.2% with tools
Head-to-Head: Mythos vs. the Competition
The benchmark data tells a consistent story: Mythos beats GPT-5.4 on every shared benchmark and leads Opus 4.6 on “nearly every benchmark.”
| Benchmark | Mythos | Opus 4.6 | GPT-5.4 | Mythos Lead |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | - | +13.1 pts |
| SWE-bench Pro | 77.8% | - | 57.7% | +20.1 pts |
| USAMO 2026 | 97.6% | 42.3% | 95.2% | +55.3 vs Opus |
| GPQA Diamond | 94.5% | - | - | - |
| Terminal-Bench 2.0 | 82.0% | - | - | - |
In absolute terms, Mythos hasn’t just pushed the frontier - it’s redefined where the frontier is.
The Memorization Question
A natural skepticism: could Mythos simply memorize its training data better? Anthropic addressed this with “extensive memorization screening,” filtering flagged potential contamination and testing models on novel “remix versions” of original questions. Result: Mythos maintained its lead at every level, even scoring higher on remixed questions than originals.
This suggests genuine capability gains, not data leakage.
Why You Can’t Use It: Project Glasswing
Despite dominating every benchmark, Mythos remains unavailable to the broader public. Instead, Anthropic partnered with 12 major technology and finance companies - including Amazon, Apple, Google, Microsoft, and Nvidia - through Project Glasswing.
The mission: use Mythos exclusively for defensive cybersecurity research, backed by $100M in usage credits from Anthropic.
The Cybersecurity Reality
Why the restriction? Mythos autonomously discovered thousands of zero-day vulnerabilities across:
- Every major operating system
- Every major web browser
- One OpenBSD bug that had existed unchallenged for 27 years
An AI this capable at finding exploits isn’t something you release to the internet and hope for the best.
What This Means for the AI Industry
Claude Mythos represents a shift in how frontier AI models are deployed. The reasoning:
-
Raw capability matters less than control: A model 20 points ahead on benchmarks but publicly available creates less risk than a model 55 points ahead but tightly controlled.
-
Security is becoming a deployment constraint: Just as some biotech research remains restricted, the most powerful AI systems may need access controls built into their business model.
-
Benchmarks alone don’t tell the story: Mythos proves you can achieve dominant performance and still have legitimate reasons to restrict access.
Looking Forward
Claude Mythos Preview shows that Anthropic can build models that significantly outpace competition - while being transparent about risks. Whether Project Glasswing proves that restricted deployment can work at scale, or whether it’s a temporary measure before public release, remains to be seen.
What’s clear: we’ve entered an era where “the best model” and “the publicly available model” may be fundamentally different things.
Sources: