Claude Mythos: The AI Benchmark Breaker That Won't Be Released

Anthropic unveiled Claude Mythos Preview on April 7, 2026 - and immediately announced that it won’t be publicly available.

The reason? It’s too dangerous.

Mythos is billed as the most powerful AI model yet, with double-digit leads over competitors on many benchmarks. Even so, access is restricted to just 12 major technology and finance companies, which use it for defensive cybersecurity work through Project Glasswing. But before diving into why, let’s look at what makes Mythos so exceptionally capable.

Record-Breaking Benchmarks

Claude Mythos demolished previous performance records across coding, mathematics, and reasoning tasks:

Coding: The Benchmarks It Dominated

SWE-bench Verified: 93.9% (13.1 points ahead of Opus 4.6’s 80.8%)
SWE-bench Pro: 77.8% (20.1 points ahead of GPT-5.4)
SWE-bench Multimodal: 59.0% (more than double Opus 4.6’s 27.1%)
Terminal-Bench 2.0: 82.0% standard, 92.1% with extended timeout

These aren’t marginal improvements - they’re dramatic leaps that establish Mythos as the clear leader for software engineering tasks.

Mathematics: Where It Truly Excels

USAMO 2026: 97.6% (a staggering 55.3-point jump over Opus 4.6’s 42.3%)
GPQA Diamond: 94.5%

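The USAMO improvement is particularly striking - a 55-point gap over Opus 4.6 suggests a step change in mathematical reasoning relative to Anthropic’s previous model. The margin over GPT-5.4, which scored 95.2% on the same test, is far narrower at 2.4 points.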

Reasoning and Agentic Tasks

HLE with tools: 64.7% (12.6 points above GPT-5.4)
GraphWalks BFS (million-token contexts): 80.0%
CharXiv Reasoning: 93.2% with tools

Head-to-Head: Mythos vs. the Competition

The benchmark data tells a consistent story: Mythos beats GPT-5.4 on every shared benchmark and leads Opus 4.6 on “nearly every benchmark.”

Benchmark	Mythos	Opus 4.6	GPT-5.4	Mythos Lead
SWE-bench Verified	93.9%	80.8%	-	+13.1 pts vs Opus 4.6
SWE-bench Pro	77.8%	-	57.7%	+20.1 pts vs GPT-5.4
USAMO 2026	97.6%	42.3%	95.2%	+55.3 pts vs Opus 4.6, +2.4 pts vs GPT-5.4
GPQA Diamond	94.5%	-	-	-
Terminal-Bench 2.0	82.0%	-	-	-
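
To make the lead column easy to check, here is a minimal Python sketch that recomputes the point gaps from the scores in the table above. The names SCORES, leads, and MYTHOS_LEADS are illustrative choices for this example, and None marks a score the article does not report.

```python
# Scores in percent, copied from the table above.
# None marks a score that the article does not report.
SCORES = {
    "SWE-bench Verified": {"Mythos": 93.9, "Opus 4.6": 80.8, "GPT-5.4": None},
    "SWE-bench Pro": {"Mythos": 77.8, "Opus 4.6": None, "GPT-5.4": 57.7},
    "USAMO 2026": {"Mythos": 97.6, "Opus 4.6": 42.3, "GPT-5.4": 95.2},
}


def leads(scores, model="Mythos"):
    """Return {benchmark: {rival: lead in points}} for rivals with a reported score."""
    result = {}
    for bench, row in scores.items():
        base = row[model]
        result[bench] = {
            rival: round(base - value, 1)  # round away floating-point noise
            for rival, value in row.items()
            if rival != model and value is not None
        }
    return result


MYTHOS_LEADS = leads(SCORES)
for bench, gaps in MYTHOS_LEADS.items():
    print(bench, gaps)
```

Note that the size of the USAMO lead depends heavily on which rival is used as the baseline.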

In absolute terms, Mythos hasn’t just pushed the frontier - it’s redefined where the frontier is.

The Memorization Question

A natural question: could Mythos simply be better at memorizing its training data? Anthropic addressed this with “extensive memorization screening,” filtering out questions flagged as potentially contaminated and testing models on novel “remix versions” of the original questions. The result: Mythos maintained its lead at every level and even scored higher on the remixed questions than on the originals.

This suggests genuine capability gains, not data leakage.
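
The logic of a remix check can be sketched in a few lines of Python. This is only an illustration of the general idea, not Anthropic’s screening method: the function names are hypothetical, and the pass/fail outcomes are invented placeholders rather than real results.

```python
def accuracy(results):
    """Fraction of questions answered correctly (results is a list of booleans)."""
    return sum(results) / len(results)


def memorization_gap(original_results, remix_results):
    """Accuracy on originals minus accuracy on remixes, in percentage points.

    A large positive gap hints at memorization. A gap near zero, or a
    negative one, suggests the skill transfers to unseen variants.
    """
    gap = accuracy(original_results) - accuracy(remix_results)
    return round(100 * gap, 1)


# Invented placeholder outcomes for illustration only, NOT Anthropic's data.
original = [True, True, False, True, True, False, True, True, True, False]  # 7 of 10
remix = [True, True, True, True, False, True, True, True, False, True]      # 8 of 10

GAP = memorization_gap(original, remix)
print(GAP)  # negative here: the placeholder model does better on the remixes
```

A negative gap corresponds to the pattern the article describes, where scores on remixed questions are higher than on the originals.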

Why You Can’t Use It: Project Glasswing

Despite leading on nearly every benchmark, Mythos remains unavailable to the broader public. Instead, Anthropic partnered with 12 major technology and finance companies - including Amazon, Apple, Google, Microsoft, and Nvidia - through Project Glasswing.

The mission: use Mythos exclusively for defensive cybersecurity research, backed by $100M in usage credits from Anthropic.

The Cybersecurity Reality

Why the restriction? Mythos autonomously discovered thousands of zero-day vulnerabilities, spanning:

Every major operating system
Every major web browser
OpenBSD, where one bug had gone undetected for 27 years

An AI this capable at finding exploits isn’t something you release to the internet and hope for the best.

What This Means for the AI Industry

Claude Mythos represents a shift in how frontier AI models are deployed. The reasoning:

Raw capability matters less than control: A model 20 points ahead on benchmarks but publicly available can create more risk than a model 55 points ahead but tightly controlled.
Security is becoming a deployment constraint: Just as some biotech research remains restricted, the most powerful AI systems may need access controls built into their business model.
Benchmarks alone don’t tell the story: Mythos shows that a model can achieve dominant performance and still come with legitimate reasons to restrict access.

Looking Forward

Claude Mythos Preview shows that Anthropic can build models that significantly outpace the competition - while being transparent about the risks. Whether Project Glasswing proves that restricted deployment can work at scale, or turns out to be a temporary measure before a public release, remains to be seen.

What’s clear: we’ve entered an era where “the best model” and “the publicly available model” may be fundamentally different things.
