Claude Mythos benchmark performance

Claude Mythos: The AI Benchmark Breaker That Won't Be Released

TL;DR: Claude Mythos Preview set new records across coding, mathematics, and reasoning: 93.9% on SWE-bench Verified and 97.6% on USAMO 2026, and it leads GPT-5.4 on every shared benchmark. The USAMO result - a 55-point jump over Claude Opus 4.6 - suggests genuinely different reasoning capabilities, not just incremental improvement, and Anthropic screened against memorization concerns. Despite dominating benchmarks, Mythos is not publicly available because it autonomously discovered thousands of zero-day vulnerabilities across every major OS and browser. Access is restricted to 12 major tech and finance companies via Project Glasswing, a defensive cybersecurity research initiative backed by $100M in Anthropic usage credits. The wider implication: we have entered an era where “the best model” and “the publicly available model” may be permanently different things, with security becoming a deployment constraint alongside capability.

Anthropic released Claude Mythos Preview on April 7, 2026 - and immediately announced it won’t be publicly available. ...

April 8, 2026 · 4 min · James M
DeepSeek R1 - the AI model that shook the industry

DeepSeek 🤯

TL;DR: DeepSeek’s January 2025 release of R1 shook markets - a frontier-grade reasoning model trained for a reported $6M, a fraction of US lab budgets. The app shot to #1 on Apple’s App Store inside days, and the open weights forced an industry-wide rethink of what training really costs. Subsequent releases (V3 and beyond) cemented DeepSeek as a serious competitor in the open-source and cost-efficient AI category. The story is less “China caught up” and more “the cost floor moved”, with implications for closed-model pricing, GPU demand, and open-weight strategy. It is worth understanding as the moment that made cheap, capable, open models a credible default rather than a curiosity.

Overview

In January 2025, a Chinese AI lab most people had never heard of dropped a frontier-grade reasoning model for a reported $6 million and watched it hit the top of the Apple App Store inside days. DeepSeek R1 did not just impress researchers - it shook equity markets, forced a hard look at what US labs were actually spending their billions on, and made cheap, capable, open-weight models a credible default rather than an interesting curiosity. ...

January 27, 2025 · 2 min · James M