Claude Fable 5: SOTA on Nearly Every Benchmark

Editor J

Anthropic launched Claude Fable 5, its first public Mythos-class model. It tops SWE-bench Verified at 95% and sets SOTA on nearly every benchmark.

Anthropic launched Claude Fable 5 on June 9, marking the first general release of its Mythos-class AI models—a tier previously withheld from the public due to safety concerns. The model is priced at $10 per million input tokens and $50 per million output tokens, representing less than half the cost of Mythos Preview.
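To put the quoted prices in concrete terms, the short sketch below estimates the cost of a single request at $10 per million input tokens and $50 per million output tokens. The function name and the example workload are hypothetical, chosen only for illustration, and the calculation assumes plain list pricing with no discounts, caching, or batch rates, none of which the announcement as reported here describes.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted list prices.

    Assumption: simple linear pricing, with no discounts or caching.
    """
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
example_cost = estimate_cost_usd(200_000, 20_000)
print(f"${example_cost:.2f}")  # prints "$3.00"
```

At these rates an output token costs five times as much as an input token, so output-heavy workloads dominate the bill.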

In its official announcement, the company claimed that Fable 5 outperforms all of its previously released public models and achieves state-of-the-art results on nearly all evaluated AI benchmarks. Meanwhile, Claude Mythos 5—the same underlying model but with cybersecurity safeguards removed—is being distributed separately to Project Glasswing partners in collaboration with the US government.

Fable 5 Tops SWE-bench Verified at 95%, Outpacing Runner-Up by 6.4 Percentage Points

Anthropic's claims were confirmed on external AI benchmark leaderboards on launch day. On the independent coding evaluation leaderboard maintained by vals.ai, Fable 5 topped the SWE-bench Verified category with a score of 95.0%, outperforming Claude Opus 4.8 (88.6%) by 6.4 percentage points and GPT-5.5 (82.6%) by 12.4 percentage points. It also leads the harder SWE-bench Pro variant at 80.3%.
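The margins above are absolute gaps in percentage points, not relative improvements. As a rough illustration, the sketch below recomputes them from the reported scores and adds one possible alternative reading: the relative reduction in unsolved tasks. That second metric is not something the leaderboard or Anthropic reports; it is an assumption introduced here for comparison, and the function names are hypothetical.

```python
# SWE-bench Verified scores as reported in the article (percent).
scores = {
    "Claude Fable 5": 95.0,
    "Claude Opus 4.8": 88.6,
    "GPT-5.5": 82.6,
}


def lead_in_points(leader: str, other: str) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round(scores[leader] - scores[other], 1)


def relative_failure_reduction(leader: str, other: str) -> float:
    """Relative drop in the unsolved share (100 minus score).

    Assumption: an illustrative metric, not one used by the leaderboard.
    """
    other_unsolved = 100.0 - scores[other]
    leader_unsolved = 100.0 - scores[leader]
    return (other_unsolved - leader_unsolved) / other_unsolved


print(lead_in_points("Claude Fable 5", "Claude Opus 4.8"))  # 6.4
print(lead_in_points("Claude Fable 5", "GPT-5.5"))          # 12.4
print(f"{relative_failure_reduction('Claude Fable 5', 'Claude Opus 4.8'):.0%}")  # 56%
```

On this reading, a 6.4-point lead near the top of the scale corresponds to leaving roughly half as many tasks unsolved as the runner-up.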

[Figure: Anthropic's official benchmark score comparison table, lining up Fable 5/Mythos 5 against Opus 4.8, GPT-5.5, and Gemini 3.1 Pro]

Early enterprise deployments are consistent with these benchmark results. Stripe, which participated in pre-release testing, reported that Fable 5 completed a migration across a 50-million-line Ruby codebase in a single day, a project that would typically take a full engineering team more than two months of manual effort. Stripe described the result as compressing months of engineering work into days.

Beyond SWE-bench Verified, Fable 5 achieved the highest score among frontier models on Cognition's FrontierCode evaluation. Coding assistant startup Cursor also reported state-of-the-art results on its internal benchmark, stating that a class of complex, long-horizon problems previously considered intractable has begun to be resolved.

Record Performance Extends to Non-Coding Benchmarks

Performance improvements were not limited to software engineering. On Hebbia's financial benchmark, which evaluates senior-level reasoning, Fable 5 recorded the highest score of any model to date, demonstrating significant progress in document-based analysis and the interpretation of charts and tables. The model also established new performance standards for computer vision.

To demonstrate its vision capabilities, researchers tested the model on the game 'Pokémon FireRed'. While previous iterations of Claude were unable to complete the game even when assisted by custom software harnesses, Fable 5 completed the entire game using only raw screen captures as input.

Anthropic's evaluation data also indicated improved endurance on long-horizon tasks. In tests using the game 'Slay the Spire', adding file-based memory improved Fable 5's performance by three times the margin it added for Opus 4.8. Furthermore, a physics research partner reported that Fable 5 reached in 36 hours a milestone that took GPT-5.5 four days (96 hours), while consuming only a third of the reasoning tokens.

Elite Benchmark Scores Tempered by User Backlash

Despite the record SWE-bench Verified score and the broader benchmark sweep, initial user sentiment across AI developer communities has been largely critical. Users have expressed frustration over a safety routing system that redirects cybersecurity and biology queries to the older Opus 4.8 model without explicit notification, as well as a pricing policy that will exclude Fable 5 from standard subscription tiers starting June 23. Discussions on platforms such as Reddit and Hacker News have described the subscription terms as misleading.

[Figure: Anthropic's offensive cyber security evaluation bar chart. With safeguards on, Fable 5 (orange) stays near a 0% success rate across the board.]

These policy decisions were analyzed in detail in a separate report on the launch-day backlash. While the model's technical capabilities are widely acknowledged, Anthropic's routing and subscription changes have created friction with its user base, raising questions about how consumer sentiment will evolve in the coming weeks.
