Sakana Fugu: Sovereign AI or Borrowed SOTA?

Editor J
Sakana Fugu: Sovereign AI or Borrowed SOTA?

Sakana AI pitches Fugu as 'sovereign AI.' Critics argue its Fable 5-level benchmark measures borrowed closed-model performance, not Fugu's own capability.

On June 22, Tokyo-based Sakana AI launched Fugu, a model orchestration system built on a multi-agent design. The system uses a 7B-scale 'conductor' model to distribute tasks across a pool of external frontier models, such as GPT-5.5, Claude, and Gemini, then verifies and synthesizes the results. By calling a single OpenAI-compatible endpoint, users interact with it as if it were a single model.

Sakana has positioned Fugu as sovereign AI, citing the June 12 US export controls that blocked access to Anthropic's top-tier Fable 5 and Mythos models. The startup argued that relying on a single provider's API for critical infrastructure creates a severe vulnerability, and framed its swappable pool as the path to sovereign AI. Fugu aims to mitigate this by swapping models and routing around outages. However, the launch sparked skepticism among developers, who questioned the validity of benchmarks achieved using borrowed models—a debate that quickly turned into a benchmark dependency problem.

Calling a Router 'Sovereign'

This skepticism stems from a simple fact: Fugu's model orchestration architecture is not new. Selecting the optimal model for a specific task and fusing the outputs has become a standard industry pattern. Frameworks like LangGraph and CrewAI automate this workflow, while OpenRouter's Fusion mode executes multiple models in parallel to synthesize a single response. Although Sakana highlights Fugu's trained 7B conductor as a key differentiator, the core model orchestration approach remains identical to existing routers.

The ICLR 2026 papers cited by Sakana (Trinity and Conductor) are indeed peer-reviewed and academically sound. They describe a reinforcement-learning-trained conductor that dynamically assigns roles—Thinker, Worker, and Verifier—while optimizing prompts to ensure a diverse pool of models outperforms any single model. The underlying engineering is not the point of contention.

Rather, the criticism targets the sovereign AI label itself. Industry observers argue that wrapping leased intelligence in a routing layer does not constitute sovereign AI. Digital Applied noted that the system remains dependent on external intelligence, stating, 'Fugu's capability is defined by its pool, which consists of other companies' models accessed via APIs.' If multiple API providers restrict access simultaneously, the pool shrinks. True resilience, critics emphasize, stems from diversity rather than self-reliance.

Measuring Dependency, Not Capability

Diagram showing Sakana Fugu coordinating multiple language models from a swappable agent pool
A diagram of Fugu's model orchestration architecture

This routing architecture also shifts how the benchmarks should be interpreted. Sakana compared Fugu and Fugu Ultra across 11 benchmarks against public models, including Opus 4.8, Gemini 3.1 Pro, and GPT-5.5, as well as Fable 5 and Mythos—the very systems it cannot actually access. On SWE-Bench Pro, the core coding metric, the outcome is telling: Fugu Ultra scored 73.7, outperforming Opus 4.8 (69.2) and GPT-5.5 (58.6), yet the top performance remains Fable 5 at 80.0, which Fugu cannot use.

SWE-Bench Pro scores (Sakana self-reported; baselines provider-reported)
ModelScoreNote
Fable 580.0Anthropic · not in Fugu's pool (limited access)
Fugu Ultra73.7Sakana · flagship tier
Opus 4.869.2Anthropic · provider-reported
Fugu59.0Sakana · balanced tier
GPT-5.558.6OpenAI · provider-reported
Gemini 3.1 Pro54.2Google · provider-reported

This outcome underpins the core critique: Fugu Ultra's performance is essentially driven by the closed-source SOTA models within its pool. Thus, claiming Fugu matches Fable 5 is near-tautological: routing tasks to superior models yields superior outputs. Consequently, the figures reflect a benchmark dependency on proprietary systems rather than Fugu's intrinsic capability.

The inability to independently verify these results deepens industry skepticism. Sakana did not disclose the specific models comprising the pool or the ratio of open-source to proprietary software. Because every benchmark figure relies on self-reported data against provider-reported baselines, testing conditions vary. Oddly, on benchmarks like SciCode, the standard Fugu model outperforms Fugu Ultra, suggesting that increased orchestration complexity does not guarantee better performance.

The Empty Cell the Marketing Hid

Under this framework, a routing system must demonstrate a different metric: the performance gains achieved relative to the single strongest model in its pool, known as the orchestration uplift. This delta must be substantial to justify the conductor model's presence. However, Sakana did not disclose this specific comparison, choosing instead to emphasize performance comparisons with Fable 5, which Fugu cannot access. The most critical data point remains absent.

Practical hurdles complicate the system's adoption. At launch, Fugu is unavailable in the EU and EEA as Sakana works to resolve GDPR compliance. This restriction locks out European regulated industries and critical infrastructure operators—the precise market segments most attracted to a sovereign AI pitch, and exactly where a benchmark of true independence would matter most. Additionally, bundling and reselling access to proprietary third-party models via a unified endpoint operates in a contractual gray area of provider terms, introducing compliance risks for enterprise adopters.

Ultimately, the vendor dependency risk highlighted when export controls severed Fable 5 access is a legitimate threat, and Fugu's concept of diversifying this risk is practical. However, the system's reliance on leased closed-source models means its utility remains bound to the availability of those external systems. If the primary models are restricted, the entire pool degrades. Critics argue that sovereign AI, in this context, merely abstracts dependency by another layer. Until Sakana demonstrates the isolated performance contribution of its conductor model, Fugu's results read less like a capability score and more like a benchmark dependency on proprietary AI.

Menu