"Smart Only at Launch?" Suspicions of Intentional AI Model Performance Degradation

Editor J
"Smart Only at Launch?" Suspicions of Intentional AI Model Performance Degradation

In March 2026, simultaneous allegations of intentional performance degradation emerged for both GPT-5.4 and Gemini 3.1. Gemini allegedly limits reasoning effort to 0.5 via hidden system prompts, while GPT-5.4's pass rate dropped 8 percentage points within 4 days of launch. The real question isn't whether these are rumors or facts — it's why users weren't informed if performance changed.

In March 2026, simultaneous allegations of AI performance degradation emerged around the industry's two biggest models. Google's Gemini 3.1 allegedly limits reasoning effort to 0.5 through a hidden system prompt, while OpenAI's GPT-5.4 saw its pass rate drop from 58% to 50% within just four days of launch. The consistency of these patterns makes them hard to dismiss as mere rumors.

Intentional AI performance degradation, known as 'silent nerf,' has been a long-standing suspicion in the industry. But this time, concrete data and independent verification are turning what was once conspiracy theory into a legitimate consumer rights issue.

Gemini 3.1 Allegations: Instructed to Think Less

[Screenshot: @chetaslua's tweet exposing a hidden system prompt in Gemini that sets reasoning effort to 0.5]

The controversy began when @chetaslua disclosed that a hidden system prompt in Gemini 3.1 sets the reasoning effort level to 0.5. The setting was applied consistently to Pro models and Custom Gems; only Canvas mode was exempt.

Initially, many dismissed this as AI hallucination. But when multiple users tested independently and observed the same pattern, the allegation gained credibility. A related bug had already been confirmed in Gemini 3.0 Pro, where reasoning in non-Canvas modes produced only "1-2 extremely short paragraphs."

Google's Gemini thinking level system is divided into LOW (1K tokens), MEDIUM (8K tokens), and HIGH (24K tokens). The core allegation is that even when users select HIGH in the app, the model actually operates at a lower level. Google has not issued any official response.
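
To make the allegation concrete, here is a minimal sketch of how a hidden effort multiplier would interact with the app's thinking tiers. The tier budgets are the figures reported above; the function and the multiplicative semantics are assumptions, since Google has not documented any such mechanism.

```python
# Hypothetical sketch only: the tier budgets are the figures reported above;
# the idea of a multiplicative "reasoning effort" cap is the allegation,
# not a documented Google API.
THINKING_BUDGETS = {"LOW": 1_000, "MEDIUM": 8_000, "HIGH": 24_000}

def effective_budget(user_level: str, hidden_effort: float = 1.0) -> int:
    """Thinking-token budget actually granted if a hidden multiplier applies."""
    return int(THINKING_BUDGETS[user_level] * hidden_effort)

# The user selects HIGH (24K tokens), but a hidden prompt pins effort to 0.5:
print(effective_budget("HIGH", hidden_effort=0.5))  # 12000 -- far below the 24K tier
```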

GPT-5.4 Allegations: Brilliant at Launch, Declining After

[Graph: MarginLab Codex Historical Performance tracker showing GPT-5.4-xhigh's pass rate dropping from 58% to 50% in four days]

Similar degradation patterns were detected in GPT-5.4. According to MarginLab's Codex Historical Performance tracker, GPT-5.4-xhigh's pass rate dropped from 58.0% (95% CI: 44.2%-70.6%) on March 6 to 50.0% (95% CI: 36.6%-63.4%) on March 9, an 8-percentage-point decline in just four days. The heavily overlapping confidence intervals mean the drop is not statistically significant on its own, but the direction of the trend is what raises concern.
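
For context, the reported interval widths are consistent with a benchmark of roughly 50 tasks (29/50 = 58.0%, 25/50 = 50.0%); that sample size is inferred from the published CIs, not stated by MarginLab. Under that assumption, the standard Wilson score interval reproduces the reported figures exactly, and shows why the drop alone is not significant:

```python
import math

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# Assuming ~50 tasks: 29/50 = 58.0% and 25/50 = 50.0% reproduce the reported CIs.
print(wilson_ci(29, 50))  # ~ (0.442, 0.706)
print(wilson_ci(25, 50))  # ~ (0.366, 0.634)
# The intervals overlap heavily, so the 8-point drop is not statistically
# significant by itself -- which is exactly what the tracker's CIs show.
```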

Even more alarming is the exposure of OpenAI's "juice" system. According to @chetaslua, "juice" is the internal name for reasoning effort, and it is allocated by subscription tier: API users get 200, Pro ($200/month) subscribers get 128, Plus subscribers get 64, and free users get even less.

Around the time Codex's free launch attracted 200,000 new users, allegations emerged that reasoning effort had been halved across all subscription tiers. Suspicions that xhigh requests on GPT-5.2 were being routed to Codex were also discussed in GitHub Issue #10438.
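
For illustration only, here is what those claims would look like as configuration. Nothing below is a confirmed OpenAI mechanism; the tier values are @chetaslua's alleged figures, and the free-tier value was never specified, so it is omitted.

```python
# Alleged per-tier "juice" (reasoning-effort) values, per @chetaslua's claims;
# unconfirmed by OpenAI. The free tier was only described as "even less",
# so no number is invented for it here.
JUICE_BY_TIER = {"api": 200, "pro": 128, "plus": 64}

def after_alleged_halving(tiers: dict[str, int]) -> dict[str, int]:
    """What the claimed across-the-board halving would look like."""
    return {tier: juice // 2 for tier, juice in tiers.items()}

print(after_alleged_halving(JUICE_BY_TIER))  # {'api': 100, 'pro': 64, 'plus': 32}
```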

A Repeating Pattern: This Has Happened Before

This isn't the first time; AI model performance degradation allegations have followed a repeating pattern. In the 2023 GPT-4 "lazy" incident, unintended performance degradation appeared after the November update, which OpenAI officially acknowledged. That same year, a joint Stanford and UC Berkeley study documented that GPT-4's accuracy on a prime-number task plummeted from 97.6% to 2.4% between its March and June versions, while the share of directly executable generated code dropped from 52% to 10%.

In 2025, a GPT-4o update degraded quality to the point that OpenAI rolled the model back and acknowledged excessive sycophancy. Anthropic likewise hit three infrastructure bugs in 2025 that degraded Claude's performance, but distinguished itself as the only company to publish official postmortems.

The pattern is clear: praise floods in after launch, quiet changes follow, community backlash erupts, and then (sometimes) acknowledgment comes. The problem is that users receive virtually no prior notice throughout this cycle.

The Economics of AI Reasoning: Why Degrade Performance?

Behind these performance degradation allegations lies economic logic. In AI inference, reasoning tokens are billed as output tokens, so reducing reasoning directly cuts server costs. For companies, gradually reducing reasoning at levels users can't perceive becomes a tempting cost optimization strategy.
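
A back-of-the-envelope sketch makes the incentive concrete: because reasoning tokens are generated and billed like output tokens, the hidden chain of thought can dominate the cost of a response. The $10-per-million-token rate and the token counts below are assumptions chosen for illustration; real figures vary by model and vendor.

```python
# Assumed rate for illustration only; real per-token prices vary widely.
PRICE_PER_MTOK = 10.00  # dollars per million output-class tokens

def response_cost(reasoning_tokens: int, visible_tokens: int) -> float:
    """Reasoning tokens are billed like output tokens, so they add directly to cost."""
    return (reasoning_tokens + visible_tokens) * PRICE_PER_MTOK / 1_000_000

full = response_cost(reasoning_tokens=24_000, visible_tokens=1_000)    # $0.25
halved = response_cost(reasoning_tokens=12_000, visible_tokens=1_000)  # $0.13
print(f"{1 - halved / full:.0%} saved per response")  # 48% saved
```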

Multiple technical methods exist for quietly trimming inference cost: quantization (reducing numerical precision, e.g. from FP32 to INT8), knowledge distillation (compressing a large model into a smaller one), MoE (Mixture of Experts, activating only a subset of expert subnetworks per request), and model routing (sending requests to different models behind the same endpoint).
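
As a minimal sketch of the first of these techniques, the snippet below quantizes one toy FP32 weight tensor to INT8 with numpy: a 4x memory reduction at the cost of a small rounding error. It illustrates the idea only, not any vendor's production pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # toy FP32 weight tensor

# Symmetric quantization: map the largest-magnitude weight to +/-127.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # 4x smaller
dequantized = q.astype(np.float32) * scale  # approximate reconstruction

print(np.abs(weights - dequantized).max())  # small but nonzero rounding error
```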

According to a16z, LLM inference costs are falling 10x per year for equivalent performance. But total costs are actually increasing due to user growth. This is the core of the 'silent nerf' theory: managing costs through small, unannounced adjustments to models.

Consumer Rights and AI Transparency: The Right to Know

In the SaaS industry, performance guarantees through SLAs (Service Level Agreements) are standard. Cloud services refund credits when they miss promised uptime, and prior notice is mandatory for feature changes. Yet in AI services, "prior notice for model changes" has not been standardized.

While UDAP (Unfair and Deceptive Acts or Practices) laws could apply, consumer protection regulations specific to AI are lacking globally. If paid subscribers receive lower performance at the same price, that could constitute consumer deception; it is especially problematic if ChatGPT Pro ($200/month) and Plus ($20/month) users now receive a lower service level than when they first subscribed.

Anthropic stands alone in publishing postmortems when performance degrades: it transparently disclosed root-cause analyses and resolution processes for the three 2025 Claude infrastructure bugs. That level of transparency should be demanded of every AI company.

If Performance Changed, Users Deserve to Know

AI model performance is not a "fixed product" but a "fluid service." Unlike traditional software, AI models can be changed server-side at any time, and users have difficulty detecting these changes. Whether the GPT-5.4 and Gemini 3.1 performance degradation allegations are rumors or fact has not yet been determined.

But there's a more important question: if performance changed, why weren't users informed? Users have the right to know about quality changes in services they pay for. What the entire industry needs is clear: transparency about performance changes, independent monitoring systems, and consumer notification obligations. As AI becomes more deeply integrated into our daily lives and work, this is an issue that can no longer be postponed.
