Gemini 3 Deep Think Sweeps Benchmarks, but Users Remain Unimpressed

Google's Gemini 3 Deep Think set new benchmark highs on ARC-AGI-2 and Codeforces, but users remain cold. Three months of hallucination backlash, from VICE's reported 91% rate to Reddit's 'hallucination machine' label, has not subsided despite Deep Think's arrival.

On November 18, 2025, Google launched Gemini 3 Pro and Flash, claiming the #1 spot on LMArena. But criticism over hallucination issues erupted almost immediately, and three months later, frustrations show no sign of fading. On February 12, 2026, Google DeepMind unveiled Gemini 3 Deep Think. It set new benchmark highs — ARC-AGI-2 at 84.6%, Codeforces Elo at 3455 — but users remain decidedly cold.

1. Deep Think Launch: New Benchmark Highs

Gemini Deep Think benchmark results (Source: Google DeepMind)

Deep Think is Gemini 3's specialized reasoning mode, optimized for scientific research and complex problem-solving. The launch also introduced Aletheia, an autonomous math proof agent that solved 13 previously unsolved Erdős problems. The benchmark scorecard is dazzling: ARC-AGI-2 at 84.6%, HLE at 48.4%, and Codeforces at 3455 Elo, all first place. Olympiad scores were equally striking, with 81.5% on the IMO and 87.7% on the IPhO.

Even compared to Claude Opus 4.6 and GPT-5.2, these numbers clearly lead in science and reasoning. Google also claimed first place on hallucination metrics — SimpleQA at 72.1% and FACTS Grounding at 70.5%. Meanwhile, speculation has spread that Deep Think is internally based on Gemini 3.1 Pro, triggered by the name 'Gemini 3.1 Pro Preview' appearing in a benchmark database. Google maintains the official name as 'Gemini 3 Deep Think.'

2. LessWrong's Benchmark Contamination Concerns: Distrust From Day One (Nov 20, 2025)

Skepticism toward Gemini 3 took root almost immediately. On November 20, 2025 — just two days after launch — the LessWrong community was already raising fundamental questions. The term 'evaluation paranoia' surfaced, sparking debate over whether AI benchmark scores could be trusted at all. The core issue: benchmark contamination. If training data includes benchmark problems, high scores may not reflect genuine reasoning ability.

This isn't a Gemini-only problem; it's structural across the AI industry. Three months later, the LessWrong community views Deep Think's impressive numbers through that same lens of distrust.

3. Zvi's Critique: 'Narrative-Building' Called Out the Very Next Day (Nov 21, 2025)

The very next day, on November 21, AI analyst Zvi Mowshowitz weighed in with a sharp assessment. His core criticism: Google is 'narrative-building at the expense of accuracy or completeness.' The company constructs narratives around impressive benchmark numbers while the fundamental problem of real-world reliability gets pushed to the background.

Within just two days of launch, both the expert community and an influential AI analyst had opened fire almost simultaneously. The critique — benchmark numbers don't match real-world quality — appears to have lost none of its relevance even now.

4. Reddit and Google Help: Months of 'Hallucination Machine' (Nov 2025 – Jan 2026)

AI model hallucination rate and reliability comparison (Source: Artificial Analysis)

Alongside these expert criticisms, everyday user frustrations spread rapidly. On Reddit and Google Help forums, complaints accumulated for months. A primary complaint was the 'telephone game effect' in Gemini's Thinking mode. As the model progresses through long reasoning chains, small errors accumulate at each step. The final output ends up disconnected from the original question.

Criticism ranged from 'Gemini is a hallucination machine' to 'it ignores instructions and answers however it wants' and 'it's supposedly #1 on benchmarks, so why does it give nonsense answers to my questions?' On r/Bard, 'I'm canceling my Ultra subscription because Gemini 3 Pro is sh*t' drew 207 upvotes. 'What's insane is how terrible Gemini is despite Google having [its resources]' was another recurring sentiment.

5. VICE's 91% Hallucination Report: The Decisive Blow (Dec 23, 2025)

After more than a month of accumulating frustration, the decisive blow arrived. On December 23, 2025, VICE independently tested the Gemini 3 Flash model and reported a 91% hallucination rate, starkly at odds with Google's own benchmark figures. VICE's methodology and the model it tested differ from Google's evaluations, but the shock of '91%' among users was significant.

Since that report, 'What does SimpleQA first place even mean?' has become a recurring reaction on Reddit and Hacker News. The prevailing sentiment: even if Deep Think excels at science and math, as long as hallucinations persist in everyday conversations, changing users' minds will be difficult.

6. Post-Launch Reality: User Experience Unchanged (Feb 2026)

GPT-5.2, Gemini 3, Claude benchmark comparison (Source: RD World Online)

So how are users who've actually tried Deep Think reacting? It's only been two days since launch, and access is limited to AI Ultra subscribers ($249.99/month), so reviews are scarce. But the early reactions echo the existing distrust. On r/Bard, a 'disappointed in Deepthink' post appeared on launch day. On r/GeminiAI: 'I was singing Gemini's praises last month and now I'm utterly disappointed.' Korean communities echoed the sentiment — on DC Gallery's Singularity board: 'The moment you step outside basics and move into papers or creative territory, the hallucinations get severe.'

One particularly telling critique: r/GoogleGeminiAI's 'The Hallucination Trap,' arguing that Deep Think simply takes longer without reducing hallucinations. 'I tried Claude and was instantly impressed' drew 36 upvotes. On Damoang, a Korean user shared: 'I came back to ChatGPT because of the hallucination explosions.' Gemini even hallucinated about its own features, claiming Deep Think was available on Pro and then admitting that claim was itself a hallucination. Cynicism about pricing also spread: 'Deep Think was really Gemini 3.1 Pro, just to charge an extra $250.'

Wrapping Up: Improvement Needed Beyond the Numbers

Gemini 3 Deep Think's benchmark performance is undeniably impressive. It topped ARC-AGI-2, HLE, and Codeforces, and Aletheia's mathematical research achievements were meaningful. Its lead in science and math is real.

But the hallucination controversy hasn't subsided despite Deep Think's arrival. Three months of distrust have accumulated: LessWrong's benchmark contamination concerns, Zvi's narrative critique, persistent Reddit and Google Help complaints, and VICE's 91% hallucination report. Early Deep Think adopters are reporting hallucinations in specialized domains and even about the model's own specifications. Users are sending a clear message: simply setting new benchmark highs isn't enough. Solving the hallucination problem requires an approach fundamentally different from the benchmark score race.
