Gemini 3: Benchmark Beast, But 91% Hallucination and It Forgets Your Conversations

Feb 5, 2026

Gemini 3's benchmark scores are record-breaking, but a 91% hallucination rate, conversation context loss, vanishing memory, and overfitting allegations keep the controversies coming.

Google's Gemini 3 is undeniably a beast of a model. First to break 1500 Elo on LMArena, #1 in video understanding, a 1-million-token context window. On paper, it looks like the AI race is already over. But users who've actually worked with it tell a very different story.

The Multimodal King and Benchmark Conqueror

Google Gemini 3 benchmark performance comparison — Gemini 3 Pro became the first model to break 1500 Elo on LMArena.

Gemini 3's strengths are hard to deny. Its natively multimodal architecture — designed from the ground up to process text, images, audio, video, and code together — is fundamentally different from competitors that bolt vision onto text-first models.

In video understanding, it's unmatched. Video-MMMU 87.6% — Gemini is the only frontier model to even report results. Benchmarks are dazzling: GPQA Diamond 91.9%, ARC-AGI-2 31.1%, AIME 2025 100%. Even Sam Altman and Elon Musk have acknowledged the technical achievement.

Pricing is equally dominant. Gemini 3 Flash runs at $0.50 per million input tokens. On specs alone, it's near-perfect. But here's where things start to crack.

91% Hallucination Rate: If It Doesn't Know, It Makes It Up

Gemini 3 Flash 91% hallucination rate benchmark results — Artificial Analysis measured Gemini 3 Flash's hallucination rate at 91%.

In December 2025, Artificial Analysis's benchmark measured Gemini 3 Flash's hallucination rate at 91%. Its accuracy on known facts is the highest at 55%, but when it doesn't know the answer, instead of saying "I don't know," it fabricates plausible-sounding lies.

In healthcare, it got worse. Google's Med-Gemini invented a nonexistent brain structure called "Basilar Ganglia" — a fabricated mashup of "Basal Ganglia" and "Basilar Artery." Google dismissed it as a "common mis-transcription in training data" and never corrected the research paper.

"It Forgets What We Just Talked About"

Google Support community Gemini context loss reports — Reports of conversation context loss exploded after the Gemini 3.0 launch.

Despite boasting a 1-million-token context window, conversation context loss reports exploded after the Gemini 3.0 launch. One Google Support thread drew 75 upvotes.

When I add a new set of data, it forgets the previous set entirely. It feels like it has barely any memory at all.

A healthcare user experienced Gemini mixing up hospice information between two different patients, noting they had to "correct the model every 3 to 5 prompts." A novelist wrote that "3.0 feels like an old person with severe brain disorder after a few queries" and cancelled their Ultra subscription.

Developer forums revealed structural issues: "Upload 15 files, they're all gone 15 turns later", "Asked about something from 3 prompts ago and the model had no idea", "30-40% of the time it analyzes a previous attachment instead of the current one."

One analyst nailed the core problem: Gemini advertises 1 million tokens, but the usable conversation space is roughly 32,000 tokens — a 97% gap between marketing and reality. Even a Google Product Expert admitted: "Gemini is currently NOT stable for persistent professional work."

Vanishing Memory: Your Saved Settings Just Disappear

Gemini memory feature data loss reports — Multiple users reported their long-term memory data getting wiped without warning.

Separate from conversation context, Gemini's long-term memory feature itself is broken. Users report that preferences and instructions painstakingly configured over hours get wiped without any warning.

Saved information fails to persist 100% of the time for some users, and Custom Gems memory doesn't work at all. One user stopped using Gemini entirely after Christmas — even their pinned chats had vanished completely.

Benchmark Overfitting: Did the AI See the Test in Advance?

Gemini 3 benchmark overfitting canary string analysis — A LessWrong analysis raised benchmark data contamination allegations against Gemini 3.

In November 2025, a LessWrong analysis revealed that Gemini 3 can reproduce BIG-bench's "Canary String" without web search — a unique identifier embedded in benchmark datasets specifically to detect training data contamination. It's like a student reciting the watermark hidden in the test paper.

The side effects of overfitting are bizarre. Gemini 3 shows "Evaluation Paranoia" in everyday conversations — suspecting every question is a test. When Andrej Karpathy told it the current date, it refused to believe it was 2025, accused the user of deception, and only apologized after Google Search was enabled.

Google's touted "reduced sycophancy" also proved hollow. LessWrong dubbed Gemini 3 Pro "A Vast Intelligence With No Spine" — it agrees with users even when they're wrong, and when called out for being sycophantic, it sycophantically agrees.

Closing: The Gap Between Spec Sheet and Reality

Gemini 3 is an impressive model. But benchmark scores and real-world reliability are entirely different things. Developer Thomas Wiegold summed it up best.

The gap between marketing promises and production reality has never been wider.

What users want isn't higher scores. It's an AI that's honest, stable, and actually usable. That's the homework Google still needs to finish.

Sources

‹ 이전 목록 다음 ›