Same-Day Showdown: Claude Opus 4.6 vs GPT-5.3-Codex Simultaneous Launch

Anthropic's Opus 4.6 and OpenAI's GPT-5.3-Codex launched simultaneously on Feb 5. The Terminal-Bench top score changed hands within 30 minutes — a full breakdown of the AI coding agent war.

On February 5, 2026, the two titans of the AI industry, Anthropic and OpenAI, unveiled their next-generation models on the same day: Anthropic released Opus 4.6 while OpenAI launched GPT-5.3-Codex, setting up a direct head-to-head fight for dominance in the AI agent era.

1. Claude Opus 4.6 — A Premium General-Purpose Model That 'Thinks Longer and Deeper'

Claude Opus 4.6 official benchmark card (Source: Anthropic)
Claude Opus 4.6 announcement image (Source: Windows Report)

Opus 4.6 builds on its predecessor Opus 4.5 with significantly stronger coding capabilities while delivering dominant performance on general knowledge tasks. It is the first Opus model to support a 1-million-token context window (beta), and maximum output doubles to 128K tokens.

The most notable change is 'Adaptive Thinking.' Previously, extended thinking was a binary on/off choice, but now the model autonomously determines whether to think deeply or respond quickly based on context. Additionally, reasoning effort levels (low/medium/high/max) allow developers to fine-tune the balance between intelligence, speed, and cost.
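The article doesn't show what that knob looks like in code, so here is a minimal sketch of selecting an effort level through Anthropic's Python SDK. The model ID and the `effort` field's name and placement are assumptions for illustration, not confirmed API surface:

```python
# Sketch: requesting a specific reasoning-effort level from Opus 4.6.
# ASSUMPTIONS: the model ID "claude-opus-4-6" and the "effort" body
# field are illustrative guesses; check Anthropic's docs for the
# shipped parameter name and accepted values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",          # hypothetical model ID
    max_tokens=4096,
    messages=[{"role": "user", "content": "Review this diff for regressions."}],
    extra_body={"effort": "max"},     # assumed values: low/medium/high/max
)
print(response.content[0].text)
```

The appeal of a single knob is that the same request can be dialed down for cheap bulk work or up to max for leaderboard-style runs without changing anything else.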

Benchmark results are impressive. It set the top score on Terminal-Bench 2.0 (an AI coding agent evaluation), and ranked first among all models in composite reasoning tests. In real-world knowledge task evaluations, it led OpenAI's GPT-5.2 by approximately 144 Elo points. In long-context retrieval, it scored 76%, vastly outperforming Sonnet 4.5's 18.5%.

Claude Code added an 'Agent Teams' feature in preview, enabling multiple agents to collaborate in parallel on large-scale tasks like code reviews. Pricing is unchanged from Opus 4.5: $5 per million input tokens and $25 per million output tokens.
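At those rates, per-request cost is simple linear arithmetic, which makes large agentic sessions easy to budget. A quick sketch (the token counts are invented for illustration):

```python
# Cost of a single Opus 4.6 request at the published rates:
# $5 per million input tokens, $25 per million output tokens.
INPUT_RATE = 5.00 / 1_000_000    # USD per input token
OUTPUT_RATE = 25.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a long agentic session reading 400K tokens and writing 60K.
print(f"${request_cost(400_000, 60_000):.2f}")  # -> $3.50
```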

2. GPT-5.3-Codex — The Coding Agent That 'Built Itself'

GPT-5.3-Codex official announcement image (Source: OpenAI)

On the same day, OpenAI released GPT-5.3-Codex, billing it strikingly as 'the first model to participate in its own training.' OpenAI revealed that early versions of GPT-5.3-Codex were used to debug its own training runs, manage deployments, and diagnose test results.

Performance-wise, GPT-5.3-Codex achieved an industry-best 56.8% on SWE-Bench Pro (a real-world software engineering evaluation) and 77.3% on Terminal-Bench 2.0. It scored 64.7% on OSWorld-Verified (productivity tasks in visual desktop environments), a major leap from GPT-5.2's 37.9%.

OpenAI also emphasized versatility beyond coding. The model supports presentation creation, spreadsheet analysis, and data-processing pipeline construction, showing GPT-5.2-level performance across 44 professional knowledge task evaluations. A 25% speed improvement over its predecessor rounds out the picture.

3. Counterfire 30 Minutes Later — The Timing War

GPT-5.3-Codex benchmark results (Source: OpenAI)

After Anthropic unveiled Opus 4.6, OpenAI responded with GPT-5.3-Codex just 30 minutes later, timing too precise to look coincidental. Industry observers believe OpenAI anticipated Anthropic's announcement and prepared a counter-launch. In fact, OpenAI's Codex app had already shipped on February 2; GPT-5.3-Codex arrived as a model upgrade on top of it.

The most dramatic moment unfolded on the Terminal-Bench 2.0 leaderboard. The previous #1 was Factory's Droid + GPT-5.2 combination at 64.9%. Opus 4.6 claimed the top spot at 65.4% (max reasoning), edging ahead by just 0.5 percentage points. But merely 30 minutes later, GPT-5.3-Codex posted 77.3%, overtaking it by roughly 12 percentage points. Opus 4.6's record lasted only 30 minutes. However, since both companies used different execution environments (harnesses), direct model-to-model performance comparison has its limitations.

4. Head-to-Head Benchmark Comparison

Comparing the figures the two companies released, Terminal-Bench 2.0 stands at Opus 4.6's 65.4% versus GPT-5.3-Codex's 77.3%. Interestingly, the two chose different SWE-Bench variants: Anthropic reported Verified (80.84%) and Multilingual (77.83%), while OpenAI highlighted Pro (56.8%). The current SWE-Bench Pro leaderboard has Opus 4.5 at #1 with 45.89%, so if Opus 4.6's Pro score is released, the landscape could shift again. Each company strategically showcased the evaluation criteria most favorable to its own model.

In real-world knowledge task evaluations, Opus 4.6 led GPT-5.2 by approximately 144 Elo points, while GPT-5.3-Codex achieved a 70.9% win rate in the same evaluation, maintaining GPT-5.2-level performance. In desktop-environment productivity tasks, GPT-5.3-Codex dominated at 64.7%. In cybersecurity CTF evaluations, GPT-5.3-Codex scored 77.6%, while Anthropic described Opus 4.6's performance only as 'industry-leading' without giving an exact figure.
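Elo gaps and win rates are two units for the same kind of claim, and the standard Elo logistic converts one into the other, which helps put the 144-point figure in perspective:

```python
# Expected win probability implied by an Elo gap (standard Elo logistic).
def elo_win_prob(delta_elo: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-delta_elo / 400.0))

# A 144-point Elo lead implies ~69.6% expected wins against the trailing model.
print(f"{elo_win_prob(144):.1%}")  # -> 69.6%
```

In other words, a 144-point lead is roughly equivalent to winning seven out of ten head-to-head comparisons, the same ballpark as the win-rate figures quoted in this section.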

In long-context tasks, Opus 4.6 held a clear advantage. It scored 76% in retrieving key information from 1 million token-scale documents, vastly outperforming Sonnet 4.5's 18.5%. Opus 4.6 supports 1 million token context (beta) and 128K output tokens, while GPT-5.3-Codex opted for context compression to manage long sessions.

Closing: What Lies Ahead?

As it happens, the simultaneous launch came right on the heels of a war of words between the two companies over Super Bowl advertising: the same firms trading barbs on social media one day released their next-generation models at the same moment the next. Their fierce competition will remain the biggest spectacle in the AI industry.

Meanwhile, it is curious that GPT-5.3-Codex is explicitly described as being built on GPT-5.2. Given the '5.3' numbering, some interpret it as a coding-agent-specialized fine-tune rather than a fundamental architectural change.
