On April 16, 2026, Anthropic released Claude Opus 4.7. The press release reads as expected: "our most capable model", "substantially better at following instructions", "complex tasks with care and consistency". The independent benchmark results paint a more nuanced picture.
The Good Numbers
On paper, Opus 4.7 is a solid improvement over its predecessor Opus 4.6 -- especially in coding:
| Benchmark | Opus 4.7 | Opus 4.6 | Delta (pp) |
|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | +6.8 |
| SWE-bench Pro | 64.3% | 53.4% | +10.9 |
| Terminal-Bench 2.0 | 69.4% | 65.4% | +4.0 |
| MCP-Atlas | 77.3% | 62.7% | +14.6 |
| CharXiv-R (Vision) | 91.0% | 84.7% | +6.3 |
No question: anyone working daily with Claude Code will notice the difference on complex multi-file refactorings. The SWE-bench Pro jump of +10.9 points is impressive.
Where GPT-5.4 Wins
What's missing from Anthropic's announcement: OpenAI's GPT-5.4 leads in several categories.
| Benchmark | Opus 4.7 | GPT-5.4 |
|---|---|---|
| Terminal-Bench 2.0 | 69.4% | 75.1% |
| Humanity's Last Exam (with tools) | 54.7% | 58.7% |
| GPQA Diamond | 94.2% | 94.4% |
| BrowseComp (agentic search) | 79.3% | 84.0%+ |
Terminal-Bench 2.0 measures a model's ability to autonomously solve terminal-based tasks, precisely the use case Anthropic advertises for Claude Code. GPT-5.4's clear lead here, 75.1% vs 69.4%, matters for anyone running AI agents in the terminal.
GPT-5.4 also leads on GPQA Diamond (graduate-level reasoning) and Humanity's Last Exam, albeit narrowly. Gemini 3.1 Pro, meanwhile, takes the win on multilingual tasks.
The Regressions
Particularly revealing are the areas where Opus 4.7 is worse than its predecessor:
| Benchmark | Opus 4.7 | Opus 4.6 | Delta (pp) |
|---|---|---|---|
| BrowseComp | 79.3% | 84.0% | -4.7 |
| CyberGym | 73.1% | 73.8% | -0.7 |
BrowseComp measures agentic web search. A drop of 4.7 points is not noise; it is a measurable deterioration. On CyberGym (security tasks), Anthropic openly admits it "experimented with efforts to differentially reduce cyber capabilities". Heise reports that Opus 4.7 is "even slightly worse than its predecessor at reproducing security vulnerabilities". The intentional throttling may make sense from a safety perspective, but anyone paying API rates to run IT security audits now gets less capability than before.
The Hidden Price Increase
The price per token remains identical: $5 per million input tokens, $25 per million output tokens. Sounds fair.
What Anthropic communicates less prominently: the new tokenizer produces up to 1.35x as many tokens for the same text. In practice this means the same task with the same prompt can cost up to 35% more. Add to that the new xhigh effort level, which consumes even more reasoning tokens. For teams with high API volume, this is a de facto price increase.
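To put numbers on it, here is a back-of-the-envelope sketch. The request sizes are hypothetical, and I am assuming the 1.35x inflation applies to input and output tokens alike (worst case):

```python
# Cost comparison: same prompt, old vs. new tokenizer.
# Prices are the published list prices; 1.35x is the upper bound cited above.

INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token
TOKENIZER_INFLATION = 1.35        # up to 1.35x as many tokens for the same text

def request_cost(input_tokens: int, output_tokens: int, inflation: float = 1.0) -> float:
    """Cost of a single API call, optionally scaled by tokenizer inflation."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) * inflation

# Hypothetical Claude Code request: 20k input tokens, 4k output tokens.
old = request_cost(20_000, 4_000)                       # Opus 4.6 tokenizer
new = request_cost(20_000, 4_000, TOKENIZER_INFLATION)  # Opus 4.7, worst case

print(f"Opus 4.6: ${old:.4f}  Opus 4.7 (worst case): ${new:.4f}  (+{new / old - 1:.0%})")
# -> Opus 4.6: $0.2000  Opus 4.7 (worst case): $0.2700  (+35%)
```

Same list price, 35% more spend in the worst case, before the xhigh effort level even enters the picture.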
"Follows Instructions Literally" -- Blessing and Curse
Anthropic advertises that Opus 4.7 follows instructions "substantially better". In practice, this means: prompts that worked with Opus 4.6 can produce unexpected results. Bullet lists that earlier models treated as optional hints are now interpreted as hard requirements.
Anthropic itself warns: "Opus 4.7 follows instructions literally. Therefore, existing instructions should be reviewed." Anyone with an established system prompt setup gets to re-tune everything first.
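What that re-tuning can look like, as a minimal sketch with the Anthropic Python SDK; the model identifier and the prompt wording are my assumptions, not official guidance:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Opus 4.6 era: loose bullets, effectively treated as soft guidance.
SYSTEM_OLD = """You are a code reviewer.
- keep answers short
- mention tests
- answer in German
"""

# Opus 4.7: separate hard requirements from preferences explicitly,
# because bullet points are now followed to the letter.
SYSTEM_NEW = """You are a code reviewer.
Hard requirements:
- Answer in German.
Preferences (apply when sensible, not mandatory):
- Keep answers under 200 words.
- Mention missing tests if relevant.
"""

response = client.messages.create(
    model="claude-opus-4-7",  # assumed identifier; check the official model list
    max_tokens=1024,
    system=SYSTEM_NEW,
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
print(response.content[0].text)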
My Verdict
Opus 4.7 is a good coding model, probably the best for SWE-bench tasks. But it is not "the most capable model ever" across the board. GPT-5.4 beats it in terminal tasks, general reasoning, and agentic search. It has measurable regressions compared to its predecessor. And the combination of a new tokenizer and higher token consumption makes it effectively more expensive, without changing the list price.
For my workflow (Claude Code for Symfony/Shopware projects), I will test Opus 4.7 once it becomes the default in Claude Code. But the days when a single model dominated all categories are over. The AI landscape has become a competition among equals, and that is good for us users.