On April 16, 2026, Anthropic released Claude Opus 4.7. The press release reads as expected: "our most capable model", "substantially better at following instructions", "complex tasks with care and consistency". The independent benchmark results paint a more nuanced picture.
The Good Numbers
On paper, Opus 4.7 is a solid improvement over its predecessor Opus 4.6 -- especially in coding:
| Benchmark | Opus 4.7 | Opus 4.6 | Delta (pp) |
|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | +6.8 |
| SWE-bench Pro | 64.3% | 53.4% | +10.9 |
| Terminal-Bench 2.0 | 69.4% | 65.4% | +4.0 |
| MCP-Atlas | 77.3% | 62.7% | +14.6 |
| CharXiv-R (Vision) | 91.0% | 84.7% | +6.3 |
No question: anyone working daily with Claude Code will notice the difference on complex multi-file refactorings. The SWE-bench Pro jump of +10.9 points is impressive.
Where GPT-5.4 Wins
What's missing from Anthropic's announcement: OpenAI's GPT-5.4 leads in several categories.
| Benchmark | Opus 4.7 | GPT-5.4 |
|---|---|---|
| Terminal-Bench 2.0 | 69.4% | 75.1% |
| Humanity's Last Exam (with tools) | 54.7% | 58.7% |
| GPQA Diamond | 94.2% | 94.4% |
| BrowseComp (agentic search) | 79.3% | 84.0%+ |
Terminal-Bench 2.0 measures a model's ability to autonomously solve terminal-based tasks, precisely the use case Anthropic advertises for Claude Code. GPT-5.4's clear lead here, 75.1% vs 69.4%, matters for anyone running AI agents in the terminal.
GPT-5.4 also leads on GPQA Diamond (graduate-level reasoning) and Humanity's Last Exam, albeit narrowly. Gemini 3.1 Pro, meanwhile, takes the win on multilingual tasks.
The Regressions
Particularly revealing are the areas where Opus 4.7 is worse than its predecessor:
| Benchmark | Opus 4.7 | Opus 4.6 | Delta (pp) |
|---|---|---|---|
| BrowseComp | 79.3% | 84.0% | -4.7 |
| CyberGym | 73.1% | 73.8% | -0.7 |
BrowseComp measures agentic web search. A drop of 4.7 points is not noise; it is a measurable deterioration. On CyberGym (security tasks), Anthropic openly admits it "experimented with efforts to differentially reduce cyber capabilities". Heise reports that Opus 4.7 is "even slightly worse than its predecessor at reproducing security vulnerabilities". The intentional throttling may make sense from a safety perspective, but anyone paying API rates to run IT security audits now gets less capability than before.
The Hidden Price Increase
The price per token remains identical: $5 per million input tokens, $25 per million output tokens. Sounds fair.
What Anthropic communicates less prominently: the new tokenizer produces up to 1.35x as many tokens for the same text. In practice this means the same task with the same prompt can cost up to 35% more. Add to that the new xhigh effort level, which consumes even more reasoning tokens. For teams with high API volume, this is a de facto price increase.
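To put numbers on it, here is a back-of-the-envelope sketch. The request sizes are hypothetical, and I am assuming the 1.35x inflation applies to input and output tokens alike (worst case):

```python
# Cost comparison: same prompt, old vs. new tokenizer.
# Prices are the published list prices; 1.35x is the upper bound cited above.

INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token
TOKENIZER_INFLATION = 1.35        # up to 1.35x as many tokens for the same text

def request_cost(input_tokens: int, output_tokens: int, inflation: float = 1.0) -> float:
    """Cost of a single API call, optionally scaled by tokenizer inflation."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) * inflation

# Hypothetical Claude Code request: 20k input tokens, 4k output tokens.
old = request_cost(20_000, 4_000)                       # Opus 4.6 tokenizer
new = request_cost(20_000, 4_000, TOKENIZER_INFLATION)  # Opus 4.7, worst case

print(f"Opus 4.6: ${old:.4f}  Opus 4.7 (worst case): ${new:.4f}  (+{new / old - 1:.0%})")
# -> Opus 4.6: $0.2000  Opus 4.7 (worst case): $0.2700  (+35%)
```

Same list price, 35% more spend in the worst case, before the xhigh effort level even enters the picture.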
"Follows Instructions Literally" -- Blessing and Curse
Anthropic advertises that Opus 4.7 follows instructions "substantially better". In practice, this means: prompts that worked with Opus 4.6 can produce unexpected results. Bullet lists that earlier models treated as optional hints are now interpreted as hard requirements.
Anthropic itself warns: "Opus 4.7 follows instructions literally. Therefore, existing instructions should be reviewed." Anyone with an established system prompt setup gets to re-tune everything first.
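What that re-tuning can look like, as a minimal sketch with the Anthropic Python SDK; the model identifier and the prompt wording are my assumptions, not official guidance:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Opus 4.6 era: loose bullets, effectively treated as soft guidance.
SYSTEM_OLD = """You are a code reviewer.
- keep answers short
- mention tests
- answer in German
"""

# Opus 4.7: separate hard requirements from preferences explicitly,
# because bullet points are now followed to the letter.
SYSTEM_NEW = """You are a code reviewer.
Hard requirements:
- Answer in German.
Preferences (apply when sensible, not mandatory):
- Keep answers under 200 words.
- Mention missing tests if relevant.
"""

response = client.messages.create(
    model="claude-opus-4-7",  # assumed identifier; check the official model list
    max_tokens=1024,
    system=SYSTEM_NEW,
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
print(response.content[0].text)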
My Verdict
Opus 4.7 is a good coding model, probably the best for SWE-bench tasks. But it is not "the most capable model ever" across the board. GPT-5.4 beats it in terminal tasks, general reasoning, and agentic search. It has measurable regressions compared to its predecessor. And the combination of a new tokenizer and higher token consumption makes it effectively more expensive, without changing the list price.
For my workflow (Claude Code for Symfony/Shopware projects), I will test Opus 4.7 once it becomes the default in Claude Code. But the days when a single model dominated all categories are over. The AI landscape has become a competition among equals, and that is good for us users.