Claude Opus 4.7 leads on SWE-bench and agentic reasoning, beating GPT-5.4 and Gemini 3.1 Pro

April 16, 2026

In short: Anthropic has released Claude Opus 4.7, its most capable generally available model, with benchmark-leading scores on SWE-bench Pro (64.3% vs GPT-5.4’s 57.7%), multi-agent coordination for hours-long workflows, more than triple the image resolution of prior Claude models, and a 14% improvement in multi-step agentic reasoning with a third of the tool errors. Priced at $5 per million input tokens and $25 per million output tokens, it is available across Claude plans and through Amazon Bedrock, Vertex AI, and Microsoft Foundry.

Anthropic has released Claude Opus 4.7, its most capable generally available model to date, with benchmark-leading performance in software engineering and agentic reasoning that widens the gap between Claude and both OpenAI’s GPT-5.4 and Google’s Gemini 3.1 Pro on the tasks that matter most to developers and enterprise users.

The release comes at a moment when Anthropic’s commercial momentum is difficult to overstate. The company is running at a $30 billion annualised revenue rate, has attracted investor offers at roughly $800 billion, and is in early IPO talks. Opus 4.7 is the model that has to justify those numbers, not by winning every benchmark, but by being the model that enterprises and developers choose to build on.

Where it leads

The headline numbers are in software engineering. On SWE-bench Pro, the benchmark that tests a model’s ability to resolve real-world software issues from open-source repositories, Opus 4.7 scores 64.3%, up from 53.4% on Opus 4.6 and well ahead of GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. On SWE-bench Verified, a curated subset, the score is 87.6%, compared with 80.8% for its predecessor and 80.6% for Gemini 3.1 Pro.

CursorBench, which measures autonomous coding performance in the popular AI code editor, shows a similar jump: 70%, up from 58% on Opus 4.6. For a model that is already the default choice in Cursor and Claude Code, the improvement on the benchmark most directly tied to how developers actually use it is significant. Claude Code alone hit $2.5 billion in annualised revenue in February, and AI-assisted coding has become one of the fastest-growing categories in software.

On graduate-level reasoning, measured by GPQA Diamond, the field has converged. Opus 4.7 scores 94.2%, GPT-5.4 Pro scores 94.4%, and Gemini 3.1 Pro scores 94.3%. The differences are within noise. The frontier models have effectively saturated this benchmark, which means the competitive differentiation is shifting away from raw reasoning scores and toward applied performance on complex, multi-step tasks.

The agentic step

Opus 4.7’s most consequential improvements may not be captured by any single benchmark. Anthropic says the model delivers a 14% improvement over Opus 4.6 on complex multi-step workflows while using fewer tokens and producing a third of the tool errors. It is the first Claude model to pass what Anthropic calls “implicit-need tests,” tasks where the model must infer what tools or actions are required rather than being told explicitly.

The model also introduces multi-agent coordination, the ability to orchestrate parallel AI workstreams rather than processing tasks sequentially. For enterprise users running Claude across code review, document analysis, and data processing simultaneously, this is the kind of capability that translates directly into throughput. Anthropic says Opus 4.7 is engineered to sustain focus over hours-long workflows, a claim that, if it holds, addresses one of the most common complaints about frontier models: that they lose coherence and precision on extended agentic tasks.
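
To make the fan-out concrete, the sketch below runs several independent Claude requests in parallel from the client side using the Anthropic Python SDK. It is illustrative only: the "claude-opus-4-7" model identifier and the three workstream prompts are assumptions, not confirmed values.

```python
# Illustrative client-side fan-out: run independent workstreams
# concurrently instead of sequentially. The model id below is an
# assumption, not a confirmed identifier.
from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASKS = [  # hypothetical parallel workstreams
    "Review this pull request diff for correctness: ...",
    "Summarise this contract's termination clauses: ...",
    "Check this CSV extract for schema drift: ...",
]

def run_workstream(prompt: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Threads are sufficient here because the calls are I/O-bound.
with ThreadPoolExecutor(max_workers=len(TASKS)) as pool:
    for result in pool.map(run_workstream, TASKS):
        print(result[:200])
```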

Resilience is another emphasis. The model is designed to continue executing through tool failures that would have stopped Opus 4.6, recovering and adapting rather than halting. For automated pipelines where a single failure can cascade, this kind of robustness matters more than marginal benchmark gains.
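
The same idea has a client-side counterpart. In the Messages API tool-use format, a failed tool call can be reported back to the model as an errored tool_result instead of aborting the loop, leaving the model to retry or route around the failure. A minimal sketch, with the failing tool itself a stand-in:

```python
# Minimal sketch of surviving a tool failure in an agent loop: the
# error is surfaced to the model via an errored tool_result (is_error
# is part of the Messages API tool-use format) rather than halting.
def flaky_search(query: str) -> str:
    raise TimeoutError("search backend unavailable")  # simulated failure

def run_tool(tool_use_id: str, query: str) -> dict:
    try:
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": flaky_search(query)}
    except Exception as exc:
        # Feed the failure back so the model can adapt.
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": f"tool error: {exc}", "is_error": True}

print(run_tool("toolu_demo", "quarterly revenue"))
```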

Vision and context

Opus 4.7 processes images at resolutions up to 2,576 pixels on the long edge, more than three times the capacity of prior Claude models. The improvement is aimed at enterprise document analysis, where scanned contracts, technical drawings, and financial statements often contain fine print and detail that lower-resolution vision models miss or hallucinate.
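
For callers preparing documents, that ceiling is easy to respect up front. A minimal sketch, assuming Pillow and placeholder file names, that downscales an image so its long edge stays within the quoted limit before upload:

```python
# Downscale an image so its long edge is at most 2,576 px, the
# resolution ceiling quoted above. Requires Pillow; the file names
# are placeholders.
from PIL import Image

MAX_LONG_EDGE = 2576

def fit_long_edge(path: str, out_path: str) -> None:
    with Image.open(path) as img:
        scale = MAX_LONG_EDGE / max(img.size)
        if scale < 1:  # only shrink, never upscale
            img = img.resize(
                (round(img.width * scale), round(img.height * scale)),
                Image.LANCZOS,
            )
        img.save(out_path)

fit_long_edge("scanned_contract.png", "scanned_contract_fit.png")
```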

The context window remains at one million tokens, half of Gemini 3.1 Pro’s two million but sufficient for most enterprise use cases. On long-context research benchmarks, Opus 4.7 tied for the top overall score at 0.715 across six research modules and delivered what evaluators described as the most consistent long-context performance of any model tested.

Anthropic notes that the model follows instructions more literally than its predecessors, a change that may require users to adjust existing prompts. This is a trade-off: tighter instruction-following reduces the ambiguity that sometimes produces creative or unexpected outputs, but it also reduces the hallucination and off-task behaviour that frustrates enterprise deployments.

Pricing and availability

Opus 4.7 is available immediately on Claude Pro, Max, Team, and Enterprise plans, and through the API at $5 per million input tokens and $25 per million output tokens. Prompt caching offers up to 90% cost savings, and the Batch API provides a 50% discount on both input and output. The model is also available through Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry.

The pricing is unchanged from Opus 4.6, which means Anthropic is delivering substantially better performance at the same cost. Gemini 3.1 Pro undercuts it at $2 and $12 per million tokens for input and output respectively, but Opus 4.7’s lead on the benchmarks that enterprise buyers care about, particularly SWE-bench and agentic reasoning, may justify the premium for customers whose workloads demand the highest capability.
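
The premium is straightforward to quantify. A back-of-envelope calculation from the quoted rates, using an invented monthly volume of 50 million input and 10 million output tokens:

```python
# Cost arithmetic from the rates quoted above; the monthly volume is
# an invented illustration, not a benchmark figure.
INPUT_MTOK, OUTPUT_MTOK = 50, 10  # millions of tokens per month

opus_on_demand = INPUT_MTOK * 5.00 + OUTPUT_MTOK * 25.00    # $500.00
gemini_on_demand = INPUT_MTOK * 2.00 + OUTPUT_MTOK * 12.00  # $220.00
opus_batched = opus_on_demand * 0.5  # Batch API: 50% off both directions

print(f"Opus 4.7 on demand: ${opus_on_demand:,.2f}/mo")
print(f"Opus 4.7 batched:   ${opus_batched:,.2f}/mo")
print(f"Gemini 3.1 Pro:     ${gemini_on_demand:,.2f}/mo")
```

At that mix, Opus 4.7 costs roughly 2.3x as much on demand, and even the batched rate sits slightly above Gemini's on-demand price.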

Anthropic has also added cyber safeguards that automatically detect and block requests indicating prohibited or high-risk cybersecurity uses, a nod to the dual-use concerns that led the company to restrict its more powerful Mythos model to just 11 organisations under Project Glasswing.

What it means

Opus 4.7 is not a paradigm shift. It is a meaningful improvement across every dimension that matters to the people who pay for Claude: better coding, better agentic reasoning, better vision, better instruction-following, and better resilience on long tasks. The model does not win every benchmark against every competitor, but it wins convincingly on the ones most directly tied to real-world productivity.

For Anthropic, the release reinforces the position that has driven its extraordinary revenue growth. Claude is the model that developers and enterprises reach for when they need reliable, high-quality output on complex work. Opus 4.7 extends that lead at a moment when the company’s commercial trajectory depends on it. The competition is close, and closing. But for now, on the tasks that generate the most revenue, Anthropic has the best model on the market.
