Claude Sonnet 5 raises the floor

Today, 30 June 2026, Anthropic shipped Claude Sonnet 5, which it calls the most agentic Sonnet model yet: a model that plans, works a terminal and a browser, and carries long multi-step jobs on its own, at performance the company describes as approaching Opus 4.8 at a fraction of the cost. It is generally available everywhere at once, on the Claude API, AWS, Google Cloud, and Microsoft Foundry, inside Claude Code, and as the new default model for every free and Pro user of the Claude apps.

The timing gives the release its shape. Three weeks ago we read the Fable 5 launch as one model behind two doors; two days later a government order closed both, and eighteen days on, Fable 5 and Mythos 5 are still dark. The most capable model Anthropic actually sells today is Opus 4.8. Into that gap Anthropic has not shipped a new ceiling. It has shipped a higher floor: a workhorse-tier model that runs a band below the flagship for a fraction of the price, and for an operating business the floor, not the ceiling, is where the work runs.

The span, not the column

A release table is usually read down the new model’s column. The more useful reading today is across each row: where the new workhorse lands on the span from the old workhorse to the flagship.

Figure 1. Each row runs from Sonnet 4.6 to the row's best model. On the agentic rows Sonnet 5 closes most of the span to Opus 4.8, on Humanity's Last Exam with tools it lands half a point short, and on GDPval knowledge work it edges past the flagship, 1618 to 1615.

On terminal work the span nearly closes: 80.4% against the flagship’s 82.7%, where the old Sonnet managed 67.0%. On Humanity’s Last Exam with tools the gap is half a point, 57.4 against 57.9. And on GDPval, the benchmark built from real occupational knowledge work, the workhorse is not approaching the flagship. It is past it, 1618 to 1615, with the old Sonnet at 1395. Coding keeps the clearest distance, 63.2 against 69.2 on SWE-bench Pro, which is exactly the shape you would expect: the flagship earns its price on the hardest engineering, and on everything an operations agent spends its day doing, the tiers have converged.
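The row-wise reading can be made concrete with a little arithmetic. The sketch below is ours, not Anthropic's: it takes the scores quoted above and computes how much of the distance from Sonnet 4.6 to Opus 4.8 the new model covers. The function name span_closed is illustrative, and we measure every row against Opus 4.8 rather than the row's best model, so a value above 1 means the workhorse has passed the flagship. Only the two rows for which the text gives all three scores are included.

```python
def span_closed(old: float, new: float, best: float) -> float:
    """Fraction of the old-to-best distance that the new model covers."""
    if best == old:
        raise ValueError("old and best scores are equal; the span is zero")
    return (new - old) / (best - old)


# Scores as quoted in the text: (Sonnet 4.6, Sonnet 5, Opus 4.8).
rows = {
    "terminal work": (67.0, 80.4, 82.7),
    "GDPval": (1395, 1618, 1615),
}

for name, (old, new, best) in rows.items():
    print(f"{name}: {span_closed(old, new, best):.0%} of the span closed")
```

On the quoted scores this comes to about 85% of the span on terminal work and just over 100% on GDPval, which is the arithmetic behind "nearly closes" and "past it".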

What the curve prices

When Fable 5 shipped we wrote that the most useful chart in a release is not the leaderboard but the accuracy-against-cost curve, because that curve is what an operating business actually buys. The same chart is the center of this release, and this time it moved at the tier a business would run every day.

Figure 2. BrowseComp pass rate against cost per task (log scale), by effort level. The old Sonnet's curve is stranded down and to the right; Sonnet 5's runs above the flagship's for most of its length, and its high setting clears Opus 4.8's entry point at the same spend.

Sonnet 5 exposes the same five effort levels as the flagship, and its whole curve sits up and to the left of the old Sonnet’s. Its high setting costs about what Opus 4.8’s low setting does and lands a band above it. Its ceiling touches the flagship’s. And its floor prices work the old curves never priced at all: agentic search at useful accuracy for under two dollars a task. Anthropic’s own migration note is the plainest statement of the shift: Sonnet 5 at medium is comparable to Sonnet 4.6 at high, and Sonnet 5 at high is comparable to Sonnet 4.6 at max. The setting names stayed still while the intelligence under them moved up a rung.
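For a team carrying effort settings over from Sonnet 4.6, the migration note reduces to a small lookup. The sketch below is a hypothetical helper of ours, not part of any SDK. It encodes only the two equivalences the note states, and it assumes the level names are spelled as they appear in the text. Levels the note does not mention are passed through unchanged, which is our assumption rather than Anthropic's guidance.

```python
# Sonnet 4.6 effort -> Sonnet 5 effort the migration note calls comparable.
# Only the two pairs quoted in the text are encoded here.
STATED_EQUIVALENTS = {"high": "medium", "max": "high"}


def migrate_effort(old_effort: str) -> str:
    """Map a Sonnet 4.6 effort level to its stated Sonnet 5 counterpart.

    Levels the note does not cover are returned unchanged (an assumption).
    """
    return STATED_EQUIVALENTS.get(old_effort, old_effort)


for level in ("low", "high", "max"):
    print(f"Sonnet 4.6 at {level} -> Sonnet 5 at {migrate_effort(level)}")
```

Stepping down a rung holds capability roughly constant and banks the cost difference; keeping the old setting name buys the extra capability instead.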

The pricing carries one honest asterisk. Per-token prices are unchanged from Sonnet 4.6, $3 in and $15 out per million, against the flagship’s $5 and $25. But Sonnet 5 reads text through a new tokenizer that produces roughly 30% more tokens from the same words, so an equivalent request costs more than the sticker suggests, and Anthropic is bridging the difference with launch pricing of $2 and $10 through 31 August. The per-task curves above already have the new tokenizer priced in, which is why they, and not the per-token sticker, are the number to plan on.
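The asterisk is easy to put a number on. The sketch below is a back-of-envelope estimate of ours under stated assumptions: it treats "roughly 30%" as exactly 1.30, and it applies the same inflation to input and output tokens, which the text does not confirm. The names TOKEN_INFLATION and effective_price are illustrative. The result is the price of a million old-tokenizer tokens' worth of text once the new tokenizer has counted it.

```python
TOKEN_INFLATION = 1.30  # "roughly 30% more tokens" per the text; approximate


def effective_price(list_price: float, inflation: float = TOKEN_INFLATION) -> float:
    """Dollars per million old-tokenizer tokens' worth of the same text."""
    return list_price * inflation


# (input, output) list prices in dollars per million tokens, from the text.
prices = {
    "Sonnet 5 standard": (3.0, 15.0),
    "Sonnet 5 launch": (2.0, 10.0),
}

for label, (inp, out) in prices.items():
    print(f"{label}: ${effective_price(inp):.2f} in, ${effective_price(out):.2f} out")
```

Under these assumptions the standard price works out to about $3.90 in and $19.50 out in Sonnet 4.6 terms, and the launch price to about $2.60 and $13.00, just under the old sticker. The text does not say whether Opus 4.8 shares the new tokenizer, so the flagship's $5 and $25 are left unadjusted here.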

The door arrives down-tier

In June we wrote that the two-doors release moved safety out of the weights and into the access tier. Sonnet 5 confirms that structure is not a flagship exception but the new architecture of the product line: it is the first Sonnet-tier model to ship behind real-time cyber safeguards, the same door Opus carries, and a request that crosses it comes back not as an error but as a completed response that declines, an HTTP 200 with a refusal as the stop reason.
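The practical consequence is that a status-code check alone no longer tells an integration whether the work was done. The sketch below is a minimal illustration of ours, with no network call: it classifies an already-parsed response. The field name stop_reason and the literal value "refusal" are assumptions taken from the wording above, and the helper names are hypothetical; check both against the API reference before relying on them.

```python
from typing import Any, Mapping


def is_refusal(status_code: int, body: Mapping[str, Any]) -> bool:
    """True when a successful HTTP response carries a refusal stop reason."""
    # Assumed field name and value; not confirmed by the text beyond
    # "a refusal as the stop reason".
    return status_code == 200 and body.get("stop_reason") == "refusal"


def classify(status_code: int, body: Mapping[str, Any]) -> str:
    """Sort a response into 'error', 'declined', or 'ok'."""
    if status_code != 200:
        return "error"
    if is_refusal(status_code, body):
        return "declined"
    return "ok"


print(classify(200, {"stop_reason": "refusal"}))
print(classify(200, {"stop_reason": "end_turn"}))
print(classify(429, {}))
```

Treating "declined" as its own outcome, separate from both success and error, keeps retry logic from hammering a request that will never pass the door and keeps a refusal from being logged as a completed task.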

Figure 3. Working exploits (solid) and register control only (faded) on the Firefox 147 evaluation built with Mozilla. Sonnet 5 develops no working exploit, 0.0% against the unguarded Mythos 5's 88.4%, yet it is the first Sonnet to ship behind the cyber door.

The irony is visible in the release’s own chart. On the Firefox exploit-development evaluation, Sonnet 5 produced zero working exploits; the gated Mythos 5 produced them 88.4% of the time. The model least able to do the dangerous work is the first of its tier to get the guard, and that is the point: the door is not a statement about this model’s character, it is the default posture of the product line now, fitted at every tier before the capability arrives rather than after. What we said about the structure in June holds one level down. What the model will do for you is decided at the door, and now every tier has one.

Where this honestly stands

The discipline of a dispatch is to mark the edges. On Anthropic’s automated behavioral audit Sonnet 5 scores 2.53, cleaner than the 2.89 of the model it replaces, but above Opus 4.8’s 2.10 and Mythos Preview’s 1.95: better than its tier, not yet the flagship’s temperament.

Figure 4. The automated behavioral audit, 1 to 10, lower is better. Sonnet 5 at 2.53 is cleaner than the Sonnet it replaces and short of the flagship's line: the safety gains are real, and the tiers are still tiers.

The API is stricter than the tier has ever been, in ways we flagged when Opus 4.8 asked for the same discipline. Sampling knobs are gone: setting temperature, top_p, or top_k returns an error rather than a result. Manual thinking budgets are gone; adaptive thinking is on by default and steered by the effort parameter. The model follows instructions more literally than its predecessors, scopes itself strictly at low effort, and does not generalize a request you did not make. Priority Tier capacity is absent at launch. None of these are defects; they are the flagship’s operating discipline arriving at the tier most software actually runs on, and prompts tuned to older, looser models will want a pass.
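Because the sampling knobs now return an error instead of being ignored, older request builders need a cleanup step before they are pointed at the new model. The sketch below is a hypothetical pre-flight helper of ours. The three parameter names come from the text; the flat dictionary request shape and the model identifier in the example are assumptions for illustration only.

```python
# Parameter names the text says now return an error when set.
REMOVED_SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def strip_sampling_params(request: dict) -> tuple[dict, list[str]]:
    """Return a copy of the request without sampling keys, plus the keys dropped."""
    cleaned = {k: v for k, v in request.items() if k not in REMOVED_SAMPLING_KEYS}
    dropped = [k for k in REMOVED_SAMPLING_KEYS if k in request]
    return cleaned, dropped


# Hypothetical legacy request; the model id is a placeholder, not a confirmed name.
legacy = {"model": "sonnet-5-placeholder", "max_tokens": 1024, "temperature": 0.2, "top_p": 0.9}
cleaned, dropped = strip_sampling_params(legacy)
print(cleaned)
print("dropped:", dropped)
```

Returning the dropped keys, rather than removing them silently, gives the prompt pass a record of which call sites were leaning on sampling settings and may need their instructions tightened instead.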

What to do with this

For a year the frontier story has been read at the ceiling: what the best model can now do, told in benchmarks nobody’s operation runs. June complicated that story from two directions. The recall taught that the ceiling can be switched off by a government you do not vote for, and today’s release makes the quieter, more durable point: the floor is rising underneath it either way. The tier priced for the everyday, the one a business would wire into the reservations system, the supplier thread, and the books, now runs last generation’s flagship band at three-fifths the per-token price, with the same effort dial and the same door.

We wrote in May that the capability overhang is a planning question, and every release since has widened it on a schedule nobody’s operation kept. The overhang was never about the ceiling. It is the distance between the floor and what a business actually runs on, and today that distance grew again. If the workhorse tier reaching flagship territory changes what an agent in your operation would cost to run every day, and it does, start a conversation with us about a Discovery Phase.

References

Anthropic. Claude Sonnet 5. 30 Jun 2026. anthropic.com/news/claude-sonnet-5
Anthropic. What’s new in Claude Sonnet 5. Claude Docs. platform.claude.com/docs
Anthropic. Prompting Claude Sonnet 5. Claude Docs. platform.claude.com/docs
Anthropic. Transparency Hub. anthropic.com/transparency
