Claude Sonnet 5

The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.

I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster.

Weak spots (categories it fails):

    - Trivia — 0/3 - basically not much built-in knowledge
    - Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
    - Puzzle Solving — score 77, flubs carwash-like tests

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

Got really excited for this model and asked my Opus planners in 3 pretty different projects to use Sonnets instead of Opus subagents to help me experiment on HPC kernels faster. Not one of them ended up writing a single line of code... Sonnets just kept spinning, wasting tokens. Can't remember the last time it happened with Opus in my codebases. Reverting back.

I'm struggling to understand why I'd ever use this instead of just using a lower effort level for Opus, given that on many of the listed benchmarks the cost per task rises above Opus at anything higher than medium effort.

Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.

Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic: I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions.

I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.

From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5

As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"

Claude Sonnet 5 itself described its pelican as looking like a goose:

> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.

https://simonwillison.net/2026/Jun/30/claude-sonnet-5/

Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.

Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
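To make the "roughly cost-neutral" claim concrete, here is a rough sketch of the arithmetic. It assumes Sonnet 4.6 is billed at $3/$15 per million input/output tokens (an assumption, not stated in the quote), takes the $2/$10 launch and $3/$15 full prices for Sonnet 5 from elsewhere in this thread, and assumes the token multiplier applies equally to input and output. The function name and the `input_share` parameter are made up for illustration.

```python
# Prices are USD per million tokens as (input, output).
OLD_PRICE = (3.00, 15.00)    # ASSUMPTION: Sonnet 4.6 list price
INTRO_PRICE = (2.00, 10.00)  # Sonnet 5 launch pricing quoted in this thread
FULL_PRICE = (3.00, 15.00)   # Sonnet 5 post-introductory pricing quoted in this thread


def relative_cost(new_price, old_price, token_multiplier, input_share=0.5):
    """Cost of the same workload on the new model relative to the old one.

    token_multiplier: how many new-tokenizer tokens one old-tokenizer token
    becomes (1.0 to 1.35 per the announcement; some commenters claim 1.5).
    input_share: fraction of the workload's tokens that are input tokens.
    """
    old = input_share * old_price[0] + (1 - input_share) * old_price[1]
    new = token_multiplier * (
        input_share * new_price[0] + (1 - input_share) * new_price[1]
    )
    return new / old


for label, price in (("intro", INTRO_PRICE), ("full", FULL_PRICE)):
    for multiplier in (1.0, 1.35, 1.5):
        ratio = relative_cost(price, OLD_PRICE, multiplier)
        print(f"{label} price, {multiplier:.2f}x tokens: {ratio:.2f}x old cost")
```

Under these assumptions the launch price stays at or below the old cost even at 1.35x tokens, while the full price would cost 1.0x to 1.35x as much for the same text. Because both prices scale by the same factor, `input_share` does not change the ratio here.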

Seems to be another great incremental update to the workhorse, nice!

I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.

And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.

What is the reference, unbiased, honest, reputable and trustworthy site that ranks and compares models on the couple of realistic metrics that matter ("Does it work for code", "no, I mean, for real", "how much does it cost", etc.)?

Until now we've been using Sonnet 4 to power an editing agent in ApostropheCMS. Sonnet is a good price/quality/speed compromise, but sometimes when giving it a large set of instructions it would miss half of them. At least until we told it to go back and try again.

In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there was too much input.

If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.

This is, obviously, early days but I'm impressed.

Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.

This line as a selling point is also pretty funny:

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

I only wish for Opus 4.6 from earlier this year at a faster inference speed. Since Opus 4.6, things have been so much messier, and the overall push for more agency isn't really panning out for agent-assisted development as much as they would like.

Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.

In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).

When can we get a new Haiku? 4.5 came out nearly a year ago, and it's showing its age.

I didn't think they'd actually release a model that was worse than the open-weight frontier and at a higher price-point. Wow.

> Claude Opus 4.7 and later Opus models, Claude Fable 5, Claude Mythos 5, Claude Mythos Preview, and Claude Sonnet 5 use a newer tokenizer that contributes to their improved performance on a wide range of tasks. This tokenizer produces approximately 30% more tokens for the same text. Claude Sonnet 4.6 and earlier models use the previous tokenizer.

Seems like the cyber detection is even on Sonnet now. https://support.claude.com/en/articles/14604842-real-time-cy...

This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.

I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.

I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.

At $3/$15 vs $5/$25 for Opus 4.8, it doesn't seem cheaper enough to be worth it. It depends on how much better it is than e.g. Mimo, but I imagine Mimo and co. are too cost-efficient in the lower tier to be overtaken by Sonnet for most tasks.

That’s nice, but we want Fable

Anthropic outsmarted everyone again.

They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality, the new tokenizer produces 50% more tokens. As a result, the actual cost went up by 50%, while they shifted everyone's attention to the decrease.

Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.

Very shady marketing.

And of course they lie about 35% again. In reality with coding it is 50%.

UPD: I run playcode.io, so it's my job to test all models, their pricing, and their quality in order to provide the best price/quality/speed/reliability to non-techy users.

Kind of hilarious how much they’re touting that it sucks at cybersecurity like it’s a feature

Interesting that tasks on extra high effort cost almost the same as on Opus 4.8, with slightly worse performance.

Tbh we'll see what using it looks like, but the reasoning/cost charts do not look promising. It seems like the only useful reasoning level for Sonnet 5 is Low; medium might trade blows at price/performance with Opus, but anything beyond that Opus is Just Better.

I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.

Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks.

Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).

In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.

I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.

tldr: if you're doing something hard, just use a bigger model.
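For anyone who wants to check this on their own numbers, here is a minimal sketch of finding the Pareto frontier over (cost per task, score) points. The run names and figures below are invented purely for illustration and are not taken from any published chart.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # cost per task in USD (hypothetical)
    score: float  # benchmark score, higher is better (hypothetical)


def pareto_frontier(runs):
    """Return the runs that no other run beats on both cost and score.

    Runs are visited from cheapest to most expensive; a run is kept only if
    it scores strictly higher than every cheaper (or equally priced) run
    already kept. Exact duplicates of a kept run are dropped.
    """
    frontier = []
    best_score = float("-inf")
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Made-up example: a small model at three effort levels, a large one at two.
runs = [
    Run("small-low", 0.10, 50),
    Run("small-medium", 0.25, 62),
    Run("small-high", 0.60, 68),
    Run("large-low", 0.40, 72),
    Run("large-high", 1.20, 85),
]

for run in pareto_frontier(runs):
    print(run.name, run.cost, run.score)
```

In this made-up data, "small-high" falls off the frontier because "large-low" is both cheaper and better, which is the pattern described above.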

But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.

What I'm starting to hate is that each model's effort level can mean completely different power.

Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/

Let’s see how long until opus 5 comes out but to me this lends some credence to the rumour that fable/mythos was supposed to be opus 5

Is there any reason to use Sonnet instead of GLM?

Claude Sonnet 5 is built to be the most agentic Sonnet model yet.

The Dodge Charger is built to be the most Charger like car yet.

Why is Claude Sonnet 5 allowed to be released but OpenAI Terra not? Are they not the same class of models?

Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.

Unfortunately that means I won't be using it at work for now.

The "cheaper models" from big AI companies are next to useless, as they don't even score as well as the open/super cheap Chinese models. Only the frontier big models like Fable and Opus have value.

interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.

Why did this get the coveted "5"? I want an Opus that can compete with GPT 5.5

In my case, 4.6 degraded massively over time. 5 fails the same basic tasks that I gave 4.6 yesterday. And quite frankly this low, med, high, extra, max, turbo, ultra, ludicrous nonsense is getting tiresome

Who cares about Sonnet? I want to know about Fable. Are the export restrictions really going to be permanent?

interesting how much worse the sentiment around Anthropic is getting

I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

It seems being incompetent is a feature now...

Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.

> the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6

cool to see, still waiting for models to get better at computer use.

Great timing. I just started using Claude Sonnet for a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens plus a model that is sufficiently smart, combined with hard verification, make it a perfect combo for the task.

[0] https://github.com/dginovker/BFME-Source-Code/

Claude Sonnet 5 itself described its pelican as looking like a goose:

> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.

https://simonwillison.net/2026/Jun/30/claude-sonnet-5/

That's possibly the worst pelican I saw from all recent LLMs.

Meanwhile GLM 5.2 drew a cool self-contained fully animated SVG pelican:

https://simonwillison.net/2026/Jun/17/glm-52

I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster.

Weak spots (categories it fails):

    - Trivia — 0/3 - basically not much built-in knowledge
    - Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
    - Puzzle Solving — score 77, flubs carwash-like tests

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

Your benchmark has Gemini 3.5 Flash as the best model, which doesn't compute for me

As always, note: faster than GLM-5.2 doesn't mean too much, as GLM-5.2 is served by different providers, so the inference speed can vary drastically between providers or over time.

The (imperfect) comparison, having used both for planning and execution, is that GLM-5.2 is too jumpy and eager to do things, often to a fault (e.g. deploying/using git when it shouldn't), while Sonnet 5 was much lazier than any Claude model I have used: it did not add an addendum to a plan that I asked for, then lied that it had when asked. Looking at the analysis[0] I don't think it's worth it for me. Maybe for others. Fable was certainly much better.

[0]: https://artificialanalysis.ai/models/claude-sonnet-5

I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

I still use Opus 4.6 (with later models for subagents only sometimes), but I have been preparing for it to go away.

So the post-introductory price is set such that Sonnet 5 will cost 100%-135% as much?

"We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."