This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality, they increased the tokenizer size by 50%. As a result, the actual cost went up by 50%, they shifted everyone's attention to decrease.
Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.
Very shady marketing.
And of course they lie about 35% again. In reality with coding it is 50%.
UPD: I run playcode.io, so it’s my job test all models, their pricing, quality in order to provide best price/quality/speedy/reliability to non-techy.
In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there were was too much input.
If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.
This is, obviously, early days but I'm impressed.
or
The Dodge Charger is built to be the most Charger like car yet.
It seems being incompetent is a feature now...
Unfortunately that means I won't be using it at work for now.
cool to see, still waiting for models to get better at computer use.
> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.
Weak spots (categories it fails):
- Trivia — 0/3 - basically not much built-in knowledge
- Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
- Puzzle Solving — score 77, flubs carwash-like tests
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me positive, I have found that the more models are optimized for fully agentic development, the worse they get at assisted development and often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
Meanwhile GLM 5.2 drew a cool self-contained fully animated SVG pelican:
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
Similar situation was with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results was different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
"Wow, X models is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigamarole. I do not know what that would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.
I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said if anything I feel like I have to reign in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.
People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)
I keep specific branches a state where they are ready to develop new features.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap us cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.
I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
I can't help but feel this is intentional towards the 'Agentic' workflow.
the incentives aren't there sadly
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
However, I am also confused about market positioning. Too expensive to perform daily tasks - open souce models are much cheaper - and not frontier model to address complex real world problems.
Rarely used Sonnet btw.
I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that
20 minutes after the announcement there's no real useful statement that can be made about it.
Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes onto plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.
Still one of the most intelligent models overall, most likely to get any question you ask correctly (without tools).
I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.
Today's news that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course.
If this was the last model I could ever use I think I would be happy.
As with any new model, you won't know the real impact until you start using it for your workload.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.
I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.
I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.
Playing around with learning the differences is incredibly helpful to schedule on ones calendar weekly for an hour or two, while saving links throughout the week to try out.
Yeah, that’s my thoughts as well. I feel it’s great for benchmarks and some tasks while in other it tries to spend as much tokens as possible, tries to overcomplicate task and needs seconds or third round of steering that costs. With the scale Anthropic operates I bet it’s huge amount of extra money just to make sure their model works.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.
IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.
I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then
z.ai doesnt always have the most reliable AI
but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So thats not a functional difference to me, and the Chinese one won’t reply to subpoenas so thats a value add tbh
So I’ll consider all, fastest tokens/sec wins
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
If you set off a classifier, that's how it looks to Claude.
It makes some sense, as models are trained more and more with reasoning, than without.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...
There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.
Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.
Understandable frankly.
This may be the goal.
At least for Claude family models.
e.g. {
"reason": "<Describe why you picked this result>",
"selection": "<The number of the value you selected>"
}I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.
Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.
I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.
Maybe it works well on things that aren’t clear in the benchmarks.
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
I supposed I shouldn't be surprised at how the trump admin is approaching AI regulation, counter-productive is really all they do
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Wheras, Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
"They took my shit away!" -- 3-day Fable 5 addicts (me)
"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types
"Great to see a closed source company fail!" -- open source boosters
"Great to see an American company fail!" -- anti-US, and/or pro-China folks
"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types
"Serves you right for ripping off creators!" -- copyright warriors
"They keep silently nerfing the models!" -- secret downgrade conspiracy theorists
"Quit killing the planet!" -- anti-datacenter advocates
Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.
For many developers, the agentic AI era began with Sonnet-class models: Claude Sonnet 3.5, 3.6, and 3.7 were the first models that showed impressive skills in coding and tool use. More recently, though, the clearest gains in agentic capabilities have been in our Opus-class models.
Sonnet 5 narrows the gap: its performance is close to that of Opus 4.8, but at lower prices. It’s a substantial improvement over its predecessor, Sonnet 4.6, on important aspects of agentic performance like reasoning, tool use, coding, and knowledge work:

Scores for Sonnet 5 on a variety of evaluations compared to those of Sonnet 4.6 and Opus 4.8 (a more generally capable model, for reference). The Claude Sonnet 5 System Card reports a broader set of evaluations in detail.
Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts. Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
From today, Claude Sonnet 5 is available across all plans: it is the default model for Free and Pro plans, and is available to Max, Team, and Enterprise users. It’s also available in Claude Code and on the Claude Platform, where it launches with introductory pricing of $2 per million input tokens and $10 per million output tokens through August 31, 2026, after which it will be priced at $3 per million input tokens and $15 per million output tokens. Developers can use claude-sonnet-5 via the Claude API.
The charts below compare the performance of Sonnet 5 with Sonnet 4.6 and Opus 4.8 at different effort levels on the agentic search evaluation BrowseComp and the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6 (gray line) and covers a much wider range of cost-performance options than Opus 4.8 (yellow line). It provides substantially improved cost efficiency at medium effort; its higher-effort performance can match Opus 4.8 on some tasks. Between Sonnet 5 and Opus 4.8, users can adjust the effort level to find the right balance of cost and performance.
Feedback from our early access partners has been consistent: Sonnet 5 is much more agentic than its predecessors. Testers described how it finishes complex tasks where previous Sonnet models would stop short, how it checks its own output without explicitly being asked, and how it does all this agentic work at an attractive price point:
Claude Sonnet 5 gives our agents a strong execution layer for multi-step software engineering work. It handles sustained coding, tool use, and debugging well across messy technical contexts, and has been especially useful for workflows where follow-through and technical grounding matter.
We handed Claude Sonnet 5 a two-part job—update Salesforce account tiers, send a launch announcement to enterprise contacts—and it finished end to end. That used to stall halfway. For day-to-day automation, it’s a no-brainer.
Claude Sonnet 5 gets more done with less. Same output quality, fewer steps to get there. It refuses unsafe requests cleanly and consistently, too. At Lovable, we’re putting powerful tools in the hands of millions of builders. A model that knows when to say no is just as important as one that knows how to build.
We ran Claude Sonnet 5 against dozens of our most challenging real pull requests, and it carried each one through to a tested, verified result on its own — freeing our engineers to focus on the judgment, the decision, and the final sign-off.
I asked Claude Sonnet 5 to investigate a bug. Unprompted, it wrote a reproducing test, implemented the fix, then stashed it to confirm the bug came back without the change. All in a single pass.
With Claude Sonnet 5, agents stay on plan, follow our conventions, and ship clean multi-step changes, all at an efficient cost.
Claude Sonnet 5 is at its best on brownfield code—race conditions, hidden tests, the parts nobody wants to touch. It traces a failure to its actual root cause and ships a durable fix instead of patching the symptom.
Claude Sonnet 5 sits on the Pareto frontier for Eve’s plaintiff-law tasks. We see the clearest gains in legal research and analysis, at a price-to-performance ratio that made the choice to migrate easy.
ClickHouse agents explore live data and produce insights on the fly, so time-to-insight matters when testing new models. Claude Sonnet 5 reasons in tighter steps and gets our users to answers noticeably faster. That speed is a difference our customers feel.
At Pace, our computer-use agents run insurance workflows—submission intake, FNOL, loss runs—on the systems our operations teams already use. Claude Sonnet 5 consistently takes the right action and does it quickly, which is what real insurance work demands.
01 /
10
Our pre-deployment safety evaluations found that Sonnet 5 was overall an improvement on Sonnet 4.6. On agentic safety, the model is better at refusing malicious requests and resisting hijack attempts in prompt injection attacks. The model shows lower rates of hallucination and sycophancy than Sonnet 4.6. On our automated behavioral audit, which tests a wide range of misaligned behaviors such as cooperation with misuse and deception, Sonnet 5 scored lower (that is, safer) overall. However, it did show somewhat higher rates of misaligned behavior on this assessment compared to the more capable Opus 4.8 and Claude Mythos Preview.

Rates of misaligned behavior on our automated behavioral audit, which tests for a very wide range of undesirable behaviors across many situations and contexts (see Section 6.4 of the Sonnet 5 System Card for a complete list and results for each specific behavior). Sonnet 5 shows an overall lower rate of misaligned behavior than Sonnet 4.6, though a higher rate than Mythos Preview and Opus 4.8.
We did not deliberately train Sonnet 5 on cybersecurity tasks. It can perform some routine, non-harmful cyber tasks, but on evaluations testing potentially dangerous cyber skills, such as developing software exploits, it shows substantially poorer performance than models such as Opus 4.8 and Mythos 5. Scores from one evaluation, which tested models’ ability to develop exploits for vulnerabilities in the Firefox browser, are shown in the chart below. Sonnet 5 was never able to develop a full working exploit, but it does show a slightly higher rate of partial success than Sonnet 4.6. This latter change is likely due to improvements in general intelligence rather than specific training.

Scores measuring models’ success at developing exploits for software vulnerabilities in Firefox 147 (this evaluation was developed in collaboration with Mozilla; all vulnerabilities have been patched in Firefox 148). For each model, the left-hand bar shows how often the model (without safeguards) developed a working exploit; the right-hand bar shows how often the model had partial success. Neither of the Sonnet models could successfully develop a working exploit (both scored 0.0%); Sonnet 5 showed a slightly higher partial success rate than Sonnet 4.6. Both Sonnet models have substantially poorer cyber capabilities than Opus 4.8 and Mythos 5. For full details, see Section 3.2.4 of the Sonnet 5 System Card.
Since Sonnet 5 is somewhat stronger than its predecessor on these tasks, we’ve launched it with cyber safeguards enabled by default. These safeguards—which detect and block dangerous cyber usage in real time—are the same as those present in Claude Opus 4.7 and 4.8 (because we judged that the overall level of cybersecurity risk from Sonnet 5 was low, the safeguards are less strict than those launched with Fable 5, which block a much wider range of cybersecurity tasks).1
Our full assessment of Sonnet 5 across many safety and capability evaluations is reported in the Claude Sonnet 5 System Card.
Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens.2 We’ve increased rate limits across Chat, Cowork, Claude Code, and the Claude Platform3 to accommodate the higher token usage of higher effort levels; users can select whichever level makes sense for their particular project.
Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5's performance on the evaluation.
We have now updated the chart so that it matches the methodology that we used and discussed in the Sonnet 5 system card (which used a 10M token budget with compaction and programmatic tool calling). We have also updated the surrounding text.
1 Sonnet 5 is part of our Cyber Verification Program, which is available today on the native Claude Platform, the Claude Platform on AWS, and Claude in Microsoft Foundry (hosted on Azure and Anthropic), and coming soon on Claude in Google Vertex. Organizations that are already enrolled in the Cyber Verification Program automatically have the same access on Sonnet 5, with no need to reapply. Overall, we recommend Claude Opus 4.8 for cybersecurity work that requires reduced guardrails.
2 Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral.
3 On April 26, 2026, we raised Sonnet and Haiku rate limits at every usage tier and simplified to three tiers (Start, Build, and Scale) on the native Claude Platform. You can view your tier and current limits in the Claude Console or read the documentation to learn more.
Claude Science is a customizable app that integrates the tools and packages researchers most often use, produces auditable artifacts, and provides flexible access to computing resources.
Claude Tag is a new way for teams to work with Claude.
https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4...
OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks).
IMO, they were quite good with checklists even a year ago, and tried to tick off each one.
They are often used for reading code though.
To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.
In theory the flow works like this:
- small fast models read lots of code, and pass details to the large model to write a plan
- large model takes those details and writes a detailed plan
- medium models write the code
The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.
If it guesses, the plan usually starts to fall to bits.
If it passes back to the large model, inevitable the large model has to start reading lots of code. In that case you are paying the expensive tokens to read so you might as well have it write the code too (many less tokens are written than are read)
It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.
I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.
This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K...
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest...
Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities.
Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on).
And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens).
Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said).
So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk.
I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field?
That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage is quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house.
I think there is. Pair today doesn’t mean they’re locked into that forever.
For the non-bleeding edge they have a lot of competition with more competitors showing up every day.
The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well.
Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDE’s have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+”cheaper for harness owner”
Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.
(ie. won't feel the need to downvote them just for having yet another crappy AI benchmark)
I only recognize it because I build a product that leaves me looking for information on every major release... and every major release a new crop of folks reply confused about the anomalies on top of anomalies that they're seeing, and they slowly learn this person is just way more unserious than the dogged distribution would imply.
Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. Glm and kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000.
Each box has a TDP of 250W. 8 x 250/120V = 16.666A, or one household circuit in the US, so no new power infrastructure is needed.
$1400 x 8+1000=$12,200. Assuming standard five year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally.
So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, if customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshitification, sabotage, ads, product placement, etc, to get the slightly “better” model.
Also, note that my numbers assume moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then.
The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings.
I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do.
> that is ~1.5 - 2.5T in value annually
That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best.
It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further.
The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash.
Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their api and I'm skeptical that these tools are making me that much more productive.
Personally, for assisted development, I haven't seen much progress in a while.
Fable is effectively not available to the general public in the US either
It’s like telling a chef to cook without a knife because knives can kill people.
Dario and his lackeys at Anthropic aren’t visionaries.
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
Claude Code generates more revenue than OpenAI...It appears to be a nice meme.
I don't know whether that comes out ahead compared to just staying with the better model in the first place.
I don't agree with your framing that all negativity is from crazies
I think this sort of thinking is a trap, because it presumes that all software has the same constraints.
There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing".
I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine.
They won't justify trillions in investment, but they are useful.
Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage.
I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes.
Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it.
If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found.
If even that would fail, which did not yet happen, I can still do some old fashioned digging and learning, like I always have.
This is for native mobile app development, and the code base is around 100k LOC.
That's a key point. Keeping knowledge and know how inside the company is strategic. For most people GPS did not result in better sense of direction, spellchecking did not help to write without making mistakes, and delegating translation to deepl does help to be better in a foreign languages. I don't see the gain for an individual, a company, a society if a technology reduces the ability to think, do stuff, understand complex problem, working hard at something. Hiring junior also matters, what is boring for a senior dev is useful for a junior, like the "wax on wax off" in Karatekid. Then when the senior dev retired the junior is not junior anymore and the know how is still here. I want to to transfer my knowledge to a junior, not to anthropic or google or openai.
Ideally, working hand in hand with an AI could be like driving a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle auto-pilot by voice sound a bit stupid and boring. Insane use of energy rarely comes into the equation, which is a bit weird. Personally it is why I am never tempted to use AI. However I see value in AI for finding weakness in a code (inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, asking a fresh view on a very difficult problem. I'd love to hear about the equivalent of move#32 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know how: what keeps us sharp.
Final word: I asked once a review to claude, the codes involved a db transaction. Nothing complicated, Claude said everything was fine. However the transaction isolation level was not set (I did it on purpose, like if I did not know about isolation levels). He did not ask me if it was my intention to keep the default level. I would have preferred a challenging feedback: why did you chose the default isolation level ? Is it on purpose ? Do you know that the default depend on the db ? Do you know about isolation ? Tell me about the business use case and I'll explain which one would be the best.
<banned>
Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model?
Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
Quite a lot of these models have "safety" (lol) filters in front of them, vs it being heavily encoded into the weights not.
Chinese models are quickly commodifying frontier inference, the US Gov is preventing domestic SOTA models access to the public and without those models why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
However the result is exactly the same, concentration of power.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
Most recently Ideogram released an open weight model that will denoise into a grey image with the text "Blocked by safety filter" notice for certain prompts
Of course, because it's open weights people have found defeats
I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.
I'm sure folks' mileage will vary though.
However, these frontier labs are also making moves that could let them capture a disproportionate share of the upside. One possibility is a situation analogous to the smartphone manufacturing space, where there are dozens of players but just a handful (e.g. Apple, Samsung in smartphones) capture the lion's share of the revenue.
As you said, building a script that only you use personally or a very simple thing that just accomplishes one task and it’s easy to test require almost no engineering, and an LLM can often build those with very little downsides.
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like special-education kid multiple times a day.
The graphs show parts of the cost/performance pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.
So neither is pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominent over Sonnet 4.6 on those graphs.
They'll show up after the fact and whinge endlessly about how they should have been involved.
It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that.
So I dunno! Could go either way!
Are there some Less Wrong posts or similar I should read that probably explain it?
I'm talking back-end, with database models, classes, queries, accompanying front-end layouts, with real dynamic data, running. Stuff that takes days to weeks to spin up, with minimal errors or issues, having cut down on days or weeks of effort, you can focus on testing and making it all into better code.
thats true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others progress and accelerate your own. but the fact that they believe its the only option doesnt make it true for the rest of us.
Or maybe every cultural faction has its own set of whiners and the ones we disagree with always seem the loudest.
But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.
Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
It feels like your analysis is mostly spot on, it's the confluence of several motivated parties pouring effort into social media.
Many of the posters are pro-foreign models/pro-open source, and most can't distinguish the difference between "open source" and open weight models like Qwen, Minimax, or GLM.
Reminds me of the old "free as in beer" vs "free as in speech" debate. Free beer means you don't pay, but you don't get to see the recipe or change it. Free speech means you get the actual source and the right to study it, modify it, and redistribute it.
Open weight models are basically the beer version. You can download the weights, run them locally, fine-tune them, quantize them, host them on your own boxes — but what you have is a finished product, not the blueprint for how it was built.
The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approach it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi.
What insiders are you talking about? They're going to be hot towards the possibilities so they can exit to a massive windfall. I dont know why they would want to be publicly critical of these technologies that could make millions on IPO.
Everyone dislikes when these models are provided for use by the Department of Defense, but we can likely assume these newer, more capable models are being used by the NSA, FBI, CIA and other Five Eyes agencies to develop more backdoors, hack into more things to spy on us all.
We get drip fed the weaker models, but only once all the 0days have been used against us.
I am planning to release the steering patch for the GLM 5.2 eliminating pro-CCP alignment in the next few days.
Qwen is also censored - although since it's open weight, there are completely uncensored versions available.
The owners of Qwen can't jack up the prices to something I'm unable to pay. They can't take it away.
The owners of Qwen can't log and train on my data.
Open weight models share far more in common with free speech than free beer.
If big daddy Dario and his company are getting pushback it's not being of some motivated group trying to take them down. They brought it on themselves.
My point is that actually it would be worse for these people if the valuations are only high during this period - which will last awhile longer from now! - where their equity is not liquid, but crashes as the market figures out this commoditization thing.
But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions.
Also Heretic as it is does not work for GLM5.2 (at least as of 3 days ago when I tested it). You'll need some hybrid approaches.
If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show a "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.
(Anyway, the blog post now has an errata and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)
It was a claim that applies to a range of x-values where both curves are defined.
Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?
You could make it true by artificially dropping some of the data points, but, like, why?
(Again, this is moot given the updated graph.)
> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.
Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.