anthropic doesn't have that. single provider, single pricing decision. whether or not $5k is accurate, the more interesting question is what happens to inference pricing when the supply side is genuinely open. we're seeing hints of it with OpenRouter but it's still intermediated
not saying this solves anthropic's cost problem, just that the "what does inference actually cost" question gets a lot more interesting when providers are competing directly
1. It would be nice to define terms like RSI or at least link to a definition.
2. I found the graph difficult to read. It's a computer font that is made to look hand-drawn and it's a bit low resolution. With some googling I'm guessing the words in parentheses are the clouds the model is running on. You could make that a bit more clear.
Anthropic's models may be similar in parameter size to models on OpenRouter, but none of the others are in the headlines nearly as much (especially recently), so the comparison is extremely flawed.
The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch based on gear count.
It is not. It's a terrible comparison. Qwen, deepseek and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.
That's why the gap between OpenRouter prices and those of the official providers isn't that large. Plus who knows what OpenRouter providers do in terms of quantization. They may be getting 100x better efficiency, hence the competitive price.
That being said, not all users max out their plan, so it's not like each user costs Anthropic 5,000 USD. The hemorrhage would be so brutal they would be out of business in months.
If you remove the cached token cost from pricing the overall api usage drops from around $5000 to $800 (or $200 per week) on the $200 max subscription. Still 4x cheaper over API, but not costing money either - if I had to guess it's break even as the compute is most likely going idle otherwise.
I thought there was no moat in AI? Even being 10x costlier, Anthropic still doesn't have enough compute to meet demand.
Those "AI has no moat" opinions are going to be so wrong so soon.
Alibaba is the primary comparison point made by the author, but it's a completely unsuitable comparison. Alibaba is closer to AWS than Anthropic in terms of its business model. They make money selling infrastructure, not on inference. It's entirely possible they see inference as a loss leader, and are willing to offer it at cost or below to drive people onto the platform.
We also have absolutely no idea if it's anywhere near comparable to Opus 4.6. The author is guessing.
So the article's primary argument is based on a comparison to a company that has an entirely different business model, running a model that the author is just making wild guesses about.
I think it's the other way around? Sparse use of GPU farms should be the more expensive thing. Full saturation means that we can exploit batching effects throughout.
Opus isn't that expensive to host. Look at Amazon Bedrock's t/s numbers for Opus 4.5 vs other chinese models. They're around the same order of magnitude- which means that Opus has roughly the same amount of active params as the chinese models.
Also, you can select BF16 or Q8 providers on openrouter.
I find it a good comparison because it is a good baseline since we have zero insider knowledge of Anthropic. They give me an idea that a certain size of a model has a certain cost associated.
I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect. Current Qwen models perform as well as Sonnet 3, I think. Two years from now, when Chinese models catch up with enough distillation attacks, they would be as good as Sonnet 4.6 and still be profitable.
So no, Claude would not be getting NEARLY as much usage as it's currently getting if it weren't for the $100/$200 monthly subscription. You're comparing Kimi to the price that most people aren't paying.
[1] https://www.wheresyoured.at/anthropic-is-bleeding-out/ [2] https://www.wheresyoured.at/costs/
Aren't they losing money on the retail API pricing, too?
> ... comparisons to artificially low priced Chinese providers...
Yeah, no this article does not pass the sniff test.
Are Anthropic currently unable to sell subscriptions because they don’t have capacity?
I mean... Rolex is an overpriced brand whose cost to consumers is mainly just marketing. Its production cost is nowhere close to its selling price, and looking at gears is a fair way of evaluating that.
These are not cell phone plans which the average joe takes, they are plans purchased with the explicit goal of software development.
I would guess that 99 out of every 100 plans are purchased with the explicit goal of maxing them out.
Why would it go idle? It would go to their next best use. At least they could help with model training or let their researchers run experiments etc.
but $5 that I amortize over 7 years might end up being $1.7 maybe if I don't rapidly combust (supply chain risk)
Absolutely! I'm currently paying $170 to Google to use Opus in Antigravity without limit in full agent mode, because I tried Anthropic's $20 subscription and busted my limit within a single prompt. I'm not gonna pay them $200 only to find out I hit the limit after 20 or even 50 prompts.
And after 2 more months my price is going to double to over $300, and I still have no intention of even trying the 20x Max plan, if it's really just 20x more prompts than Pro.
No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else.
For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" (meaning profit is 5.45x cost, so roughly 84.5% of revenue is profit), see my comment https://news.ycombinator.com/item?id=46663852
When has production cost had anything to do with selling price?
I'm sure Anthropic is making money off the API but I highly doubt it's 90% profit margins.
When I have a feeling that these tools will speed me up, I use them.
My client pays for a couple of these tools in an enterprise deal, and I suspect most of us on the team work like that.
If my goal was to max out every tool my client pays, I’d be working 24hrs a day and see no sunlight ever.
I guess it’s like the all you can eat buffet. Everybody eats a lot, but if you eat so much that you throw up and get sick, you are special.
Why? Because in my experience, the bottleneck is in shareholders approving new features, not my ability to dish out code.
if i hit the limit usually i'm not using it well and hunting around. if i'm using it right i'm basically gassed out trying to hit the limit to the max.
Unlikely. Amazon Bedrock serves Opus at 120tokens/sec.
If you want to estimate "the actual price to serve Opus", a good rough estimate is to find the price max(Deepseek, Qwen, Kimi, GLM) and multiply it by 2-3. That would be a pretty close guess to actual inference cost for Opus.
It's impossible for Opus to be something like 10x the active params as the chinese models. My guess is something around 50-100b active params, 800-1600b total params. I can be off by a factor of ~2, but I know I am not off by a factor of 10.
The Trillions of parameters claim is about the pretraining.
It’s most efficient in pre training to train the biggest models possible. You get sample efficiency increase for each parameter increase.
However those models end up very sparse and incredibly distillable.
And it’s way too expensive and slow to serve models that size so they are distilled down a lot.
42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B, which makes sense since denser models are easier to optimize. As a lower bound, we get 4576B / 42 ≈ 109B active parameters for Opus 4.6. (This makes the assumption that all three models use the same number of bits per parameter and run on the same hardware.)
Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limit of world size via NVLink/TPU torus caps)
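For anyone who wants to redo the throughput arithmetic from the comment above, here is a minimal sketch. It only reuses the tps and active-parameter figures quoted in the thread and keeps the same assumption (same bits per parameter, same hardware); the function names are made up for illustration.

```python
# Back-of-the-envelope estimate from the throughput figures quoted above.
# Assumption (stated in the comment itself): same bits per parameter and
# same hardware for every model, so tokens/sec * active params is comparable.

def param_throughput(tps, active_params_b):
    """Billions of active parameters processed per second."""
    return tps * active_params_b

def implied_active_params(reference_throughput_b, tps):
    """Active parameters (in billions) implied by a reference throughput."""
    return reference_throughput_b / tps

glm_throughput = param_throughput(143, 32)    # GLM 4.7: 143 tps, 32B active
llama_throughput = param_throughput(70, 70)   # Llama 3.3 70B: 70 tps, dense

opus_tps = 42  # Claude Opus 4.6 tps as listed on OpenRouter in the thread

print(glm_throughput, llama_throughput)                          # 4576 4900
print(round(implied_active_params(glm_throughput, opus_tps)))    # 109
print(round(implied_active_params(llama_throughput, opus_tps)))  # 117
```

Using the Llama figure instead of the GLM one moves the estimate from about 109B to about 117B, so the result is only as good as the same-hardware assumption behind it.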
So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)
Comparing Opus 4.6 or GPT 5.4 thinking or Gemini 3.1 pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.
Maybe the common factor here is not having deep/sufficient knowledge on the topic being discussed? For the article I mentioned, I feel like I was less focused on the strength of the writing and more on just understanding the content.
LLMs are very capable at simplifying concepts and meeting the reader at their level. Personally, I subscribe to the philosophy of - "if you couldn't be bothered to write it, I shouldn't bother to read it".
Popular content is popular because it is above the threshold for average detection.
In a better world, platforms would empower defenders, by granting skilled human noticers flagging priority, and by adopting basic classifiers like Pangram.
Unfortunately, mainstream platforms have thus far not demonstrated strong interest in banning AI slop. This site in particular has actually taken moderation actions to unflag AI slop, in certain occasions...
> My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute.
Training currently requires Nvidia's latest and greatest for the best models (they also use Google TPUs now, which are also technically the latest and greatest? However, they're more dual purpose than anything afaik, so that would be a correct assessment in that case)
Inference can run on a hot potato if you really put your mind to it
I am not saying this would be a great use of their compute, but idle is far from the only alternative. (Unless electricity is the binding constraint?)
I'd say Opus is roughly 2x to 3x the price of the top Chinese models to serve, in reality.
Opus is 2T-3T in size at most.
So the article's title is obviously sensationalized.
Also, while Opus certainly is a lot better than even the best Chinese models, when I max out my Claude plan, I make do with Kimi 2.5. When factoring in the re-run of changes because of the lower quality, I'd spend maybe 2x as much per unit of work if I were to pay token prices for all my monthly use w/Kimi.
I'd still prefer Claude if the price comes down to 1x, as it's less hassle w/the harder changes, but their lead is effectively less than a year.
“what X actually is”
“the X reality check”
Overuse of “real” and “genuine”:
> The real story is actually in the article. … And the real issue for Cursor … They have real "brand awareness", and they are genuinely better than the cheaper open weights models - for now at least. It's a real conundrum for them.
> … - these are genuinely massive expenses that dwarf inference costs.
This style just screams “Claude” to me.
It has enough tells in the correct frequency for me to consider it more than 50% generated.
Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the chinese models at most.
However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.
This does somewhat beg the question of how nicely the closed source variants, of undisclosed parameter counts, fit within the 1.1tb of H200 or 1.5tb of B200 systems.
My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute. The relevant quote:
Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company's compute spend patterns.
This is being shared as proof that Anthropic is haemorrhaging money on inference. It doesn't survive basic scrutiny.
I'm fairly confident the Forbes sources are confusing retail API prices with actual compute costs. These are very different things.
Anthropic's current API pricing for Opus 4.6 is $5 per million input tokens and $25 per million output tokens. At those prices, yes - a heavy Claude Code Max 20 user could rack up $5,000/month in API-equivalent usage. That maths checks out.[1]
But API pricing is not what it costs Anthropic to serve those tokens.
The best way to estimate what inference actually costs is to look at what open-weight models of similar size are priced at on OpenRouter - where multiple providers compete on price.
Qwen 3.5 397B-A17B is a good comparison point. It's a large MoE model, broadly comparable in architecture size to what Opus 4.6 is likely to be. Equally, so is Kimi K2.5 1T params with 32B active, which is probably approaching the upper limit of what you can efficiently serve.
Here's what the pricing looks like:
The Qwen 3.5 397B model on OpenRouter (via Alibaba Cloud) costs _$0.39_ per million input tokens and _$2.34_ per million output tokens. Compare that to Opus 4.6's API pricing of $5/$25. Kimi K2.5 is even cheaper at $0.45 per million input tokens and $2.25 output.
That's roughly 10x cheaper.
And this ratio holds for cached tokens too - DeepInfra charges $0.07/MTok for cache reads on Kimi K2.5 vs Anthropic's $0.50/MTok.
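As a quick check on the "roughly 10x" claim, the sketch below just divides the prices quoted above. All figures are the per-million-token prices from the text; the dictionary layout and function name are illustrative only.

```python
# Price ratios between Anthropic's Opus 4.6 API and the open-weight
# listings quoted above (USD per million tokens).

ANTHROPIC = {"input": 5.00, "output": 25.00, "cache_read": 0.50}

OPEN_WEIGHT = {
    "Qwen 3.5 397B (Alibaba Cloud)": {"input": 0.39, "output": 2.34},
    "Kimi K2.5": {"input": 0.45, "output": 2.25},
    "Kimi K2.5 (DeepInfra cache reads)": {"cache_read": 0.07},
}

def price_ratio(frontier_price, other_price):
    """How many times more expensive the frontier price is."""
    return frontier_price / other_price

for name, prices in OPEN_WEIGHT.items():
    for kind, price in prices.items():
        ratio = price_ratio(ANTHROPIC[kind], price)
        print(f"{name:36s} {kind:10s} {ratio:4.1f}x")
```

On these numbers the input and output ratios land between about 10.7x and 12.8x, while the cache-read ratio is closer to 7x.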
These OpenRouter providers are running a business. They have to cover their compute costs, pay for GPUs, and make a margin. They're not charities. If so many can serve a model of comparable size at ~10% of Anthropic's API price and remain in business, it is hard for me to believe that they are all taking enormous losses (at ~the exact same rate range).
If a heavy Claude Code Max user consumes $5,000 worth of tokens at Anthropic's retail API prices, and the actual compute cost is roughly 10% of that, Anthropic is looking at approximately $500 in real compute cost for the heaviest users.
That's a loss of $300/month on the most extreme power users - not $4,800.
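The arithmetic behind that figure can be sketched as follows. The 10% cost fraction is the article's own estimate, not a known number, and the function names are hypothetical.

```python
# Per-subscriber margin under the article's assumption that real compute
# cost is about 10% of retail API pricing. The 10% is an estimate.

def monthly_margin(subscription_usd, api_equivalent_usd, cost_fraction=0.10):
    """Subscription revenue minus estimated compute cost (negative = loss)."""
    return subscription_usd - api_equivalent_usd * cost_fraction

def break_even_usage(subscription_usd, cost_fraction=0.10):
    """API-equivalent usage at which the subscription exactly covers cost."""
    return subscription_usd / cost_fraction

# Heaviest Max 20x user: $5,000 of API-equivalent usage on a $200 plan.
print(round(monthly_margin(200, 5000)))   # -300
# Usage level where the $200 plan breaks even under the same assumption.
print(round(break_even_usage(200)))       # 2000
```

Changing `cost_fraction` shows how sensitive the conclusion is: at 4% (the $200 plan covering $5,000 of usage) even the heaviest user breaks even, while at 20% the loss grows to $800/month.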
However, most users don't come anywhere near the limit. Anthropic themselves said when they introduced weekly caps that fewer than 5% of subscribers would be affected. I personally use the Max 20x plan and probably consume around 50% of my weekly token budget and it's hard to use that many tokens without getting serious RSI. At that level of usage, the maths works out to roughly break-even or profitable for Anthropic. [2]
The real story is actually in the article. The $5,000 figure comes from Cursor's internal analysis. And for Cursor, the number probably is roughly correct - because Cursor has to pay Anthropic's retail API prices (or close to it) for access to Opus 4.6.
So to provide a Claude Code-equivalent experience using Opus 4.6, it would cost Cursor ~$5,000 per power user per month. But it would cost Anthropic perhaps $500 max.
And the real issue for Cursor is that developers want to use the Anthropic models, even in Cursor itself. They have real "brand awareness", and they are genuinely better than the cheaper open weights models - for now at least. It's a real conundrum for them.
Obviously Anthropic isn't printing free cashflow. The costs of training frontier models, the enormous salaries required to hire top AI researchers, the multi-billion dollar compute commitments - these are genuinely massive expenses that dwarf inference costs.
But on a per-user, per-token basis for inference? I believe Anthropic is very likely profitable - potentially very profitable - on the average Claude Code subscriber.
The "AI inference is a money pit" narrative is misinformation that actually plays into the hands of the frontier labs. If everyone believes that serving tokens is wildly expensive, nobody questions the 10x+ markups on API pricing. It discourages competition and makes the moat look deeper than it is.
If you want to understand the real economics of AI inference, don't take API prices at face value. Look at what competitive open-weight model providers charge on OpenRouter. That's a much closer proxy for what it actually costs to run these models - and it's a fraction of what the frontier labs charge.
[1] An HN user claimed they were burning 150M-200M tok/day. Assuming a 95% cache hit rate and a 90% input/output ratio, this works out at somewhere between $400-$600/day in "API" costs, which is pretty much bang on the $5,000/month estimate ($4,200-$6,000). I got the cache hit rate stats and input/output breakdown from this blog and scaled it up for that usage.
[2] According to Anthropic's own /cost command data, the average Claude Code developer uses about $6/day in API-equivalent spend, with 90% under $12/day. That's $180/month average. At 10% actual cost, that's $18/month to serve - against a $20-200 subscription.