An update on recent Claude Code quality reports

Bit surprised about the amount of flak they're getting here. I found the article seemed clear, honest and definitely plausible.

The deterioration was real and annoying, and shines a light on the problematic lack of transparency of what exactly is going on behind the scenes and the somewhat arbitrary token-cost based billing - too many factors at play, if you wanted to trace that as a user you can just do the work yourself instead.

The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.

My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of VM output.

A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

They lost me at Opus 4.7

Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.

Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely see it make any mistakes.

At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."

However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.

This black box approach that large frontier labs have adopted is going to drive people away. To change fundamental behavior like this without notifying them, and only retroactively explaining what happened, is the reason they will move to self-hosting their own models. You can't build pipelines, workflows and products on a base that is just randomly shifting beneath you.

>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.

IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should.

PS I’m not referencing a well-known book to suggest the solution is trite product group think, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the later recently

" Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause"

I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude code for personal work and most of the time, I'm making it do a task which could say take ~10 to 15mins. Note that I spend a lot of time back and forth with the model planning this task first before I ask it to execute it. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it. There are very high chances that it takes me > 1h to come back to Claude.

I see some anthropic claude code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether claude got dumber. Even though he was really harsh on anthropic and said some mean stuff. I thought some of the points he was raising about claude code was quite apt. Especially when it comes to the harness bloat. I really hope the new features now stop and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated more optimized alternatives. Focus on making the harness better and less token consuming.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

Are they also going to refund all the extra usage api $$$ people spent in the last month?

Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.

Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.

Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.

The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.

> Today we are resetting usage limits for all subscribers.

I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.

Will they restore my account usage limits? Since I no longer have Max?

Is that one week usage restored, or the entire buggy timespan?

I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?

Damn it was real the whole time. I found Opus 4.7 to holistically underperform 4.6, and especially in how much wordiness there is. It's harder to work with so I just switched back to 4.6 + Kimi K2.6. Now GPT 5.5 is here and it's been excellent so far.

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.

>On April 16, we added a system prompt instruction to reduce verbosity

In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.

At least tell users when the system prompt has changed.

Some people seem to be suggesting these are coverups for quantization...

Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.

I would not suspect quantization before I would suspect harness changes.

I see the Claude team wanted to make it less verbose, but that's actually something that bothered me since updating to Claude 4.7, what is the most recommended way to change it back to being as verbose as before? This is probably a matter of preference but I have a harder time with compact explanations and lists of points and that was originally one of the things I preferred with Claude.

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.

Claude caveman in the system prompt confirmed?

This is a very interesting read on failure modes of AI agents in prod.

Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.

Curious what helped catch in the later eval vs. initial ones. Was it that the initial testing was online A/B comparison of aggregate metrics, or that the dataset was not broad enough?

Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.

I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.

yesterday CC created a fastapi /healthz endpoint and told me it's the gold standard (with the ending z). today I stopped my max sub and will be trying codex

> As of April 23, we’re resetting usage limits for all subscribers.

Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT) ? So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!

> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.

This sounds fishy. It's easy to show users that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.

ugh, caching based on idle time is horrible for my usage anyway; since claude is both fairly slow and doesn't really have much of a daily quota anyway I often tell it to do something and then wander off and come back to check on it when I next think about it. I always vaguely assumed that my session would not "detect" the intervening time anyway since it was all async. I guess from a global perspective time-based cache eviction makes sense.

The third bug is the one worth dwelling on. Dropping thinking blocks every turn instead of just once is the kind of regression that only shows up in production traffic. A unit test for "idle-threshold clearing" would assert "was thinking cleared after an hour of idle" (yes) without asserting "is thinking preserved on subsequent turns" (no). The invariant is negative space.

The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.

> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"

Do researchers know correlation between various aspects of a prompt and the response?

LLM, to me at least, appears to be a wildly random function that it's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system returned the output. This doesn't appear to be the case for LLM where inputs and outputs are any texts.

Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.

Opus 4.7 is very rough to work with. Specifically for long-horizon (we were told it was trained specifically for this and less handholding).

I don't have trust in it right now. More regressions, more oversights, it's pedantic and weird ways. Ironically, requires more handholding.

Not saying it's a bad model; it's just not simple to work with.

for now: `/model claude-opus-4-6[1m]` (youll get different behavior around compaction without [1m])

It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.

The artificial creation of demand is also a concerning sign.

If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price. The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time. I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.

The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwith... do you want me to search that?...

YES... DO IT... FRICKING MACHINE..

1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)

2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

3. System prompt to make Claude less verbose reducing coding quality (4 days - better)

All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.

However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.

Doing this proactively would certainly match expectations for a fast-moving product like this.

What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. It is likely my task was a difficult or poorly posed, but I still have some idea of what 4.5 should have done on it. This was not it. What experiences are other people having with the 4.7? How about with other model versions, if they are trying them? (In both cases, I ran on max effort, for whatever that is worth.)

Is it just for me that the reset cycle of usage limits has been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported very random delay, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of other bugs? :-S

Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because it's kicked out of cache and having zero cache hits (causes everything to be way more expensive) so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.

They had this ready and timed it for GPT 5.5 announcement. Zero chance it's a coincidence .

Those are exactly the kind of issues you run into when your app is ai coded you built one thing and kill something else.

You have too many and the wrong benchmarks

One of Anthropic's ostensive ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality whether they're worried about grandiose/catastrophic predictions or not.

But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.

Did they not address how adaptive thinking has played in to all of this?

An interesting question to wonder is why these optimizations were pushed so aggressively in the first place. Especially given this is the time they were running a 2x promotion, by themselves, without presumably seeing any slowdown in demand.

Appreciate the honesty from the team.

At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.

> On April 16, we added a system prompt instruction to reduce verbosity.

What verbosity? Most of the time I don’t know what it’s doing.

> All three issues have now been resolved as of April 20 (v2.1.116).

The latest in homebrew is 2.1.108 so not fixed, and I don't see opus 4.7 on the models list... Is homebrew a second class citizen, or am I in the B group?

To think we'd have known about this in advance if they'd just have open sourced Claude Code, rather than them being forced into this embarrassing post mortem. Sunlight is the best disinfectant.

had this happen to me mid-refactor and spent 20 min wondering if I'd gone crazy. honestly the one hour threshold feels pretty arbitrary, sometimes you just step away to think

something i note from this is that this is not a model weights change, but it is a hidden state change anthropic is doing to the outputs that can tune the quality and down on the "model" without breaking the "we arent changing the model" promise.

how often do these changes happen?

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay high price for their service and receive lower quality than expected.

As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is a tool at the end of the day. Today the models are in a place where the context management, reasoning effort, etc. all needs to be very stable to work well.

The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?

It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.

Did they not address how adaptive thinking has played in to all of this?

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."

I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.

I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.

In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.

Check that you’re running the latest version.

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.

Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.

The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.

so who do you trust and go to? (NotClearlySo)OpenAI?

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Claude caveman in the system prompt confirmed?

I've recently been introduced to that plugin, love it for humour

YES... DO IT... FRICKING MACHINE..

https://marginlab.ai/ (no affiliation)

There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.

One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt

Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.

1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)

2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

3. System prompt to make Claude less verbose reducing coding quality (4 days - better)

All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.

Doing this proactively would certainly match expectations for a fast-moving product like this.

> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!

Seems like a very basic software engineering error that would be caught by normal unit testing.

To be fair to Anthropic, they did not intentionally degrade performance.

To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.

None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.

> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).

However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.

The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?

so who do you trust and go to? (NotClearlySo)OpenAI?

I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.

I'm using Zed and Claude Code as my harnesses.

At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.

However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode

Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.

To have some confidence in consistency of results (p-value), one has to start from cohort of around 30, if I remember correctly. This is 1.5 orders of magnitude increase of computing power needed to find (absence of) consistent changes of agent's behavior.

I've recently been introduced to that plugin, love it for humour

It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:

> Next steps are to run `cat /path/to/file` to see what the contents are

Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).

That and "Auto" mode really are grinding my gears recently. Now, after a Planing session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode is when it gives up/gets stuck more often.

Just the other day it was in Auto mode (by accident) and I told it:

> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.

And it got stuck in some loop/dead-end with telling I should do it and it didn't want to run commands out on a "Shared Dev server" (which I had specifically told it that this was not a shared server).

The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.

Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".

If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or 100$ per 1M pitput tokens (playing numberWang here, but the point stands).

If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:

1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that

2. Dumb the models down (basically decreasing their cost per token)

3. Send less tokens (ie capping thinking budgets aggressively).

2 and 3 are palatable because, even if they annoying the technical crowd, investors still see a big number of active users with a positive margin for each.

I would love if agents would act way more like tools/machines and NOT try to act as if they were humans

https://marginlab.ai/ (no affiliation)

There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.

> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!

Seems like a very basic software engineering error that would be caught by normal unit testing.

To be fair to Anthropic, they did not intentionally degrade performance.

To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.

None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.

Sure, but it gives the impression of degraded model performance. Especially when the interface is still saying the model is operating on "high", the same as it did yesterday, yet it is in "medium" -- it just looks like the model got hobbled.

> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.

Model performance at inference in a data center v.s. stripping thinking tokens are effectively the same.

Sure they didn't change the GPUs their running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded.

In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.

Claude code is not infra, the model is the infra. They changed settings to make their models faster and probably cheaper to run too. Honestly with adaptive thinking it no longer matters what model it is if you can dynamically make it do less or more work.

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay high price for their service and receive lower quality than expected.

At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.

I pay for 20x max and get so much more value out of it than I pay.

It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business.

But if a tool is better, it's better.

What's the alternative? Are you suggesting other LLM providers don't charge high price? Or that they don't make mistakes? Or that they provide better quality?

We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?

> It’s incredible how forgiving you guys are with Anthropic and their errors.

Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.

I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.

The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.

That said, there is now much better competition with Codex, so there's only so much rope they have now.

Because it is still good though.

If you have a good product, you are more understanding. And getting worse doesn't mean its no longer valuable, only that the price/value factor went down. But Opus 4.5 was relevant better and only came out in November.

There was no price increase at that time so for the same money we get better models. Opus 4.6 again feels relevant better though.

Also moving fastish means having more/better models faster.

I do know plenty of people though which do use opencode or pi and openrouter and switching models a lot more often.

At the time you wrote your comment there were 4 other comments and all of them very negative towards the Anthropic and the blog post in question here. How did you get this conclusions?

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.

Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released

I pay for 20x max and get so much more value out of it than I pay.

That said, there is now much better competition with Codex, so there's only so much rope they have now.

Because it is still good though.

There was no price increase at that time so for the same money we get better models. Opus 4.6 again feels relevant better though.

Also moving fastish means having more/better models faster.

I do know plenty of people though which do use opencode or pi and openrouter and switching models a lot more often.

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.

Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released

The best possible situation that I can imagine is that Anthropic just wanted to measure how much value does Claude Code have for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.

I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.

I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.

I'm using Zed and Claude Code as my harnesses.

At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.

However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode

Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.

I like chutes because they always use the full weights, and prompts are encrypted with TEE.

I think you could alter the prompt in subtle ways; a period goes to an ellipses, extra commas, synonyms, occasional double-spaces, etc.

Enough that the prompt is different at a token-level, but not enough that the meaning changes.

It would be very difficult for them to catch that, especially if the prompts were not made public.

Run the variations enough times per day, and you'd get some statistical significance.

The guess the fuzzy part is judging the output.

Yes, but for many users, CC is the product. Especially since I'm not allowed(?) to use my own harness with my sub.

Model performance at inference in a data center v.s. stripping thinking tokens are effectively the same.

Sure they didn't change the GPUs their running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded.

In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

Everything else aside, their brief "experiment" with removing CC support from the Pro plan got me seriously considering other options. I've been wary of vendor lock-in the whole time, but it was a useful reminder. (opencode+openrouter will probably be my first port of call)

never ever forget theo's gpt 5 hype video and then him having to walk it back.

its very clear that theres money or influence exchanging hands behind the scenes with certain content creators, the information, and openai.

literally just `git reset --hard <random hash from 3 months ago>` would fix this

A valid choice, a moral choice, is none of the above.

But if a tool is better, it's better.

You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness

At the time you wrote your comment there were 4 other comments and all of them very negative towards the Anthropic and the blog post in question here. How did you get this conclusions?

Confused as well, I rather supposed Antrophic had some standing for saying no to Trump and being declared national security threat, but the anger they got and people leaving to OpenAI again, who gladly said yes to autonomous killing AI did astonish me a bit. And I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me - and I only pay for the pro plan.

HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.