Statistically, it's likely the dip occurred at a point that wasn't too important, right? But what happens if the idiocy comes out at a critical point?
Kind of reminds me of the two alternate ways that time travel works in sci-fi. Does the small change to the past explode like a fission reaction, or does history heal itself?
Anywho, if errors do accumulate, I can see being very pissed off even with temporary idiocy from the model, as it means it poisons the context for the entire rest of the conversation.
Calling the platforms A, B and C might help provide us the insight we're missing to spot incongruous behaviors faster than trying to aggregate more generalized feedback.
According to reports, users did not stop coming back even when the app was broken for hours.
A similar thing happened to me when playing some initial version of The Binding of Isaac on Linux, when it was made with Flash. Its performance wasn't the best but I couldn't stop playing.
So if people still return, maybe Anthropic has something great going on with Claude Code.
[1]: https://www.theguardian.com/technology/2016/jan/05/facebook-...
Although unit testing an entire LLM is not really feasible right now, all these bugs were in small deterministic parts of the system. Load balancing, top-k probability calculations and so on are all engineered parts no different to other software, and should in principle all be unit testable. At most you need an injectable PRNG. Yes, non-deterministic optimization bugs are awful but I've personally found compiler and database bugs in the past using just regular app test suites. With CI you get a lot of runs so rare events can still surface as long as you investigate flakes. One of my current projects runs every unit test in the same process in parallel, which has proven an excellent and cheap strategy for flushing out rare thread safety issues and database deadlocks.
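As a sketch of what that looks like in practice (hypothetical names, not Anthropic's code): a sampling helper that takes its PRNG as a parameter can be pinned down by an ordinary unit test.

```python
import random

def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick a token from a probability distribution using an injected PRNG."""
    tokens, weights = zip(*sorted(probs.items()))
    return rng.choices(tokens, weights=weights, k=1)[0]

def test_sampling_is_reproducible_with_fixed_seed():
    probs = {"hello": 0.7, "world": 0.2, "oops": 0.1}
    # Injecting the PRNG makes the "random" part deterministic under test,
    # so we can assert exact outcomes instead of just types.
    assert sample_token(probs, random.Random(42)) == sample_token(probs, random.Random(42))

def test_dominant_token_wins_most_of_the_time():
    rng = random.Random(0)
    hits = sum(sample_token({"a": 0.99, "b": 0.01}, rng) == "a" for _ in range(1_000))
    assert hits > 900  # statistical check, but fully repeatable given the seed
```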
A few days ago I commented on a thread about the Java launch that people often feel Java is "enterprisey" compared to Python because Java code is typically written to be heavily unit testable. A lot of abstraction is driven by the desire for dependency injection, for example. I contrasted that to scripting language culture where I've found testing is often either missing or kinda surface level (e.g. mostly just asserting on types).
When I was learning PyTorch a few years ago I noticed the same thing. The tutorials took you from simple to complex stuff without talking much about how you test or best structure the code. This makes sense for ML research where you don't have a clear goal and success boils down to maxing a score in some kind of eval, but it doesn't make sense for production deployment at scale.
I wonder if the AI labs could use more people with SRE and HA SWE background to focus on things like this. I'm kinda skeptical that more aggressive rolling evals-in-prod are the best way to avoid bugs like these happening again.
> Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback.
Ok makes sense and glad to hear
> It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code
Ok makes sense and I’d expect that a human can then see the context in that case although I hope it is still very explicit to the end user (I’m not a Claude Code user so I cannot comment)
> or you can use the "thumbs down" button in the Claude apps to do so
This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.
Interesting, this implies that the 1M context servers perform worse at low context. Perhaps this is due to some KV cache compression, eviction, or sparse attention scheme being applied on these 1M context servers?
Despite this I'm still a paying customer because Claude is a fantastic product and I get a lot of value from it. After trying the API it became a no brainer to buy a 20x Max membership. The amount of stuff I've gotten done with Claude has been awesome.
The last several weeks have strongly made me question my subscription. I appreciate the openness of this post, but as a customer I'm not happy.
I don't trust that these issues are all discovered and resolved yet, especially the load balancing ones. At least anecdotally I notice that around 12 ET (9AM pacific) my Claude Code sessions noticeably drop in quality. Again, I hope the team is able to continue finding and fixing these issues. Even running local models on my own machine at home I run into complicated bugs all the time — I won't pretend these are easy problems, they are difficult to find and fix.
Can anyone explain to a layperson how this sort of thing is even possible for an LLM?
For normal code, of course stupid bugs happen all the time. You accidentally introduce an off-by-one error in a conditional, for example, or add an extra `goto fail`.
But LLMs aren't written by humans! Models are trained by automated programs over a period of many months across unfathomably massive data centers.
How would a human introduce a bug like the one described in TFA?
Matches my experience. I use CC through our enterprise Vertex AI account and never noticed any degradation.
In general it seems like these bugs, while serious, were substantially less prevalent than anecdotal online reports would have you believe. We are really talking about a ~1-2 week window here where most issues were concentrated, with a relatively small percentage of total requests and total users impacted.
[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
Was it 1% worse / unnoticeable? Did it become useless? The engineering is interesting but I'd like to see it tied to actual impact
I know I'll probably get push back on this, but it left a sour taste in my mouth when I paid for a $200 sub that felt like it was less useful than ChatGPT Plus ($20) at times.
Or to summarize: [south park "we're sorry" gif]
In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.
The Three Issues

1. *Context window routing error*
   - Short-context requests sometimes routed to long-context servers.
   - Started small, worsened after load-balancing changes.

2. *Output corruption*
   - TPU misconfigurations led to weird outputs (wrong language, syntax errors).
   - Runtime optimizations wrongly boosted improbable tokens.

3. *Approximate top-k miscompilation*
   - A compiler bug in the TPU/XLA stack corrupted token probability selection.
   - Occasionally dropped the true top token.

Why It Was Hard to Detect

- Bugs were subtle, intermittent, and platform-dependent.
- Benchmarks missed these degradations.
- Privacy/safety rules limited access to real user data for debugging.

Fixes and Next Steps

- More sensitive, continuous evals on production.
- Better tools to debug user feedback safely.
- Stronger validation of routing, output correctness, and token selection.
Claude Code has reached more than $500M in ARR[1] at only about 9 months old, and 30% of Claude Code users were impacted at least once just from the first routing bug. Scary stuff.
Their post mortem is basically "evaluations are hard, we relied on vibe checking, now we are going to do even more frequent vibe checking". I believe it was indeed unintentional, but in a future where investors' money won't keep falling from the sky, serving distilled models will be very tempting. And you can't hold them to any SLA currently; it's just vibes. I wonder how enterprise vendors are going to deal with this going forward, given that quality can be degraded without the client or vendor even being able to really prove it.
[1]: https://www.anthropic.com/news/anthropic-raises-series-f-at-...
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
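For what it's worth, here's roughly what that looks like with a Hugging Face-style config (a sketch only; the model name, factor, and output path are illustrative, not a recommendation):

```python
from transformers import AutoConfig

# Illustrative only: choose the factor so that
# original_max_position_embeddings * factor covers your longest expected inputs.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # hypothetical model choice
config.rope_scaling = {
    "rope_type": "yarn",   # static YaRN: the factor applies regardless of input length
    "factor": 2.0,
    "original_max_position_embeddings": config.max_position_embeddings,
}
config.save_pretrained("./long-context-config")
```

As the quote warns, leaving this on for short prompts can hurt quality, so it's only worth enabling when you actually need the extended window.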
A Google search isn't deterministic. Neither is loading an upvote count on social media.
It's common advice in distributed systems to have a graceful degradation state instead of becoming unavailable. That wouldn't be possible in a system that's completely deterministic.
So to be fair, you are getting exactly what you paid for - a non-deterministic set of generated responses of varying quality and accuracy.
https://www.reddit.com/r/singularity/comments/1khxwjh/claude...
“I refuse to believe what the people who would know the best said, for no real reason except that it doesn’t feel right” isn’t exactly the level of considered response we’re hoping for here on HN. :)
Do their ToS really limit access to user data (prompt/response)? I don't remember seeing anything to that effect in their terms.
Layered in self-aggrandizement. You host a service, people give you money.
> I’m pretty surprised that Anthropic can directly impact the infra for AWS Bedrock as this article suggests.
We don't directly manage AWS Bedrock deployments today, those are managed by AWS.
> I can’t imagine the average person equates hitting this button with forfeiting their privacy.
We specify
> Submitting this report will send the entire current conversation to Anthropic for future improvements to our models.
in the thumbs down modal. Is there a straightforward way to improve this copy?
Even more than that, AI tends to mock _everything_. Mocking is useful, but the more real code a unit test invokes, the better, because the risk isn't only in the code itself but in its interactions, the interface. Yet AI-generated Python tests will mock so heavily they barely test even the code itself, full of tautological assertions.
I've prompted with heavy warnings against mocking and pointed directly at thorough tests as examples. FWIW, Python does have excellent tools for injection, and you can write really nicely structured code with it.
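A toy contrast of the two styles (made-up function, not from any real codebase):

```python
from unittest.mock import patch

def normalize_scores(scores):
    total = sum(scores)
    return [s / total for s in scores]

# Tautological: the assertion only checks that the mock returns what we told it to.
@patch(__name__ + ".normalize_scores", return_value=[1.0])
def test_normalize_scores_mocked(mock_norm):
    assert normalize_scores([3.0]) == [1.0]   # passes even if the real function is broken

# Better: run the real code and assert a property it must actually satisfy.
def test_normalize_scores_sums_to_one():
    result = normalize_scores([2.0, 3.0, 5.0])
    assert abs(sum(result) - 1.0) < 1e-9
```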
When you click "thumbs down" you get the message "Submitting this report will send the entire current conversation to Anthropic for future improvements to our models." before you submit the report, I'd consider that pretty explicit.
If you trust this OpenRouter data the uptime record of these APIs is... not good to say the least: https://openrouter.ai/openai/gpt-5/uptime
It's clear to me that every provider is having enormous scale challenges. Claude Code often slows to a crawl and I have to interrupt it and tell it to try again.
This is especially pronounced around 4-6pm UK time (when we have Europe, Eastern US and West Coast US all hammering it).
Even today I was getting 503 errors from Gemini AI studio with model overloaded at that time, nothing on status page.
I really wonder if it would be worth Claude et al offering a cheaper off peak plan, to try and level out demand. Perhaps the optics of that don't look good though.
Edit to add: I think another potential dimension to this is GB200s have been a lot slower to come on stream than probably the industry expected. There's been a lot of defects with various hardware and software components and I suspect the liquid cooling has been difficult to get right (with far more catastrophic failure states!).
Doesn't that say it all? At this point the quality of the AI trumps reliability for the customer (you and me), so even though of course they should (and I'm sure will) focus on it, why would they prioritise reliability over model quality right now?
At this point, I'd be surprised if the different vendors on openrouter weren't abusing their trust by silently dropping context/changing quantization levels/reducing experts - or other mischievous means of delivering the same model at lower compute.
e.g. S3 has encountered increased error rates many times without reporting it. No one says anything about S3.
People will say many things, but their behaviour is to reward the lie. Every growth hack startup guy knows this already.
As you discuss, training happens over a long period of time in a (mostly) hands-off fashion once it starts.
But inference? That’s a separate process which uses the trained model to generate responses, and it’s a runtime process - send a prompt, inference runs, response comes back. That’s a whole separate software stack, and one that is constantly being updated to improve performance.
It’s in the inference process where these issues were produced.
> to have a graceful degradation state instead of becoming unavailable. That wouldn't be possible in a system that's completely deterministic.
What does this even mean? I see no incompatibility between determinism and your ability to perform the same function more slowly. Determinism just means that the output of the system is solely dependent on the inputs - feed the same inputs get the same outputs. If by degraded state you’re intentionally choosing to change your inputs, that doesn’t change the determinism of your system.
When people say LLMs aren't deterministic, it's because the output token depends not just on the input context but on all the other contexts processed in the same batch, because the kernels are written non-deterministically. If the kernels were written deterministically (so that the output depended only on your input context), there wouldn't be a problem, and it wouldn't change the system's ability to degrade gracefully; it would be deterministic because capturing the input context and random seed would be sufficient. As it stands, you'd have to capture the interim state of every other input being processed in the same batch, and that interim-state problem is what makes it non-deterministic.
As for Google search, it’s not clear to me it’s non-deterministic. When you Google the exact same thing twice you get exactly the same page of results and selected snippets. That suggests there’s more determinism in the system than you’re giving it credit for.
There's a thousand and one reasons why a company valued in the billions, with the eyes of the world watching, would not be completely honest in their public response.
We're firmly in the realm of 'this thing is kind of smarter/faster at a task than me or my employees, so I am contracting it to do that task'.
That doesn't mean 'if it fails, no payment'.
But I think it's too analogous to non-tech-products to hide behind a 'no refunds' policy. It's that good - there are consequences for it, I think.
It's a material difference in the product, not just "a bug."
That was my understanding before this article. But the article is pretty clear that these were "infrastructure bugs" and the one related to AWS Bedrock specifically says it was because "requests were misrouted to servers". If Anthropic doesn't manage the AWS Bedrock deployments, how could it be impacting the load balancer?
It seems there's a couple of dependency injection frameworks but they're clones of what's found in Java, right down to the type names. One of them even calls injectable objects beans! (Rhazes)
[1] Here is an example of two common approaches: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
> Approximately 30% of Claude Code users had at least one message routed to the wrong server type, resulting in degraded responses.
> However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.
30% of Claude Code users getting a degraded response is a huge bug.
imho there's a big market gap for companies that are truly honest with customers instead of corporate gaslighting
I've honestly gotten the best results in creative writing by ignoring top_k/top_p and simply tuning temperature. Restricting the output to only common words makes everything feel generic. But DeepSeek constantly breaks into Chinese/gibberish/ZALGO! when I push temperature as high as 1.14.
This isn't related to the "recent issues" but I feel like it's useful advice for anyone trying out AI story creation.
I do think an independent service status monitor might be an easier stop-gap and could serve to improve honesty. It's not trivial, though.
the blog explains what issues they had and how they fixed them. this is good enough.
I would have appreciated if they had released the full distribution of impact though.
My criticism is it's 'puffy'. The 'scope and complexity' for a public postmortem is 'customer-facing'. Otherwise it's a tree/forest scenario.
One might say 'the lady doth protest too much'; this should be routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your decade.
They don't give an upper bound though. 30% had at least one message degraded. Some proportion of that 30% (maybe most of them?) had some larger proportion of their messages (maybe most of them?) degraded. That matters, and presumably the reason we're not given those numbers is that they're bad.
How many users forget they have a sub? How many get a sub through work and don't use it often?
I'd bet a large number tbh based on other subscription services.
Regardless of whether it’s to save money, it’s purposefully inaccurate:
“When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution.”
I think the reason for this is that if you always chose the single most probable next word, you might actually end up with the wrong answer and/or get stuck in a loop.
They could sandbag their quality or rate limit, and I know they will rate limit because I’ve seen it. But, this is a race. It’s not like Microsoft being able to take in the money for years because people will keep buying Windows. AI companies can try to offer cheap service to government and college students, but brand loyalty is less important than selecting the smarter AI to help you.
No, it's just the definition of sampling at non-zero temperature. You can set T=0 to always get the most likely token. Temperature trades off consistency for variety. You can set T to zero in the API; I assume the defaults for Claude Code and their chat are nonzero.
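A bare-bones sketch of what temperature does during sampling (plain Python, not any provider's actual decoder): divide the logits by T before the softmax, and T = 0 degenerates to picking the argmax.

```python
import math
import random

def sample(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    if temperature == 0:
        return max(logits, key=logits.get)          # greedy: always the most likely token
    scaled = {tok: v / temperature for tok, v in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}  # numerically stable softmax
    total = sum(exps.values())
    tokens, weights = zip(*((tok, e / total) for tok, e in exps.items()))
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
logits = {"the": 3.2, "a": 2.9, "banana": 0.1}
print(sample(logits, 0.0, rng))   # always "the"
print(sample(logits, 1.0, rng))   # usually "the", sometimes "a", rarely "banana"
```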
You are absolutely right! Greedy decoding does exactly that for longer seqs: https://huggingface.co/docs/transformers/generation_strategi...
Interestingly DeepSeek recommends a temperature of 0 for math/coding, effectively greedy.
Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We've now resolved these issues and want to explain what happened.
In early August, a number of users began reporting degraded responses from Claude. These initial reports were difficult to distinguish from normal variation in user feedback. By late August, the increasing frequency and persistence of these reports prompted us to open an investigation that led us to uncover three separate infrastructure bugs.
To state it plainly: We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.
We recognize users expect consistent quality from Claude, and we maintain an extremely high bar for ensuring infrastructure changes don't affect model outputs. In these recent incidents, we didn't meet that bar. The following postmortem explains what went wrong, why detection and resolution took longer than we would have wanted, and what we're changing to prevent similar future incidents.
We don't typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.
We serve Claude to millions of users via our first-party API, Amazon Bedrock, and Google Cloud's Vertex AI. We deploy Claude across multiple hardware platforms, namely AWS Trainium, NVIDIA GPUs, and Google TPUs. This approach provides the capacity and geographic distribution necessary to serve users worldwide.
Each hardware platform has different characteristics and requires specific optimizations. Despite these variations, we have strict equivalence standards for model implementations. Our aim is that users should get the same quality responses regardless of which platform serves their request. This complexity means that any infrastructure change requires careful validation across all platforms and configurations.
Illustrative timeline of events on the Claude API. Yellow: issue detected, Red: degradation worsened, Green: fix deployed.
The overlapping nature of these bugs made diagnosis particularly challenging. The first bug was introduced on August 5, affecting approximately 0.8% of requests made to Sonnet 4. Two more bugs arose from deployments on August 25 and 26.
Although initial impacts were limited, a load balancing change on August 29 started to increase affected traffic. This caused many more users to experience issues while others continued to see normal performance, creating confusing and contradictory reports.
Below we describe the three bugs that caused the degradation, when they occurred, and how we resolved them:
On August 5, some Sonnet 4 requests were misrouted to servers configured for the upcoming 1M token context window. This bug initially affected 0.8% of requests. On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst impacted hour on August 31, 16% of Sonnet 4 requests were affected.
Approximately 30% of Claude Code users who made requests during this period had at least one message routed to the wrong server type, resulting in degraded responses. On Amazon Bedrock, misrouted traffic peaked at 0.18% of all Sonnet 4 requests from August 12. Incorrect routing affected less than 0.0004% of requests on Google Cloud's Vertex AI between August 27 and September 16.
However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.
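As a generic illustration of why stickiness amplifies a routing bug (this is not Anthropic's actual routing code), sticky routing is often implemented by hashing a stable conversation key to a server, which keeps a conversation pinned to whichever server it first landed on:

```python
import hashlib

def pick_server(conversation_id: str, servers: list[str]) -> str:
    """Generic sticky routing: the same conversation always maps to the same server."""
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# If one entry in the pool is mislabeled (e.g. a 1M-context server placed in the
# short-context pool), every follow-up in an unlucky conversation keeps hitting
# that same wrong server.
pool = ["short-ctx-1", "short-ctx-2", "long-ctx-1 (misrouted)"]
print(pick_server("conversation-42", pool))
print(pick_server("conversation-42", pool))  # same server again: "sticky"
```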
Resolution: We fixed the routing logic to ensure short- and long-context requests were directed to the correct server pools. We deployed the fix on September 4. A rollout to our first-party platforms and Google Cloud’s Vertex was completed by September 16. The fix is in the process of being rolled out on Bedrock.
On August 25, we deployed a misconfiguration to the Claude API TPU servers that caused an error during token generation. An issue caused by a runtime performance optimization occasionally assigned a high probability to tokens that should rarely be produced given the context, for example producing Thai or Chinese characters in response to English prompts, or producing obvious syntax errors in code. A small subset of users that asked a question in English might have seen "สวัสดี" in the middle of the response, for example.
This corruption affected requests made to Opus 4.1 and Opus 4 on August 25-28, and requests to Sonnet 4 August 25–September 2. Third-party platforms were not affected by this issue.
Resolution: We identified the issue and rolled back the change on September 2. We've added detection tests for unexpected character outputs to our deployment process.
On August 25, we deployed code to improve how Claude selects tokens during text generation. This change inadvertently triggered a latent bug in the XLA:TPU[1] compiler, which has been confirmed to affect requests to Claude Haiku 3.5.
We also believe this could have impacted a subset of Sonnet 4 and Opus 3 on the Claude API. Third-party platforms were not affected by this issue.
Resolution: We first observed the bug affecting Haiku 3.5 and rolled it back on September 4. We later noticed user reports of problems with Opus 3 that were compatible with this bug, and rolled it back on September 12. After extensive investigation we were unable to reproduce this bug on Sonnet 4 but decided to also roll it back out of an abundance of caution.
Simultaneously, we have (a) been working with the XLA:TPU team on a fix for the compiler bug and (b) rolled out a fix to use exact top-k with enhanced precision. For details, see the deep dive below.
To illustrate the complexity of these issues, here's how the XLA compiler bug manifested and why it proved particularly challenging to diagnose.
When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution. We use "top-p sampling" to avoid nonsensical outputs—only considering words whose cumulative probability reaches a threshold (typically 0.99 or 0.999). On TPUs, our models run across multiple chips, with probability calculations happening in different locations. To sort these probabilities, we need to coordinate data between chips, which is complex.[2]
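Conceptually, top-p sampling looks something like the following single-machine sketch (the production version is a distributed, vectorized variant of this across TPU chips; the values here are contrived):

```python
import random

def top_p_sample(probs: dict[str, float], p: float, rng: random.Random) -> str:
    # Keep the smallest set of highest-probability tokens whose cumulative
    # probability reaches p, then renormalize and sample from that "nucleus".
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"fix": 0.90, "debug": 0.08, "banana": 0.015, "zzz": 0.005}
print(top_p_sample(probs, 0.99, random.Random(1)))
# "zzz" falls outside the 0.99 nucleus and can never be sampled; if a bug drops
# the true top token before this step, the nucleus itself is built from the wrong set.
```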
In December 2024, we discovered our TPU implementation would occasionally drop the most probable token when temperature was zero. We deployed a workaround to fix this case.
Code snippet of a December 2024 patch to work around the unexpected dropped token bug when temperature = 0.
The root cause involved mixed precision arithmetic. Our models compute next-token probabilities in bf16 (16-bit floating point). However, the vector processor is fp32-native, so the TPU compiler (XLA) can optimize runtime by converting some operations to fp32 (32-bit). This optimization pass is guarded by the xla_allow_excess_precision flag, which defaults to true.
This caused a mismatch: operations that should have agreed on the highest probability token were running at different precision levels. The precision mismatch meant they didn't agree on which token had the highest probability. This caused the highest probability token to sometimes disappear from consideration entirely.
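A contrived toy example of that kind of disagreement (in JAX, not the production code): two logits that are distinct in fp32 collapse to the same bf16 value, so an argmax taken at different precisions can name different winners.

```python
import jax.numpy as jnp

logits_fp32 = jnp.array([10.0001, 10.0002, 1.0], dtype=jnp.float32)
logits_bf16 = logits_fp32.astype(jnp.bfloat16)

print(int(jnp.argmax(logits_fp32)))  # 1: fp32 can still tell the first two values apart
print(int(jnp.argmax(logits_bf16)))  # likely 0: both round to 10.0 in bf16, so the
                                     # tie resolves to the first index instead
print(logits_bf16)                   # [10, 10, 1] in bfloat16
```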
On August 26, we deployed a rewrite of our sampling code to fix the precision issues and improve how we handled probabilities at the limit that reach the top-p threshold. But in fixing these problems, we exposed a trickier one.
Code snippet showing a minimized reproducer merged as part of the August 11 change that root-caused the "bug" being worked around in December 2024. In reality, it's expected behavior of the xla_allow_excess_precision flag.
Our fix removed the December workaround because we believed we'd solved the root cause. This led to a deeper bug in the approximate top-k operation—a performance optimization that quickly finds the highest probability tokens.[3] This approximation sometimes returned completely wrong results, but only for certain batch sizes and model configurations. The December workaround had been inadvertently masking this problem.
Reproducer of the underlying approximate top-k bug shared with the XLA:TPU engineers who developed the algorithm. The code returns correct results when run on CPUs.
The bug's behavior was frustratingly inconsistent. It changed depending on unrelated factors such as what operations ran before or after it, and whether debugging tools were enabled. The same prompt might work perfectly on one request and fail on the next.
While investigating, we also discovered that the exact top-k operation no longer had the prohibitive performance penalty it once did. We switched from approximate to exact top-k and standardized some additional operations on fp32 precision.[4] Model quality is non-negotiable, so we accepted the minor efficiency impact.
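For reference, JAX exposes both the exact and approximate operations, and the trade-off looks roughly like this (illustrative sizes; on CPU the approximation typically matches the exact result, consistent with the reproducer above behaving correctly on CPUs):

```python
import jax
import jax.numpy as jnp

logits = jax.random.normal(jax.random.PRNGKey(0), (50_000,))

# Exact top-k: guaranteed to contain the true highest-scoring entries.
exact_vals, exact_idx = jax.lax.top_k(logits, k=64)

# Approximate top-k: faster on TPU, but it only targets a recall (here 95%),
# so it is allowed to miss some of the true top entries.
approx_vals, approx_idx = jax.lax.approx_max_k(logits, k=64, recall_target=0.95)

missed = set(exact_idx.tolist()) - set(approx_idx.tolist())
print(f"true top-64 entries missed by the approximation: {len(missed)}")
```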
Our validation process ordinarily relies on benchmarks alongside safety evaluations and performance metrics. Engineering teams perform spot checks and deploy to small "canary" groups first.
These issues exposed critical gaps that we should have identified earlier. The evaluations we ran simply didn't capture the degradation users were reporting, in part because Claude often recovers well from isolated mistakes. Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback. This protects user privacy but prevents engineers from examining the problematic interactions needed to identify or reproduce bugs.
Each bug produced different symptoms on different platforms at different rates. This created a confusing mix of reports that didn't point to any single cause. It looked like random, inconsistent degradation.
More fundamentally, we relied too heavily on noisy evaluations. Although we were aware of an increase in reports online, we lacked a clear way to connect these to each of our recent changes. When negative reports spiked on August 29, we didn't immediately make the connection to an otherwise standard load balancing change.
As we continue to improve our infrastructure, we're also improving the way we evaluate and prevent bugs like those discussed above across all platforms where we serve Claude. Here's what we're changing:
Evals and monitoring are important. But these incidents have shown that we also need continuous signal from users when responses from Claude aren't up to the usual standard. Reports of specific changes observed, examples of unexpected behavior encountered, and patterns across different use cases all helped us isolate the issues.
It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code, or you can use the "thumbs down" button in the Claude apps to do so. Developers and researchers often create new and interesting ways to evaluate model quality that complement our internal testing. If you'd like to share yours, reach out to feedback@anthropic.com.
We remain grateful to our community for these contributions.
[1] XLA:TPU is the optimizing compiler that translates XLA High Level Optimizing language—often written using JAX—to TPU machine instructions.
[2] Our models are too large for single chips and are partitioned across tens of chips or more, making our sorting operation a distributed sort. TPUs (just like GPUs and Trainium) also have different performance characteristics than CPUs, requiring different implementation techniques using vectorized operations instead of serial algorithms.
[3] We had been using this approximate operation because it yielded substantial performance improvements. The approximation works by accepting potential inaccuracies in the lowest probability tokens, which shouldn't affect quality—except when the bug caused it to drop the highest probability token instead.
[4] Note that the now-correct top-k implementation may result in slight differences in the inclusion of tokens near the top-p threshold, and in rare cases users may benefit from re-tuning their choice of top-p.
I typically read corporate posts as cynically as possible, since it's so common to word things in any way to make the company look better.
Glad to see an outlier!
Unless you consider service responsiveness a factor of integrity. Still waiting on a service message reply from the third week of May. I'm sure it's right around the corner though.