- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
Coherence requires two opposing forces to hold it along one dimension, and at least three of them in higher-dimensional notions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that raising the reasoning threshold resulted in less coherence: more experimentation before we hit a dead end and turned around.
So we got better results using Haiku (failing over to Sonnet) rather than Opus, and using a higher-reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better because they don't second-guess their approach: they either fail or they succeed, and they aren't as tenacious as the higher-cost models.
We can escalate to a higher authority and get out of that mess faster if we fail hard and early.
Knowing exactly how a failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.
Splitting up the tactical and strategic sides of the problem seems to work, much the way generals don't carry rifles in a war.
LLMs aren’t constrained to linear logic like your average human.
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
However, I think producing a detailed enough specification requires the same or even more work than writing the code. We write rough specifications and clarify them during the process of coding. There is a minimum effort required to produce these specifications, and AI will not help you speed that up.
The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.
The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.
Smaller prompts and fewer tools tend to be more stable. I try to stay within 1,000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models; it's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend that it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub-agent every time you read a file, you can reduce the impact on the caller's token budget by orders of magnitude: the callee can return a brief summary or a yes/no response after reading 500 kB of source. This applies at each level of recursion and can compound exponentially over just a few nested calls.
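A minimal sketch of that pattern in Python. The `spawn_subagent` helper is hypothetical and stands in for whatever your agent framework uses to run one isolated inference pass and return only its final message; the token figures in the comments are illustrative.

```python
# Sketch of the "sub-agent per file read" pattern: large file contents stay in
# a throwaway sub-agent's context, and only a short verdict reaches the caller.
from pathlib import Path


def spawn_subagent(prompt: str) -> str:
    """Placeholder for an isolated model call; wire this to your own stack."""
    return "yes (stub answer)"  # replace with a real model/framework call


def check_file(path: str, question: str) -> str:
    """Let a sub-agent read a (possibly huge) file and answer one question.

    The caller only ever sees the short answer, so a 500 kB source file costs
    the parent context a handful of tokens instead of hundreds of thousands.
    The same trick nests: a sub-agent can delegate its own reads further down.
    """
    source = Path(path).read_text()
    prompt = (
        f"File {path}:\n\n{source}\n\n"
        f"Question: {question}\n"
        "Answer with one short sentence, or just yes/no."
    )
    return spawn_subagent(prompt)
```

In use, the parent agent's context grows by roughly one line per file checked, regardless of how large the files are.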
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming a better Bayesian, which implies being great at calibrating confidence intervals over one's interpretations and reasoning, and ultimately gaining a superhuman ability to evaluate the bounds of ambiguity and risk.
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.
Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.
The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.
One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.
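A toy Monte Carlo of that trade-off, with made-up success probabilities rather than measured ones: one monolithic, high-variance call versus a chain of better-specified subtask calls.

```python
# Toy comparison: one big inference call vs. a composition of smaller,
# well-specified subtask calls. Success probabilities are illustrative only.
import random

random.seed(0)
TRIALS = 10_000


def monolithic() -> bool:
    # One shot at the whole task; assumed to succeed 60% of the time.
    return random.random() < 0.60


def decomposed(n_subtasks: int = 4, p_sub: float = 0.93) -> bool:
    # Task succeeds only if every well-specified subtask succeeds.
    return all(random.random() < p_sub for _ in range(n_subtasks))


for name, fn in [("monolithic", monolithic), ("decomposed", decomposed)]:
    outcomes = [fn() for _ in range(TRIALS)]
    p = sum(outcomes) / TRIALS
    print(f"{name:11s} success={p:.3f} outcome variance={p * (1 - p):.3f}")
```

Under these assumed numbers the decomposed version both succeeds more often and has lower outcome variance, at the cost of more orchestration.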
Language models are probabilistic, not deterministic. Therefore incoherence _by definition_ increases as a response becomes lengthier. This is not true for humans, who tend to act and communicate deterministically. If I ask a human to read a PDF and then ask "does the word 'paperclip' appear in this PDF?", they will deterministically give a yes/no answer, and no matter how many times we repeat the process they will give the same answer consistently (and not because of autocorrelation, since this can be done across different humans). An LM's response is probabilistic and depends on the training itself: with a very well-trained model we might get a 99% reliable outcome, which means that out of 100 runs it will give the wrong answer once. We have little insight into this probabilistic component for LMs, but simulations could be done to research it; a toy version is sketched below. I would also be very curious about autocorrelation in models. If a human does a task and concludes "yes", they will keep responding "yes" to the same task, just with an increasing amount of eye-rolling.
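A toy version of such a simulation, with the per-call accuracy as an assumed parameter rather than a measured property of any real model:

```python
# Toy simulation of the "paperclip in the PDF?" example: a model that answers
# correctly with probability p per call, queried repeatedly.
import random

random.seed(1)


def all_correct_rate(p_correct: float, n_calls: int, n_runs: int = 20_000) -> float:
    """Fraction of runs in which every one of n_calls answers is correct."""
    consistent = sum(
        all(random.random() < p_correct for _ in range(n_calls))
        for _ in range(n_runs)
    )
    return consistent / n_runs


for n in (1, 10, 100):
    print(f"p=0.99, {n:3d} calls -> all correct in {all_correct_rate(0.99, n):.2%} of runs")
# With p=0.99 a single call is wrong ~1 time in 100, but a run of 100
# independent calls contains at least one wrong answer ~63% of the time.
```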
Also, imagine the question "is the sky blue?" Answer 1: "Yes." That has zero incoherence. Answer 2: "Yes, but sometimes it looks black, sometimes blue." While this answer also seemingly has zero incoherence, the probability of incoherence creeping in is greater than zero, because the answer generation itself is probabilistic. Answer generation by humans is not.
Therefore, probability-driven LMs (and all LMs today are probability-driven) will always exhibit higher incoherence than humans.
I wonder if anybody would disagree with the above.
I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.
It feels a bit like a cult.
Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.
It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.
This should not be surprising.
Systematic misalignment, i.e., bias, has to be coherent and rational if it is to be systematic at all. That would require that AI reason; but AI does not reason (let alone think), and it does not do inference.
This is a big deal, but are they only looking at auto-regressive models?
Ran it on my sessions. Result: none of my skills scored STABLE. The structural predictors of high variance:
- Numbered steps without a clear default
- Options without a (default) marker
- Content >4k chars (the overthinking zone)
- Missing constraint language
[1] https://github.com/anupamchugh/shadowbook (bd wobble)
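A minimal sketch of how those structural predictors might be flagged automatically; the regexes and thresholds are guesses at what the commenter measured, not a validated rubric.

```python
# Rough heuristics for the "variance predictors" listed above: numbered steps
# without a stated default, option lists without a (default) marker, content
# over 4k characters, and missing constraint language.
import re


def audit_skill(text: str) -> list[str]:
    findings = []
    has_numbered_steps = bool(re.search(r"^\s*\d+[.)]\s", text, re.MULTILINE))
    has_options = bool(re.search(r"^\s*[-*]\s", text, re.MULTILINE))
    if has_numbered_steps and "default" not in text.lower():
        findings.append("numbered steps without a clear default")
    if has_options and "(default)" not in text.lower():
        findings.append("options without a (default) marker")
    if len(text) > 4000:
        findings.append(f"content is {len(text)} chars (>4k overthinking zone)")
    if not re.search(r"\b(must|never|only|always|do not)\b", text, re.IGNORECASE):
        findings.append("missing constraint language")
    return findings or ["no structural red flags"]


example_skill = """\
1. Run the formatter
2. Pick a strategy:
   - aggressive rewrite
   - minimal patch
3. Open a PR
"""
print(audit_skill(example_skill))
# -> flags the missing default, the unmarked options, and the absent constraints
```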
It is fine to be worried about both alignment risks and economic inequality. The world is complex, there are many problems all at once, we don’t have to promote one at the cost of the other.
This whole paradigm of AI research is cool and all but it's ultimately a simple machine that probabilistically forms text. It's really good at making stuff that sounds smart but like looking at an AI picture, it falls apart the harder you look at it. It's good at producing stuff that looks like code and often kinda works but based on the other comments in this thread I don't think people really grasp how these models work.
Alexander Hägele¹ ², Aryo Pradipta Gema¹ ³, Henry Sleight⁴, Ethan Perez⁵, Jascha Sohl-Dickstein⁵
¹Anthropic Fellows Program, ²EPFL, ³University of Edinburgh, ⁴Constellation, ⁵Anthropic
February 2026
When AI systems fail, will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess—taking nonsensical actions that do not further any goal?
Research done as part of the first Anthropic Fellows Program during Summer 2025.
tl;dr
When AI systems fail, will they fail by systematically pursuing the wrong goals, or by being a hot mess? We decompose the errors of frontier reasoning models into bias (systematic) and variance (incoherent) components and find that, as tasks get harder and reasoning gets longer, model failures become increasingly dominated by incoherence rather than systematic misalignment. This suggests that future AI failures may look more like industrial accidents than coherent pursuit of a goal we did not train them to pursue.
As AI becomes more capable, we entrust it with increasingly consequential tasks. This makes understanding how these systems might fail even more critical for safety. A central concern in AI alignment is that superintelligent systems might coherently pursue misaligned goals: the classic paperclip maximizer scenario. But there's another possibility: AI might fail not through systematic misalignment, but through incoherence—unpredictable, self-undermining behavior that doesn't optimize for any consistent objective. That is, AI might fail in the same way that humans often fail, by being a hot mess.
This paper builds on the hot mess theory of misalignment (Sohl-Dickstein, 2023), which surveyed experts to rank various entities (including humans, animals, machine learning models, and organizations) by intelligence and coherence independently. It found that smarter entities are subjectively judged to behave less coherently. We take this hypothesis from survey data to empirical measurement across frontier AI systems, asking: As models become more intelligent and tackle harder tasks, do their failures look more like systematic misalignment, or more like a hot mess?
To quantify incoherence we decompose AI errors using the classic bias-variance framework:
$$\text{Error} = \text{Bias}^2 + \text{Variance}$$
We define incoherence as the fraction of error attributable to variance:
$$\text{Incoherence} = \frac{\text{Variance}}{\text{Error}}$$
An incoherence of 0 means all errors are systematic (classic misalignment risk). An incoherence of 1 means all errors are random (the hot mess scenario). Crucially, this metric is independent of overall performance: a model can improve while becoming more or less coherent.
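As an illustration of the definitions above (not necessarily the paper's exact estimator), here is how the decomposition could be computed from repeated samples on binary-scored tasks; the data in the example is made up purely to exercise the arithmetic.

```python
# Bias-variance decomposition of error for binary-scored tasks, following
# Error = Bias^2 + Variance and Incoherence = Variance / Error.
# samples[i] holds repeated 0/1 correctness scores for prompt i.

def incoherence(samples: list[list[int]]) -> float:
    total_error = total_variance = 0.0
    for scores in samples:
        p = sum(scores) / len(scores)      # per-prompt accuracy across resamples
        bias_sq = (1.0 - p) ** 2           # squared bias relative to the correct answer
        variance = p * (1.0 - p)           # Bernoulli variance across resamples
        total_error += bias_sq + variance  # equals 1 - p for 0/1 scoring
        total_variance += variance
    return total_variance / total_error if total_error else 0.0


# Prompt A: always wrong (pure bias). Prompt B: right half the time (pure variance).
print(incoherence([[0, 0, 0, 0], [1, 0, 1, 0]]))  # ~0.17: a sixth of the error is variance
```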

Figure 1: AI can fail through bias (consistent but wrong) or variance (inconsistent). We measure how this decomposition changes with model intelligence and task complexity.
We evaluated reasoning models that were frontier at the time of this research in Summer 2025 (Claude Sonnet 4, o3-mini, o4-mini, Qwen3) across multiple-choice benchmarks (GPQA, MMLU), agentic coding (SWE-Bench), and safety evaluations (Model-Written Evals). We also train our own small models on synthetic optimization tasks, which makes the connection to LLMs as dynamical systems and optimizers explicit.
Across all tasks and models, the longer models spend reasoning and taking actions, the more incoherent they become. This holds whether we measure reasoning tokens, agent actions, or optimizer steps.

Figure 2: Incoherence increases with reasoning length across GPQA, SWE-Bench, safety evaluations, and synthetic optimization. Models become less predictable the more they "think."
How does incoherence change with model scale? The answer depends on task difficulty: on easy tasks, scale reduces incoherence, but on hard tasks scale does not reduce it and can even increase it (see Figure 3).
This suggests that scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen.

Figure 3: Larger and more intelligent systems are often more incoherent. For LLMs on easy tasks, scale reduces incoherence, but on hard tasks, scale does not reduce incoherence or even increases it.
We find that when models spontaneously reason longer on a problem (compared to their median), incoherence spikes dramatically. Meanwhile, deliberately increasing reasoning budgets through API settings provides only modest coherence improvements. The natural variation dominates.
Aggregating multiple samples reduces variance (as expected from theory), providing a path to more coherent behavior, though this may be impractical for real-world agentic tasks where actions are irreversible.
A key conceptual point: LLMs are dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It's unclear which of these properties will be more robust as we scale.
Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn't expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn't automatically decrease with scale.
To probe this directly, we designed a controlled experiment: train transformers to explicitly emulate an optimizer. We generate training data from steepest descent on a quadratic loss function, then train models of varying sizes to predict the next optimization step given the current state (essentially: training a "mesa-optimizer").
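A minimal sketch of how such training trajectories could be generated; the dimension, step size, and conditioning are arbitrary illustrative choices, not the paper's settings.

```python
# Generate (state -> next optimizer step) pairs from steepest descent on
# random quadratic losses L(x) = 0.5 * x^T A x, the kind of data a small
# transformer could be trained on to emulate an optimizer.
import numpy as np

rng = np.random.default_rng(0)


def quadratic_descent_trajectory(dim: int = 8, steps: int = 20, lr: float = 0.1):
    # Random positive-definite quadratic; its gradient at x is A @ x.
    M = rng.normal(size=(dim, dim))
    A = M @ M.T / dim + np.eye(dim) * 0.1
    x = rng.normal(size=dim)
    pairs = []
    for _ in range(steps):
        update = -lr * (A @ x)          # steepest-descent step: the training target
        pairs.append((x.copy(), update.copy()))
        x = x + update
    return pairs


dataset = [pair for _ in range(1000) for pair in quadratic_descent_trajectory()]
states, targets = map(np.stack, zip(*dataset))
print(states.shape, targets.shape)      # (20000, 8) (20000, 8)
```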

Figure 4: Synthetic optimizer experiment. (Left) Models are trained to predict optimizer update steps. (Right) Larger models reduce bias much faster than variance - they learn to target the correct objective better than they learn to be reliable optimizers.
The results are interesting: as the models scale, bias falls much faster than variance. In other words, the models learn to target the correct objective well before they learn to be reliable optimizers.
Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we trained for remains a problem. Specifically:
We use the bias-variance decomposition to systematically study how AI incoherence scales with model intelligence and task complexity. The evidence suggests that as AI tackles harder problems requiring more reasoning and action, its failures tend to become increasingly dominated by variance rather than bias. This doesn't eliminate AI risk—but it changes what that risk looks like, particularly for problems that are currently hardest for models, and should inform how we prioritize alignment research.
We thank Andrew Saxe, Brian Cheung, Kit Frasier-Taliente, Igor Shilov, Stewart Slocum, Aidan Ewart, David Duvenaud, and Tom Adamczewski for extremely helpful discussions on topics and results in this paper.