- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
Coherence requires two opposing forces to hold it along one dimension, and at least three of them in higher-dimensional notions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that raising the reasoning threshold resulted in less coherence: more experimentation before we hit a dead end and turned around.
So we got better results using Haiku (failing over to Sonnet) rather than Opus, and using a higher-reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better because they don't second-guess their approach: they either fail or they succeed, and they aren't as tenacious as the higher-cost models.
We can escalate to a higher authority and get out of that mess faster if we fail hard and early.
Knowing exactly how a failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.
Splitting up the tactical and strategic sides of the problem seems to work, much the way generals don't carry rifles in a war.
LLMs aren’t constrained to linear logic like your average human.
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
However, I think producing a detailed enough specification requires the same or even more work than writing the code. We write rough specifications and clarify them during the process of coding. There is a minimum effort required to produce these specifications, and AI will not help you speed that up.
The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.
The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.
Smaller prompts and fewer tools tend to be more stable. I try to stay within 1,000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models; it's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend that it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub-agent every time you read a file, you can reduce the impact on the caller's token budget by orders of magnitude: the callee can return a brief summary or a yes/no response after reading 500 kB of source. This applies at each level of recursion and can compound exponentially over just a few nested calls.
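A minimal sketch of that pattern in Python. The `spawn_subagent` helper is hypothetical and stands in for whatever your agent framework uses to run one isolated inference pass and return only its final message; the token figures in the comments are illustrative.

```python
# Sketch of the "sub-agent per file read" pattern: large file contents stay in
# a throwaway sub-agent's context, and only a short verdict reaches the caller.
from pathlib import Path


def spawn_subagent(prompt: str) -> str:
    """Placeholder for an isolated model call; wire this to your own stack."""
    return "yes (stub answer)"  # replace with a real model/framework call


def check_file(path: str, question: str) -> str:
    """Let a sub-agent read a (possibly huge) file and answer one question.

    The caller only ever sees the short answer, so a 500 kB source file costs
    the parent context a handful of tokens instead of hundreds of thousands.
    The same trick nests: a sub-agent can delegate its own reads further down.
    """
    source = Path(path).read_text()
    prompt = (
        f"File {path}:\n\n{source}\n\n"
        f"Question: {question}\n"
        "Answer with one short sentence, or just yes/no."
    )
    return spawn_subagent(prompt)
```

In use, the parent agent's context grows by roughly one line per file checked, regardless of how large the files are.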
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming a better Bayesian, which implies being great at calibrating confidence intervals over one's interpretations and reasoning, and ultimately gaining a superhuman ability to evaluate the bounds of ambiguity and risk.
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.
Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.
The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.
One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.
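A toy Monte Carlo of that trade-off, with made-up success probabilities rather than measured ones: one monolithic, high-variance call versus a chain of better-specified subtask calls.

```python
# Toy comparison: one big inference call vs. a composition of smaller,
# well-specified subtask calls. Success probabilities are illustrative only.
import random

random.seed(0)
TRIALS = 10_000


def monolithic() -> bool:
    # One shot at the whole task; assumed to succeed 60% of the time.
    return random.random() < 0.60


def decomposed(n_subtasks: int = 4, p_sub: float = 0.93) -> bool:
    # Task succeeds only if every well-specified subtask succeeds.
    return all(random.random() < p_sub for _ in range(n_subtasks))


for name, fn in [("monolithic", monolithic), ("decomposed", decomposed)]:
    outcomes = [fn() for _ in range(TRIALS)]
    p = sum(outcomes) / TRIALS
    print(f"{name:11s} success={p:.3f} outcome variance={p * (1 - p):.3f}")
```

Under these assumed numbers the decomposed version both succeeds more often and has lower outcome variance, at the cost of more orchestration.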
Language models are probabilistic, not deterministic. Therefore incoherence _by definition_ increases as a response becomes lengthier. This is not true for humans, who tend to act and communicate deterministically. If I ask a human to read a PDF and then ask "does the word 'paperclip' appear in this PDF?", they will deterministically give a yes/no answer, and no matter how many times we repeat the process they will give the same answer consistently (and not because of autocorrelation, since this can be done across different humans). An LM's response is probabilistic and depends on the training itself: with a very well-trained model we might get a 99% reliable outcome, which means that out of 100 runs it will give the wrong answer once. We have little insight into this probabilistic component for LMs, but simulations could be done to research it; a toy version is sketched below. I would also be very curious about autocorrelation in models. If a human does a task and concludes "yes", they will keep responding "yes" to the same task, just with an increasing amount of eye-rolling.
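A toy version of such a simulation, with the per-call accuracy as an assumed parameter rather than a measured property of any real model:

```python
# Toy simulation of the "paperclip in the PDF?" example: a model that answers
# correctly with probability p per call, queried repeatedly.
import random

random.seed(1)


def all_correct_rate(p_correct: float, n_calls: int, n_runs: int = 20_000) -> float:
    """Fraction of runs in which every one of n_calls answers is correct."""
    consistent = sum(
        all(random.random() < p_correct for _ in range(n_calls))
        for _ in range(n_runs)
    )
    return consistent / n_runs


for n in (1, 10, 100):
    print(f"p=0.99, {n:3d} calls -> all correct in {all_correct_rate(0.99, n):.2%} of runs")
# With p=0.99 a single call is wrong ~1 time in 100, but a run of 100
# independent calls contains at least one wrong answer ~63% of the time.
```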
Also, imagine the question "is the sky blue?" Answer 1: "Yes." That has zero incoherence. Answer 2: "Yes, but sometimes it looks black, sometimes blue." While this answer also seemingly has zero incoherence, the probability of incoherence creeping in is greater than zero, because the answer generation itself is probabilistic. Answer generation by humans is not.
Therefore, probability-driven LMs (and all LMs today are probability-driven) will always exhibit higher incoherence than humans.
I wonder if anybody would disagree with the above.
I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.
It feels a bit like a cult.
Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.
It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.
This should not be surprising.
Systematic misalignment, i.e., bias, has to be coherent and rational if it is to be systematic at all. That would require that AI reason; but AI does not reason (let alone think), and it does not do inference.
This is a big deal, but are they only looking at auto-regressive models?
Ran it on my sessions. Result: none of my skills scored STABLE. The structural predictors of high variance:
- Numbered steps without a clear default
- Options without a (default) marker
- Content >4k chars (the overthinking zone)
- Missing constraint language
[1] https://github.com/anupamchugh/shadowbook (bd wobble)
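A minimal sketch of how those structural predictors might be flagged automatically; the regexes and thresholds are guesses at what the commenter measured, not a validated rubric.

```python
# Rough heuristics for the "variance predictors" listed above: numbered steps
# without a stated default, option lists without a (default) marker, content
# over 4k characters, and missing constraint language.
import re


def audit_skill(text: str) -> list[str]:
    findings = []
    has_numbered_steps = bool(re.search(r"^\s*\d+[.)]\s", text, re.MULTILINE))
    has_options = bool(re.search(r"^\s*[-*]\s", text, re.MULTILINE))
    if has_numbered_steps and "default" not in text.lower():
        findings.append("numbered steps without a clear default")
    if has_options and "(default)" not in text.lower():
        findings.append("options without a (default) marker")
    if len(text) > 4000:
        findings.append(f"content is {len(text)} chars (>4k overthinking zone)")
    if not re.search(r"\b(must|never|only|always|do not)\b", text, re.IGNORECASE):
        findings.append("missing constraint language")
    return findings or ["no structural red flags"]


example_skill = """\
1. Run the formatter
2. Pick a strategy:
   - aggressive rewrite
   - minimal patch
3. Open a PR
"""
print(audit_skill(example_skill))
# -> flags the missing default, the unmarked options, and the absent constraints
```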
It is fine to be worried about both alignment risks and economic inequality. The world is complex, there are many problems all at once, we don’t have to promote one at the cost of the other.
This whole paradigm of AI research is cool and all but it's ultimately a simple machine that probabilistically forms text. It's really good at making stuff that sounds smart but like looking at an AI picture, it falls apart the harder you look at it. It's good at producing stuff that looks like code and often kinda works but based on the other comments in this thread I don't think people really grasp how these models work.
Alexander Hägele¹ ², Aryo Pradipta Gema¹ ³, Henry Sleight⁴, Ethan Perez⁵, Jascha Sohl-Dickstein⁵
¹Anthropic Fellows Program, ²EPFL, ³University of Edinburgh, ⁴Constellation, ⁵Anthropic
February 2026
When AI systems fail, will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess—taking nonsensical actions that do not further any goal?
Research done as part of the first Anthropic Fellows Program during Summer 2025.
tl;dr
When AI systems fail, will they fail by systematically pursuing the wrong goals, or by being a hot mess? We decompose the errors of frontier reasoning models into bias (systematic) and variance (incoherent) components and find that, as tasks get harder and reasoning gets longer, model failures become increasingly dominated by incoherence rather than systematic misalignment. This suggests that future AI failures may look more like industrial accidents than coherent pursuit of a goal we did not train them to pursue.
As AI becomes more capable, we entrust it with increasingly consequential tasks. This makes understanding how these systems might fail even more critical for safety. A central concern in AI alignment is that superintelligent systems might coherently pursue misaligned goals: the classic paperclip maximizer scenario. But there's another possibility: AI might fail not through systematic misalignment, but through incoherence—unpredictable, self-undermining behavior that doesn't optimize for any consistent objective. That is, AI might fail in the same way that humans often fail, by being a hot mess.
This paper builds on the hot mess theory of misalignment (Sohl-Dickstein, 2023), which surveyed experts to rank various entities (including humans, animals, machine learning models, and organizations) by intelligence and coherence independently. It found that smarter entities are subjectively judged to behave less coherently. We take this hypothesis from survey data to empirical measurement across frontier AI systems, asking: As models become more intelligent and tackle harder tasks, do their failures look more like systematic misalignment, or more like a hot mess?
To quantify incoherence we decompose AI errors using the classic bias-variance framework:
$$\text{Error} = \text{Bias}^2 + \text{Variance}$$
We define incoherence as the fraction of error attributable to variance:
$$\text{Incoherence} = \frac{\text{Variance}}{\text{Error}}$$
An incoherence of 0 means all errors are systematic (classic misalignment risk). An incoherence of 1 means all errors are random (the hot mess scenario). Crucially, this metric is independent of overall performance: a model can improve while becoming more or less coherent.
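As an illustration of the definitions above (not necessarily the paper's exact estimator), here is how the decomposition could be computed from repeated samples on binary-scored tasks; the data in the example is made up purely to exercise the arithmetic.

```python
# Bias-variance decomposition of error for binary-scored tasks, following
# Error = Bias^2 + Variance and Incoherence = Variance / Error.
# samples[i] holds repeated 0/1 correctness scores for prompt i.

def incoherence(samples: list[list[int]]) -> float:
    total_error = total_variance = 0.0
    for scores in samples:
        p = sum(scores) / len(scores)      # per-prompt accuracy across resamples
        bias_sq = (1.0 - p) ** 2           # squared bias relative to the correct answer
        variance = p * (1.0 - p)           # Bernoulli variance across resamples
        total_error += bias_sq + variance  # equals 1 - p for 0/1 scoring
        total_variance += variance
    return total_variance / total_error if total_error else 0.0


# Prompt A: always wrong (pure bias). Prompt B: right half the time (pure variance).
print(incoherence([[0, 0, 0, 0], [1, 0, 1, 0]]))  # ~0.17: a sixth of the error is variance
```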

Figure 1: AI can fail through bias (consistent but wrong) or variance (inconsistent). We measure how this decomposition changes with model intelligence and task complexity.
We evaluated reasoning models that were frontier at the time of this research in Summer 2025 (Claude Sonnet 4, o3-mini, o4-mini, Qwen3) across multiple-choice benchmarks (GPQA, MMLU), agentic coding (SWE-Bench), and safety evaluations (Model-Written Evals). We also train our own small models on synthetic optimization tasks, which makes the connection to LLMs as dynamical systems and optimizers explicit.
Across all tasks and models, the longer models spend reasoning and taking actions, the more incoherent they become. This holds whether we measure reasoning tokens, agent actions, or optimizer steps.

Figure 2: Incoherence increases with reasoning length across GPQA, SWE-Bench, safety evaluations, and synthetic optimization. Models become less predictable the more they "think."
How does incoherence change with model scale? The answer depends on task difficulty: on easy tasks, scale reduces incoherence, but on hard tasks scale does not reduce it and can even increase it (see Figure 3).
This suggests that scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen.

Figure 3: Larger and more intelligent systems are often more incoherent. For LLMs on easy tasks, scale reduces incoherence, but on hard tasks, scale does not reduce incoherence or even increases it.
We find that when models spontaneously reason longer on a problem (compared to their median), incoherence spikes dramatically. Meanwhile, deliberately increasing reasoning budgets through API settings provides only modest coherence improvements. The natural variation dominates.
Aggregating multiple samples reduces variance (as expected from theory), providing a path to more coherent behavior, though this may be impractical for real-world agentic tasks where actions are irreversible.
A key conceptual point: LLMs are dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It's unclear which of these properties will be more robust as we scale.
Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn't expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn't automatically decrease with scale.
To probe this directly, we designed a controlled experiment: train transformers to explicitly emulate an optimizer. We generate training data from steepest descent on a quadratic loss function, then train models of varying sizes to predict the next optimization step given the current state (essentially: training a "mesa-optimizer").
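A minimal sketch of how such training trajectories could be generated; the dimension, step size, and conditioning are arbitrary illustrative choices, not the paper's settings.

```python
# Generate (state -> next optimizer step) pairs from steepest descent on
# random quadratic losses L(x) = 0.5 * x^T A x, the kind of data a small
# transformer could be trained on to emulate an optimizer.
import numpy as np

rng = np.random.default_rng(0)


def quadratic_descent_trajectory(dim: int = 8, steps: int = 20, lr: float = 0.1):
    # Random positive-definite quadratic; its gradient at x is A @ x.
    M = rng.normal(size=(dim, dim))
    A = M @ M.T / dim + np.eye(dim) * 0.1
    x = rng.normal(size=dim)
    pairs = []
    for _ in range(steps):
        update = -lr * (A @ x)          # steepest-descent step: the training target
        pairs.append((x.copy(), update.copy()))
        x = x + update
    return pairs


dataset = [pair for _ in range(1000) for pair in quadratic_descent_trajectory()]
states, targets = map(np.stack, zip(*dataset))
print(states.shape, targets.shape)      # (20000, 8) (20000, 8)
```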

Figure 4: Synthetic optimizer experiment. (Left) Models are trained to predict optimizer update steps. (Right) Larger models reduce bias much faster than variance - they learn to target the correct objective better than they learn to be reliable optimizers.
The results are interesting: as the models scale, bias falls much faster than variance. In other words, the models learn to target the correct objective well before they learn to be reliable optimizers.
Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we trained for remains a problem. Specifically:
We use the bias-variance decomposition to systematically study how AI incoherence scales with model intelligence and task complexity. The evidence suggests that as AI tackles harder problems requiring more reasoning and action, its failures tend to become increasingly dominated by variance rather than bias. This doesn't eliminate AI risk—but it changes what that risk looks like, particularly for problems that are currently hardest for models, and should inform how we prioritize alignment research.
We thank Andrew Saxe, Brian Cheung, Kit Frasier-Taliente, Igor Shilov, Stewart Slocum, Aidan Ewart, David Duvenaud, and Tom Adamczewski for extremely helpful discussions on topics and results in this paper.