Don't trust large context windows

This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.

(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)

Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best.

I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.

FWIW I was still getting solid results at 70%+ context used in a Fable-1M session, so not sure if this still applies to Mythos-grade models.

Also, this post feels like it was written last year. It’s talking about plans and auto-compaction like they are new things.

The approach we're trying to deal with this context rot is using a bunch of related techniques to deal with this which we call transposing the agent loop: https://alejo.ch/3jt

In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each advances the state in a small step towards the final goal.

I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly different if needed, you know the seed conditions for the task.

Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.

Considerations about what goes on in agents internally will probably not be part of software development for long.

To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best.

FWIW I was still getting solid results at 70%+ context used in a Fable-1M session, so not sure if this still applies to Mythos-grade models.

Also, this post feels like it was written last year. It’s talking about plans and auto-compaction like they are new things.

The approach we're trying to deal with this context rot is using a bunch of related techniques to deal with this which we call transposing the agent loop: https://alejo.ch/3jt

In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each advances the state in a small step towards the final goal.

Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.

Is it adhoc or you use more structured approaches like openspec? I also tend to work on a plan first, but it stays as in-session todo, which is hard to reference later.

I recently watched a video that put a name on something I'd been feeling. The author splits an LLM's context window into two zones. There's the smart zone, where the model is sharp, and the dumb zone, where attention drops off and the model starts forgetting what you told it five minutes ago. The cutoff sits somewhere around 100k tokens. It doesn't matter how big the advertised context window is.

This matters because coding agents will happily walk you straight into the dumb zone. A modern agent burns through tokens fast. A few file reads, a long debug session, a sprawling test run, and you're at 100k before lunch. Meanwhile vendors keep advertising windows of 200k, 1M, even 2M, as if those numbers represented a usable working set. They don't. Studies like RULER and Chroma's report on context rot show that effective context is a fraction of the advertised number, and that performance degrades gradually as you fill the window.

Large context windows are mostly a marketing number. The architectures behind them work, but they paper over a problem the underlying attention mechanism doesn't really solve. The number on the box gets bigger every release. The usable part doesn't keep up.

Modern agents are getting smart about this. Tools like Claude Code now auto-compact: when the session gets long, the agent summarizes the history and starts fresh. That helps. But auto-compaction kicks in after you've already spent time in the dumb zone, and the summary is itself produced by a model that's already degraded. Better than nothing, but I'd rather avoid the situation altogether.

What I do is open a new session and pass it a spec I wrote myself. That's a much higher signal handoff than any automated summary, because I get to decide what matters going forward. It's the breadcrumb approach applied to agents. Leave an artifact that the next session, or the next person, can pick up cleanly.

You can take this further. Projects like obra/superpowers and mattpocock/skills structure entire agent workflows around small, named artifacts. PRDs, plans, skills, sub-agent handoffs. Each one is a way to keep the working session in the smart zone by deliberately moving information out of the session into something the next session can read.

So I treat my context window like a budget. I assume only the first chunk is really working for me, and everything I can move out of the live session and into a written artifact is one less thing for attention to fight over.

Hacker Times

Hacker Times

Don't trust large context windows

Discussion

Discussion

Continue Reading