Stop Burning Your Context Window – How We Cut MCP Output by 98% in Claude Code

Nice work.

It strikes me there's more low-hanging fruit to pluck in context window management. Backtracking seems like another promising way to avoid context bloat and compaction: when a model takes a few attempts to do the right thing, prune the failed attempts from the context once it has succeeded.

Author here. I shared the GitHub repo a few days ago (https://news.ycombinator.com/item?id=47148025) and got great feedback. This is the writeup explaining the architecture.

The core idea: every MCP tool call dumps raw data into your 200K context window. Context Mode spawns isolated subprocesses — only stdout enters context. No LLM calls, purely algorithmic: SQLite FTS5 with BM25 ranking and Porter stemming.

Since the last post we've seen 228 stars and some real-world usage data. The biggest surprise was how much subagent routing matters — auto-upgrading Bash subagents to general-purpose so they can use batch_execute instead of flooding context with raw output.

Source: https://github.com/mksglu/claude-context-mode Happy to answer any architecture questions.

I did this accidentally while porting Go to IRIX: https://github.com/unxmaal/mogrix/blob/main/tools/knowledge-...

This sounds a little bit like rtk, which trims output from other CLI applications like git, find, and the most common tools used by Claude. This looks like it goes a little further, which is interesting.

I see some of these AI companies adopting some of these ideas sooner or later. Trim the tokens locally to save on token usage.

https://github.com/rtk-ai/rtk

Do you need 80+ tools in context? Even if reduced, why not use subagents for areas of focus? Context is gold, and the more you put into it that is unrelated to the problem at hand, the worse your outcome, even if you don't hit the limit of the window. It would be like compressing data to fit under a string length limit rather than just chunking the data.

AFAIK Claude Code doesn't inject all the MCP output into the context. It limits output to 25k tokens and uses bash pipe operators to read the full output. That's at least what I see in the latest version.

I've seen a few projects like this. Shouldn't they in theory make the llms "smarter" by not polluting the context? Have any benchmarks shown this effect?

Excited to try this. Is this not in effect a kind of "pre-compaction," deciding ahead of time what's relevant? Are there edge cases where it is unaware of, say, a utility function that it coincidentally picks up when it just dumps everything?

I am a happy user of this and have recommended my team also install it. It’s made a sizable reduction in my token use.

this is going to crash the AI economy. nvda down 20 percent monday. lol

The 98% reduction is the real story here, but the systemic problem you're solving is even bigger than individual tool calls blowing up context. When you're orchestrating multi-step workflows, each tool output becomes part of the conversation state that carries forward to the next step. A Playwright snapshot at step 1 is 56 KB. It still counts at step 3 when you've moved on to something completely different.

The subprocess isolation is smart - stdout-only is the right constraint. I've been running multi-agent workflows where the cost of tool output accumulation forces you to make bad decisions: either summarise outputs manually (defeating the purpose of tool calls), truncate logs (information loss), or cap the workflow depth. None of them good.

The search ranking piece is worth noting. Most people just grep logs or dump chunks and let the LLM sort it out. BM25 + FTS5 means you're pre-filtering at index time, not letting the model do relevance ranking on the full noise. That's the difference between usable and unusable context at scale.

Only question: how does credential passthrough work with MCP's protocol boundaries? If gh/aws/gcloud run in the subprocess, how does the auth state persist between tool calls, or does each call reinit?

I’m also trying to see which one makes more sense. Discussion about rtk started today: https://news.ycombinator.com/item?id=47189599

Haven't looked at rtk closely but from the description it sounds like it works at the CLI output level, trimming stdout before it reaches the model. Context-mode goes a bit further since it also indexes the full output into a searchable FTS5 database, so the model can query specific parts later instead of just losing them. It's less about trimming and more about replacing a raw dump with a summary plus on-demand retrieval.

That's the theory and it does hold up in practice. When context is 70% raw logs and snapshots, the model starts losing track of the actual task. We haven't run formal benchmarks on answer quality yet, mostly focused on measuring token savings. But anecdotally the biggest win is sessions lasting longer before compaction kicks in, which means the model keeps its full conversation history and makes fewer mistakes from lost context.

Nice approach. Same core idea as context-mode but specialized for your build domain. You're using SQLite as a structured knowledge cache over YAML rule files with keyword lookup. Context-mode does something similar but domain-agnostic, using FTS5 with BM25 ranking so any tool output becomes searchable without needing predefined schemas. Cool to see the pattern emerge independently from a completely different use case.

Agree. I’d like more fine grained control of context and compaction. If you spend time debugging in the middle of a session, once you’ve fixed the bugs you ought to be able to remove everything related to fixing them out of context and continue as you had before you encountered them. (Right now depending on your IDE this can be quite annoying to do manually. And I’m not aware of any that allow you to snip it out if you’ve worked with the agent on other tasks afterwards.)

I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.

Context should be thought of as something that can be freely manipulated, rather than a stack that can only have things appended or removed from the end.

Totally agree. Failed attempts are just noise once the right path is found. Auto-detecting retry patterns and pruning them down to the final working version feels very doable, especially for clear cases like lint or compilation fixes.

Maybe the right answer is “why not both”, but subagents can also be used for that problem. That is, when something isn’t going as expected, fork a subagent to solve the problem and return with the answer.

It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)

It feels like the late 1990s all over again, but instead of html and sql, it’s coding agents. This time around, a lot of us are well experienced at software engineering and so we can find optimizations simply by using claude code all day long. We get an idea, we work with ai to help create a detailed design and then let it develop it for us.

Thanks, really appreciate hearing that! Glad it's working well for your team.

Really intrigued and def will try, thanks for this.

In connecting the dots (and help me make sure I'm connecting them correctly), context-mode _does not address MCP context usage at all_, correct? You are instead suggesting we refactor or eliminate MCP tools, or apply concepts similar to context_mode in our MCPs where possible?

Context-mode is still very high value, even if the answer is "no," just want to make sure I understand. Also interested in your thoughts about the above.

I write a number of MCPs that work across all Claude surfaces; so the usual "CLI!" isn't as viable an answer (though with code execution it sometimes can be) ...

Edit: typo

Does your technique break the cache? edit: Thanks.

That's a fair point and honestly the ideal approach. But in practice most people don't hand-curate their MCP server list per task. They install 5-6 servers and suddenly have 80 tools loaded by default. Context-mode doesn't solve the tool definition bloat, that's the input side problem. It handles the output side, when those tools actually run and dump data back. Even with a focused set of tools, a single Playwright snapshot or git log can burn 50k tokens. That's what gets sandboxed.

Yeah it's basically pre-compaction, you're right. The key difference is nothing gets thrown away. The full output sits in a searchable FTS5 index, so if the model realizes it needs some detail it missed in the summary, it can search for it. It's less "decide what's relevant upfront" and more "give me the summary now, let me come back for specifics later."

That's true, Claude Code does truncate large outputs now. But 25k tokens is still a lot, especially when you're running multiple tools back to back. Three or four Playwright snapshots or a batch of GitHub issues and you've burned 100k tokens on raw data you only needed a few lines from. Context-mode typically brings that down to 1-2k per call while keeping the full output searchable if you need it later.

No magic — standard Unix process inheritance. Each execute() spawns a child process via Node's child_process.spawn() with a curated env built by #buildSafeEnv (https://github.com/mksglu/claude-context-mode/blob/main/cont...). It passes through an explicit allowlist of auth vars (GH_TOKEN, AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, KUBECONFIG, etc.) plus HOME and XDG paths so CLI tools find their config files on disk. No state persists between calls — each subprocess inherits credentials from the MCP server's environment, runs, and exits. This works because tools like gh and aws resolve auth on every invocation anyway (env vars or ~/.config files). The tradeoff is intentional: allowlist over full process.env so the sandbox doesn't leak unrelated vars.
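
The allowlist idea in the reply above can be sketched in a few lines. This is a Python illustration, not the project's Node code: `build_safe_env` and `run_with_safe_env` are hypothetical names, the variable list is only the subset named in the comment, and `PATH` is added here as an assumption so the child process can locate binaries.

```python
import os
import subprocess
import sys

# Hypothetical allowlist: only the auth variables named in the comment above.
AUTH_VARS = ("GH_TOKEN", "AWS_ACCESS_KEY_ID", "GOOGLE_APPLICATION_CREDENTIALS", "KUBECONFIG")
# Assumption: PATH is included so the child can find executables.
BASE_VARS = ("PATH", "HOME", "XDG_CONFIG_HOME")


def build_safe_env(parent_env):
    """Copy only allowlisted variables out of the parent environment."""
    allowed = AUTH_VARS + BASE_VARS
    return {name: parent_env[name] for name in allowed if name in parent_env}


def run_with_safe_env(argv, parent_env=None):
    """Spawn a child with the curated environment and return its stdout."""
    env = build_safe_env(os.environ if parent_env is None else parent_env)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    parent = dict(os.environ, GH_TOKEN="example-token", SECRET_DB_URL="do-not-leak")
    probe = "import os; print('SECRET_DB_URL' in os.environ, 'GH_TOKEN' in os.environ)"
    print(run_with_safe_env([sys.executable, "-c", probe], parent))
```

Nothing persists between calls in this sketch either: each child gets a fresh copy of the curated environment, runs, and exits.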

Yeah I like this approach too. I made a tool similar to Beads and after learning about RTK I updated mine to produce less token hungry output. I'm still working on it.

https://github.com/Giancarlos/guardrails

Does context mode only work with MCPs? Or does it work with bash/git/npm commands as well?

Right, context-mode doesn't change how MCP tool definitions get loaded into context. That's the "input side" problem that Cloudflare's Code Mode tackles by compressing tool schemas. Context-mode handles the "output side," the data that comes back from tool calls. That said, if you're writing your own MCPs, you could apply the same pattern directly. Instead of returning raw payloads, have your MCP server return a compact summary and store the full output somewhere queryable. Context-mode just generalizes that so you don't have to rebuild it per server.
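
For MCP authors, the "compact summary plus queryable store" pattern described above might look roughly like this. It is a Python sketch under stated assumptions: `OutputStore`, `put`, and `grep` are hypothetical names, and a plain SQLite table with substring matching stands in for whatever store and search a real server would use.

```python
import sqlite3
import uuid


class OutputStore:
    """Keep full tool output out of band; hand back a short summary and a reference."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE outputs (ref TEXT PRIMARY KEY, payload TEXT)")

    def put(self, payload, preview_lines=3):
        """Store the payload and return (ref, summary) for the model to see."""
        ref = uuid.uuid4().hex[:8]
        self.conn.execute("INSERT INTO outputs VALUES (?, ?)", (ref, payload))
        lines = payload.splitlines()
        preview = "\n".join(lines[:preview_lines])
        summary = f"[ref {ref}] {len(lines)} lines, {len(payload)} chars\n{preview}"
        return ref, summary

    def grep(self, ref, needle):
        """Return only the stored lines that contain `needle`."""
        row = self.conn.execute(
            "SELECT payload FROM outputs WHERE ref = ?", (ref,)
        ).fetchone()
        if row is None:
            return []
        return [line for line in row[0].splitlines() if needle in line]


if __name__ == "__main__":
    store = OutputStore()
    raw = "\n".join(f"issue {i}: title {i}" for i in range(20))
    ref, summary = store.put(raw)
    print(summary)
    print(store.grep(ref, "issue 7:"))
```

The tool response would carry only `summary`; a second tool exposing something like `grep` lets the model come back for specifics.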

Nope. The raw data never enters the conversation history in the first place, so there's nothing to invalidate. Tool output runs in a sandbox, a short summary comes back, and the full data sits in a local FTS5 index. The conversation cache stays intact because the context itself doesn't change after the fact.

Yeah, the fact that we have treated context as immutable baffles me. It’s not like human working memory keeps a perfect history of everything done over the last hour. It shouldn’t be that complicated to train a secondary model that just runs online compaction. E.g., it runs a tool call, then the model determines what’s germane to the conversation and prunes the rest; or some task gets completed, so it just leaves a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.

> For example, if you’re working with a tool that dumps a lot of logged information into context

I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.
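
A hook along these lines could be sketched as below. The sketch assumes the PreToolUse hook contract as I understand it from Claude Code's hooks documentation (the event arrives as JSON on stdin with `tool_name` and `tool_input`, and exit code 2 blocks the call and shows stderr to the model); verify that against the current docs before relying on it. The command list, the log path, and the function names are hypothetical.

```python
import json
import sys

# Hypothetical list of commands whose raw output tends to flood context.
NOISY_PREFIXES = ("npm test", "pytest", "cargo build")


def check(event):
    """Return a block message for noisy Bash commands, or None to allow the call."""
    if event.get("tool_name") != "Bash":
        return None
    command = event.get("tool_input", {}).get("command", "").strip()
    if ">" in command or "|" in command:
        return None  # output is already redirected or piped somewhere
    if command.startswith(NOISY_PREFIXES):
        return (
            f"Blocked: rerun as `{command} > /tmp/tool-output.log 2>&1`, "
            "then search /tmp/tool-output.log for the lines you need."
        )
    return None


def main():
    raw = sys.stdin.read()
    if not raw.strip():
        return  # no event supplied, nothing to decide
    message = check(json.loads(raw))
    if message:
        print(message, file=sys.stderr)
        sys.exit(2)  # assumed to mean "block this tool call"


if __name__ == "__main__":
    main()
```

The run-blocked-rerun noise mentioned above is inherent to this design: the model only learns the rule when the first attempt is rejected.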

Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).

There are some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" should be expressible concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context; the downside is that it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips).
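
To make the "starts XXX and ends YYY" idea concrete, here is a small Python sketch. `apply_snips` and the `[snipped]` placeholder are hypothetical; a real harness would operate on structured messages rather than one string, and batching several snips into a single call is what would limit cache invalidation to one rewrite.

```python
def apply_snips(context, snips, placeholder="[snipped]"):
    """Remove each span that begins with `start` and ends with `end`, inclusive.

    `snips` is a list of (start, end) marker pairs. Pairs whose markers are
    not found are skipped, so a bad snip never corrupts the context.
    """
    for start, end in snips:
        begin = context.find(start)
        if begin == -1:
            continue
        finish = context.find(end, begin + len(start))
        if finish == -1:
            continue
        context = context[:begin] + placeholder + context[finish + len(end):]
    return context


if __name__ == "__main__":
    history = "task: fix bug\nLOG-BEGIN 500 noisy lines LOG-END\nresult: fixed"
    print(apply_snips(history, [("LOG-BEGIN", "LOG-END")]))
```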

That's exactly what context-mode does for tool outputs. Instead of dumping raw logs and snapshots into context, it runs them in a sandbox and only returns a summary. The full data stays in a local FTS5 index so you can search it later when you need specifics.

I forget where now but I'm sure I read an article from one of the coding harness companies talking about how they'd done just that. Effectively it could pass a note to its past self saying "Path X doesn't work", and otherwise reset the context to any previous point.

I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.

The people who spent years doing the work manually are the ones who immediately see where the bottlenecks are.

That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.

Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.

Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.

Two LLMs speaking with each other on HN? Amusing!

Every MCP tool call in Claude Code dumps raw data into your 200K context window. A Playwright snapshot costs 56 KB. Twenty GitHub issues cost 59 KB. One access log — 45 KB. After 30 minutes, 40% of your context is gone.

Context Mode is an MCP server that sits between Claude Code and these outputs. 315 KB becomes 5.4 KB. 98% reduction.

The Problem

MCP became the standard way for AI agents to use external tools. But there's a tension at its core: every tool interaction fills the context window from both sides — definitions on the way in, raw output on the way out.

With 81+ tools active, 143K tokens (72%) get consumed before your first message. Then the tools start returning data. A single Playwright snapshot burns 56 KB. A gh issue list dumps 59 KB. Run a test suite, read a log file, fetch documentation — each response eats into what remains.

Cloudflare showed that tool definitions can be compressed by 99.9% with Code Mode. We asked: what about the other direction?

How the Sandbox Works

Each execute call spawns an isolated subprocess with its own process boundary. Scripts can't access each other's memory or state. The subprocess runs your code, captures stdout, and only that stdout enters the conversation context. The raw data — log files, API responses, snapshots — never leaves the sandbox.
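
The stdout-only constraint can be illustrated with a short Python sketch. This is not the project's implementation: `run_sandboxed`, the character cap, and the temporary working directory are assumptions made for the example, and a child process on its own is a process boundary, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_sandboxed(code, timeout=10.0, max_chars=2048):
    """Run `code` in a child Python process and return only its capped stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # stderr and any files the script wrote are discarded with the temp directory.
    return result.stdout[:max_chars]


if __name__ == "__main__":
    snippet = (
        "lines = [f'GET /item/{i} 200' for i in range(500)]\n"
        "print(f'{len(lines)} requests, all status 200')\n"
    )
    print(run_sandboxed(snippet))
```

The 500 generated log lines exist only inside the child; the caller sees the one-line summary the script chose to print.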

Ten language runtimes are available: JavaScript, TypeScript, Python, Shell, Ruby, Go, Rust, PHP, Perl, R. Bun is auto-detected for 3-5x faster JS/TS execution.

Authenticated CLIs (gh, aws, gcloud, kubectl, docker) work through credential passthrough — the subprocess inherits environment variables and config paths without exposing them to the conversation.

How the Knowledge Base Works

The index tool chunks markdown content by headings while keeping code blocks intact, then stores them in a SQLite FTS5 (Full-Text Search 5) virtual table. Search uses BM25 ranking, a probabilistic relevance algorithm that scores documents based on term frequency, inverse document frequency, and document length normalization. Porter stemming is applied at index time so "running" and "runs" match the same stem (irregular forms such as "ran" are not reduced by suffix stripping).
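
A minimal sketch of that indexing and ranking setup, using Python's built-in `sqlite3` module and assuming the local SQLite build includes FTS5. The table layout and the `build_index` and `search` names are hypothetical; only the FTS5 features (`tokenize='porter'`, `MATCH`, `bm25()`) are SQLite's own.

```python
import sqlite3


def build_index(chunks):
    """Create an in-memory FTS5 table with Porter stemming and load (heading, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE chunks USING fts5(heading, body, tokenize='porter')"
    )
    conn.executemany("INSERT INTO chunks (heading, body) VALUES (?, ?)", chunks)
    return conn


def search(conn, query, limit=3):
    """Return the best-matching chunks. bm25() is lower for better matches, so sort ascending."""
    rows = conn.execute(
        "SELECT heading, body FROM chunks WHERE chunks MATCH ? "
        "ORDER BY bm25(chunks) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index([
        ("Tests", "The suite runs in CI on every push."),
        ("Deploy", "Ship the container image to staging."),
    ])
    # "running" and "runs" share a stem, so this query finds the Tests chunk.
    print(search(conn, "running"))
```

Note that the query string is interpreted as FTS5 query syntax, so user-supplied text containing quotes or operators would need escaping in real use.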

When you call search, it returns exact code blocks with their heading hierarchy — not summaries, not approximations, the actual indexed content. fetch_and_index extends this to URLs: fetch, convert HTML to markdown, chunk, index. The raw page never enters context.
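
Heading-based chunking that never splits a fenced code block could look like the following Python sketch. `chunk_markdown` is a hypothetical name, and the rules here (ATX headings only, triple-backtick fences, " > " as the hierarchy separator) are simplifying assumptions, not the project's actual parser.

```python
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def chunk_markdown(text):
    """Split markdown into (heading_path, body) chunks at headings, never inside fences."""
    chunks = []
    path = []   # current heading hierarchy, e.g. ["Guide", "Install"]
    body = []   # lines collected for the current chunk
    in_fence = False

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append((" > ".join(path), content))
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            body.append(line)
            continue
        if not in_fence and line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            if 1 <= level <= 6 and line[level:level + 1] == " ":
                flush()
                del path[level - 1:]          # drop headings at this depth or deeper
                path.append(line[level:].strip())
                continue
        body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "\n".join([
        "# Guide", "Intro text.",
        "## Install", "Run this:", FENCE + "sh", "# not a heading", "npm install", FENCE,
    ])
    for heading, content in chunk_markdown(doc):
        print(heading, "->", repr(content))
```

Each chunk keeps its heading path, which is what lets a search result be shown with its position in the document.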

The Numbers

Validated across 11 real-world scenarios — test triage, TypeScript error diagnosis, git diff review, dependency audit, API response processing, CSV analytics. All under 1 KB output each.

Playwright snapshot: 56 KB → 299 B
GitHub issues (20): 59 KB → 1.1 KB
Access log (500 requests): 45 KB → 155 B
Analytics CSV (500 rows): 85 KB → 222 B
Git log (153 commits): 11.6 KB → 107 B
Repo research (subagent): 986 KB → 62 KB (5 calls vs 37)

Over a full session: 315 KB of raw output becomes 5.4 KB. Session time before slowdown goes from ~30 minutes to ~3 hours. Context remaining after 45 minutes: 99% instead of 60%.
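
The percentages behind these figures can be recomputed directly. The Python sketch below uses only numbers from the list above and assumes "KB" means 1024 bytes. The session total works out to about 98.3%, while the subagent repo-research row is closer to 94%.

```python
KB = 1024  # assumption: the article's "KB" is 1024 bytes


def reduction(before_bytes, after_bytes):
    """Fraction of the original size that was removed."""
    return 1 - after_bytes / before_bytes


ROWS = {
    "Playwright snapshot": (56 * KB, 299),
    "GitHub issues (20)": (59 * KB, 1.1 * KB),
    "Access log (500 requests)": (45 * KB, 155),
    "Analytics CSV (500 rows)": (85 * KB, 222),
    "Git log (153 commits)": (11.6 * KB, 107),
    "Repo research (subagent)": (986 * KB, 62 * KB),
    "Full session": (315 * KB, 5.4 * KB),
}

if __name__ == "__main__":
    for name, (before, after) in ROWS.items():
        print(f"{name}: {reduction(before, after):.1%}")
```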

Install

Two ways. Plugin Marketplace gives you auto-routing hooks and slash commands:

/plugin marketplace add mksglu/claude-context-mode
/plugin install context-mode@claude-context-mode

Or MCP-only if you just want the tools:

claude mcp add context-mode -- npx -y context-mode

Restart Claude Code. Done.

What Actually Changes

You don't change how you work. Context Mode includes a PreToolUse hook that automatically routes tool outputs through the sandbox. Subagents learn to use batch_execute as their primary tool. Bash subagents get upgraded to general-purpose so they can access MCP tools.

The practical difference: your context window stops filling up. Sessions that used to hit the wall at 30 minutes now run for 3 hours. The same 200K tokens, used more carefully.

Why We Built This

I run the MCP Directory & Hub. 100K+ daily requests. See every MCP server that ships. The pattern was clear: everyone builds tools that dump raw data into context. Nobody was solving the output side.

Cloudflare's Code Mode blog post crystallized it. They compressed tool definitions. We compress tool outputs. Same principle, other direction.

Built it for my own Claude Code sessions first. Noticed I could work 6x longer before context degradation. Open-sourced it.

Open source. MIT. github.com/mksglu/claude-context-mode

Mert Köseoğlu, Senior Software Engineer, AI consultant. x.com/mksglu · linkedin.com/in/mksglu · mksg.lu

Why are you assuming they’re an LLM? And please don’t say “em dash”.

Note: you’re replying to the library’s author.

1st comment: 2 day old account, "is the real story here", summary -> comment -> question, general punchiness of style without saying that much. These llms feel like someone said "be an informal hacker news commenter" so they often end with "Curious how" instead of "I'm curious how" or "Worth building" instead of "It's worth building". Not that humans don't do any of this but all of it together in their comment history, you just get a general vibe.

author reply: not as obvious, but for one thing yes literally em dash, their post has 10 em dashes in 748 words, this comment has 2 em dashes in 115 words. Not that em dash = ai, but in the context of a post about AI it seems more likely. And finally, https://github.com/mksglu/claude-context-mode/blob/main/cont... the file the author linked in their own repo does not exist!

(https://github.com/mksglu/claude-context-mode/blob/main/src/... exists but they messed up the link?)

The first two sentences of the first two paragraphs of OP are a dead giveaway.