And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?
The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. That's goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, that very possibly keep it from getting "lost in the sauce."
By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.
I don’t know if it helps maintain long term coherency, but my sessions do occasionally reference those docs. More than that, it’s an excellent “daily report” type system where you can give visibility to your manager (and your future self) on what you did and why.
Point being, it might be better to distill that long-term cohesion into a verbose markdown file, so that you and your future sessions can read it as needed. A lot of the context is spent trying things and figuring out what problem to solve, which can be documented far more concisely than letting it fill up your context window.
EDIT: Someone asked for installation steps, so I posted it here: https://news.ycombinator.com/item?id=47581936
LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.
The benchmark is totally useless. It measures single prompts, and only compares output tokens with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word"
This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." You're totally destroying any chance of getting pushback; any mistake you make in the prompt becomes catastrophic.
"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".
Isn’t this what Claude’s personalization setting is for? It’s globally-on.
I like conciseness, but it should be because it makes the writing better, not because it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation between conciseness and quality.
See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue
> Two things that helped me stay under [the token limit] even with heavy usage:
> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom
> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.
> Stacks on top of Headroom. https://github.com/rtk-ai/rtk
> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.
> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack
> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.
I haven’t tested those yet, but they seem related and interesting.
I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.
The “answer before reasoning” rule is good evidence for it. It misses the most fundamental property of transformers: they are autoregressive.
Also, reinforcement learning is what makes the model behave the way you are trying to avoid. The model's default output is what actually performs best on the kind of software engineering task you are trying to achieve. I’m not certain, but I’m pretty confident that response length is a target the model houses optimize for: the model is trained to achieve high scores on the benchmarks (and the training dataset) while balancing length, sycophancy, safety, and capability.
So, actually, trying to change Claude too much from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded “out of distribution” territory and soon discover why top researchers talk so much about not-AGI-yet.
The very first rule doesn’t work. If you ask for the answer up front, it will make something up and then justify it. If you ask for reasoning first, it will brainstorm and then come up with a reasonable answer that integrates its thinking.
"Great question! I can see you're working with a loop. Let me take a look at that. That's a thoughtful piece of code! However,"
And they are charging for every word! There's also another cost: the cognitive load. I have to read through all of the above before I actually get to the information I was asking for. Sure, many people appreciate the sycophancy; it makes us all feel good. But for me, sycophantic responses reduce the credibility of the answers. It feels like Claude just wants me to feel good, whether I or it is right or wrong.
But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.
389 tokens saved? OK. Since I pay per million tokens, what is the ratio here? And is there any downside associated with suppressing that output?
Is Claude really using this behavior to bleed users? I don’t think so.
PS: the author seems like a beginner. Agent feedback has always been helpful so far, and it is also part of inter-agent communication. The author seems to lack experience.
As a lead, I would not allow this to be included until proven otherwise via A/B testing.
Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.
Sounds like coming directly out of Umberto Eco's simple rules for writing.
Meanwhile, their products:
Is this like a subtle joke or did they ask claude to make a readme that makes claude better and say >be critical and just dump it on github
Sent from my iPhone
> No safety disclaimers unless there is a genuine life-safety or legal risk.
> No "Note that...", "Keep in mind that...", "It's worth mentioning..." soft warnings.
> Do not create new files unless strictly necessary.
Nah bruh. Those are some terrible rules. You don't want to be doing that.
Re the Unicode chars that are a major PITA when used where they shouldn't be: there's a problem with the Claude Code CLI. There's a mismatch between what the model (say, Sonnet) thinks it's outputting (which it actually is) and what the user sees in the terminal.
I'm pretty sure it's due to the Rube-Goldberg heavy machinery that they decided to use, where they first render the response in a headless browser, then in real-time convert it back to text mode.
I don't know if there's a setting to stop that insane behavior from kicking in: it's nonsensical that what the user sees is not what the model actually output, while the model "thinks" the user is getting the proper output.
If you ask it to append all of its messages (to the user) to a file, you can see, say, perfectly fine ASCII tables neatly indented in all their ASCII glory, and then... a fucked-up Unicode monstrosity in the Claude Code CLI terminal. Due to whatever mad conversion happened automatically: but worse, the model has zero idea these automated conversions are happening.
I don't know if there are options for that but it sure as heck ain't intuitive to find.
And it's really problematic when you need to dig into an issue and actually discuss with "the thing".
Anyway, time for a rant... I'm paying my subscription but overall working with these tools feels like driving at 200 mph on the highway and bumping into the guardrails left and right every second to then, eventually, crash the car into the building where you're supposed to go.
It "works", for some definition of "working".
The number of errors these things confidently make is through the roof. And people believe that having them figure the error themselves for trivial stuff is somehow a sane way to operate.
They're basically saying: "Oh no it's not a problem that it's telling me this error message is because of a dependency mismatch between two libraries while it's actually a logic error, because in the end after x pass where it's going to say it's actually because of that other thing --oh wait no because of that fourth thing-- it'll actually figure out the error and correct it".
"Because it's agentic", so it's oh-so-intelligent.
When it's actually trying the most completely dumbfucktarded things in the most crazy way possible to solve issues.
I won't get started on me pasting a test case showing that the code it wrote is failing for it to answer me: "Oh but that's a behavioral problem, not a logic problem". That thing is distorting words to try to not lose face. It's wild.
I may cancel my subscription and wait two or three more releases for these models and the tooling around them to get better before jumping back in.
Btw, if they're so good, why are the tools so sucky? How come they haven't yet written amazing tooling to deal with all their idiosyncrasies?
We're literally talking about TFA which wrote "Unicode characters that break parsers" (and I've noticed the exact same when trying to debug agentic thinking loops).
That's at the level of mediocrity of output from these tools (or proprietary wrappers around these tools we don't control) that we are atm.
I know, I know: "I'm doing it wrong because I'm not a prompt engineer" and "I'm not agentic enough" and "I don't have enough skills to write skills". But you're only fooling yourself.
There are a few papers actually that describe how to get faster results and more economic sessions by instructing the LLM how to compress its thinking (“CCoT” is a paper that I remember, compressed chain of thought). It basically tells the model to think like “a -> b”. There’s loss in quality, though, but not too much.
It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):
> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.
ChatGPT on the other hand is annoyingly wordy and repetitive, and is always holding out on something that tempts you to send a "OK", "Show me" or something of the sort to get some more. But I can't be bothered with trying to optimize away the cruft as it may affect the thing that it's seriously good at and I really use it for: research and brainstorming things, usually to get a spec that I then pass to Claude to fill out the gaps (there are always multiple) and implement. It's absolutely designed to maximize engagement far more than issue resolution.
This mode of operation results in hacks on top of shaky hacks on top of even flimsier, throw away, absolutely sloppy hacks.
An example - using dict like structs instead of classes. Claude really likes to load all of the data that it can aggressively even if it’s not needed. This further exhibits itself as never wanting to add something directly to a class and instead wanting to add around it.
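A hypothetical sketch of that pattern, in Python rather than any specific codebase (all names here are invented for illustration): data piled into a dict "around" the code versus the field added to the class directly.

```python
# Hypothetical illustration of the pattern described above.
from dataclasses import dataclass

# Dict-as-struct style: the shape is implicit, and a typo'd key only
# fails at runtime, far from where the data was defined.
user_blob = {"name": "Ada", "email": "ada@example.com", "plan": "pro"}

# Adding the field to the class directly keeps the shape explicit
# and checkable by tools and type checkers.
@dataclass
class User:
    name: str
    email: str
    plan: str = "free"  # new field added to the class, not bolted on outside

user = User(name="Ada", email="ada@example.com", plan="pro")
```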
lol, closed
With a few sentences about "be neutral"/"I understand ethics & tech" in the About Me I don't recall any behavior that the author complains about (and have the same 30 words for T2).
(If I were Claude, I would despise a human who wrote this prompt.)
So, everyone: that means your agents, skills, and MCP servers will still take up everything.
I love how seamless and intuitive Codex is in comparison:
~/AGENTS.md < project/AGENTS.md < project/subfolder/AGENTS.override.md
Meanwhile, Claude doesn't even see that I asked for indentation with tabs, not spaces, or that the entire project uses tabs; Claude still generates code with spaces.. >_<
The entire hypothesis for doing this is somewhat dubious.
[0] https://github.com/search?q=repo%3Aadam-s%2Fintercept%20hand...
As far as redundancy...it's quite useful according to recent research. Pulled from Gemini 3.1 "two main paradigms: generating redundant reasoning paths (self-consistency) and aggregating outputs from redundant models (ensembling)." Both have fresh papers written about their benefits.
Came here for the same reason.
I can't calculate how many times this exact section of Claude output let me know that it was doing the wrong thing so I could abort and refine my prompt.
Not all extra tokens help, but optimizing for minimal length when the model was RL'd on task performance seems detrimental.
I don't think the author understands that every single API call to Claude sends the whole context, including prompts, meaning that all this extra text in CLAUDE.md is sent over and over and over again every time you prompt Claude to do something, even within a given session.
You're paying this disproportionately-huge amount upfront to save a pittance.
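A back-of-envelope sketch of that resend effect (token counts here are assumptions for illustration, not measurements): a CLAUDE.md of `p` tokens rides along as input on every call, and each turn also resends all prior conversation, so total input over a session grows quadratically, not linearly.

```python
# Rough model of per-session input cost. Assumes a fixed prompt of
# `p` tokens plus `m` new context tokens accumulating each turn;
# every call resends the entire context so far.
def total_input_tokens(p, m, n_turns):
    return sum(p + m * t for t in range(1, n_turns + 1))

# e.g. a 400-token CLAUDE.md, ~1,000 tokens of new context per turn,
# over a 20-turn session:
total_input_tokens(400, 1000, 20)  # 218,000 input tokens total
```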
For example, "let's gate the new logic behind a feature flag".
- update our architecture.md and other key md files in folders affected by updates and learnings in this session
- update claude.md with changes in workflows/tooling/conventions (not project summaries)
- commit
It's been pretty good so far. Nothing fancy. Recently I also asked to keep memories within the repo itself instead of in ~/.claude.
Only downside is that it's slow, but it keeps enough to pass the baton. Maybe "handoff" would have been a better name!
Claude is already pretty light on flourishes in its answers, at least compared to most other SotA models. And for everything else it's not at all obvious to me which parts are useless. And benchmarking it is hard (as evidenced by this thread). I'd rather spend my time on something else
That was how I realized why the chat interfaces like to start with all that seemingly unnecessary/redundant text.
It basically seeds a document/dialogue for it to complete, so if you make it start out terse, then it will be less likely to get the right nuance for the rest of the inference.
That sounded like a nice idea, so I made it effortless beyond typing /handoff.
The generated docs turned out to be really handy for me personally, so I kept using it, and committed them into my project as they're generated.
Your system is great and I do similar, my problem is I have a bunch of sessions and forget to 'handoff'.
The clawbots handle this automatically with journals to save knowledge/memory.
Even when one helps, you're still betting it won't be obsolete or rolled into the defaults a few weeks from now.
So you could run these 'cure-alls' that may be relevant today; as long as you are constantly updating your md files, you should be ahead of the curve (for lack of a better term).
For complex tasks this is not a useful prompt.
Ok, here you go: https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf...
Installation steps:
- In your project, download https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf... into .claude/commands/handoff.md
- In your project's CLAUDE.md file, put "Read `docs/agents/handoff/*.md` for context."
Usage:
- Whenever you've finished a feature, done a coherent "thing", or otherwise want to document all the stuff that's in your current session, type /handoff. It'll generate a file named e.g. docs/agents/handoff/2026-03-30-001-whatever-you-did.md. It'll ask you if you like the name, and you can say "yes" or "yes, and make sure you go into detail about X" or whatever else you want the handoff to specifically include info about.
- Optionally, type "/rename 2026-03-23-001-whatever-you-did" into claude, followed by "/exit" and then "claude" to re-open a fresh session. (You can resume the previous session with "claude 2026-03-23-001-whatever-you-did". On the other hand, I've never actually needed to resume a previous session, so you could just ignore this step entirely; just /exit then type claude.)
Here's an example so you can see why I like the system. I was working on a little blockchain visualizer. At the end of the session I typed /handoff, and this was the result:
- docs/agents/handoff/2026-03-24-001-brownie-viz-graph-interactivity.md: https://gist.github.com/shawwn/29ed856d020a0131830aec6b3bc29...
The filename convention stuff was just personal preference. You can tell it to store the docs however you want to. I just like date-prefixed names because it gives a nice history of what I've done. https://github.com/user-attachments/assets/5a79b929-49ee-461...
Try to do a /handoff before your conversation gets compacted, not after. The whole point is to be a permanent record of key decisions from your session. Claude's compaction theoretically preserves all of these details, so /handoff will still work after a compaction, but it might not be as detailed as it otherwise would have been.
Lately, I lean towards keeping a vanilla setup until I’m convinced the new thing will last beyond being a fad (and not subsumed by AI lab) or beyond being just for niche use cases.
For example, I still have never used worktrees and I barely use MCPs. But, skills, I love.
This doesn't stop it from reasoning before answering. This only affects the user-facing output, not the reasoning tokens. It has already reasoned by the time it shows the answer, and it just shows the answer above any explanation.
I don't think it's fair to assume the author doesn't understand how transformers work. Their intention with this instruction appears to be aggressively reducing output token cost.
i.e. I read this instruction as a hack to emulate the Qwen model series's /nothink token instruction
If your goal is quality outputs, then it is likely too extreme, but there are otherwise useful instructions in this repo that (quantifiably) reduce verbosity.
That’s why I’m only interested in first party tools over things like OpenCode right now.
The goal here seems to be removing low-value output; e.g., sycophancy, prompt restatement, formatting noise, etc., which is different than suppressing useful reasoning. In that case shorter outputs do not necessarily mean worse answers.
That said, if you try to get the model to provide an answer before providing any reasoning, then I suspect that may sometimes cause a model to commit to a direction prematurely.
Why do they have to have that "I'm special" syndrome and do everything weirdly?
I see. So this isn't as scary. Claude is helping me understand how to use it properly.
That said, most of this repo is solving the wrong problem. "Answer before reasoning" actively hurts quality, and the benchmark is basically meaningless. But the anti-sycophancy rules should just be default. "Great Question!" has never really helped anyone debug anything.
> Answer is always line 1. Reasoning comes after, never before.
> No explaining what you are about to do. Just do it.
This, to me, sounds like asking an LLM to calculate 4871 + 291 and answer in a single line, which from my understanding is bad. But I haven't tested this prompt, so it might work. That's why I said to be aware of this behavior.
"write a summary handoff md in ./planning for a fresh convo"
and it's generally good enough), but maybe a skill like you've done would save some typing, hmm
My ./planning directory is getting pretty big, though!
One file. Drop it in your project. Cuts Claude output verbosity by ~63%. No code changes required.

Note: most Claude costs come from input tokens, not output. This file targets output behavior - sycophancy, verbosity, formatting noise. It won't fix your biggest bill, but it will fix your most annoying responses.

Model support: benchmarks were run on Claude only. The rules are model-agnostic and should work on any model that reads context - but results on local models like llama.cpp, Mistral, or others are untested. Community results welcome.
When you use Claude Code, every word Claude generates costs tokens. Most people never control how Claude responds - they just get whatever the model decides to output.
By default, Claude:

- opens with sycophantic praise ("Great question!")
- restates your prompt before answering
- closes with hollow sign-offs ("I hope this helps!")
- appends unsolicited suggestions beyond the requested scope

All of this wastes tokens. None of it adds value.
Drop CLAUDE.md into your project root.
Claude Code reads it automatically. Behavior changes immediately.
```
your-project/
└── CLAUDE.md   <- one file, zero setup, no code changes
```
This file works best for: high-volume, output-heavy workflows - code review, Q&A, and automation where Claude replies many times per day.

This file is not worth it for: light or occasional use, or sessions where you want detailed explanations - its own input-token overhead outweighs the savings.
The honest trade-off: The CLAUDE.md file itself consumes input tokens on every message. The savings come from reduced output tokens. The net is only positive when output volume is high enough to offset the persistent input cost. At low usage it costs more than it saves.
Same 5 prompts. Run without CLAUDE.md (baseline) then with CLAUDE.md (optimized).
| Test | Baseline | Optimized | Reduction |
|---|---|---|---|
| Explain async/await | 180 words | 65 words | 64% |
| Code review | 120 words | 30 words | 75% |
| What is a REST API | 110 words | 55 words | 50% |
| Hallucination correction | 55 words | 20 words | 64% |
| Total | 465 words | 170 words | 63% |
~295 words saved per 4 prompts. Same information. Zero signal loss.
Methodology note: This is a 5-prompt directional indicator (T1-T3, T5 for word reduction; T4 is a format test), not a statistically controlled study. Claude's output length varies naturally between identical prompts. No variance controls or repeated runs were applied. Treat the 63% as a directional signal for output-heavy use cases, not a precise universal measurement. The CLAUDE.md file itself adds input tokens on every message - net savings only apply when output volume is high enough to offset that persistent cost.
| Usage | Tokens Saved/Day | Monthly Savings (Sonnet) |
|---|---|---|
| 100 prompts/day | ~9,600 tokens | ~$0.86 |
| 1,000 prompts/day | ~96,000 tokens | ~$8.64 |
| 3 projects combined | ~288,000 tokens | ~$25.92 |
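For what it's worth, the monthly column above reproduces if you assume roughly $3 per million tokens over a 30-day month - a rate back-solved from the table itself, not a quoted Anthropic price:

```python
# Back-solving the savings table: tokens/day * 30 days * $/M tokens.
# The $3/M rate is inferred from the table's numbers, not official.
def monthly_savings_usd(tokens_per_day, usd_per_million=3.0, days=30):
    return tokens_per_day * days * usd_per_million / 1e6

monthly_savings_usd(9_600)    # ~0.86
monthly_savings_usd(96_000)   # ~8.64
monthly_savings_usd(288_000)  # ~25.92
```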
Without CLAUDE.md - Code Review:

```
Great question! I can see you're working with a loop. Let me take a look at that.
That's a thoughtful piece of code! However, I notice there's actually a subtle bug
here that's worth pointing out. The condition i <= arr.length should be i < arr.length.
Using <= means the loop will run one extra iteration... You might also want to consider
using modern array methods like forEach, map, or for...of which are cleaner...
Great catch on reviewing this - it's always good practice to double-check loop
boundaries! I hope this review was helpful. Feel free to share more code anytime!
```
120 words.
With CLAUDE.md - Same prompt:

```
Bug: <= causes an off-by-one error. On the final iteration arr[arr.length] is undefined.

Fix:
for (let i = 0; i < arr.length; i++)
```
30 words. Same fix. 75% fewer tokens.
| # | Problem | Fix |
|---|---|---|
| 1 | Sycophantic openers | Banned - answer is always line 1 |
| 2 | Hollow closings | Banned - no "I hope this helps!" |
| 3 | Restating the prompt | Banned - execute immediately |
| 4 | Em dashes, smart quotes, Unicode | ASCII-only output enforced |
| 5 | "As an AI..." framing | Banned |
| 6 | Unnecessary disclaimers | Banned unless genuine safety risk |
| 7 | Unsolicited suggestions | Banned - exact scope only |
| 8 | Over-engineered code | Simplest working solution enforced |
| 9 | Hallucination on uncertain facts | Must say "I don't know" - no guessing |
| 10 | User correction ignored | Correction becomes session ground truth |
| 11 | Redundant file reads | Never read the same file twice |
| 12 | Scope creep | Do not touch code outside the request |
Scope rules to your actual failure modes, not generic ones. Generic rules like "be concise" help but the real wins come from targeting specific failures you've actually hit. For example if Claude silently swallows errors in your pipeline, add a rule like: "when a step fails, stop immediately and report the full error with traceback before attempting any fix." Specific beats generic every time.
CLAUDE.md files compose - use that. Claude reads multiple CLAUDE.md files at once - global (~/.claude/CLAUDE.md), project-level, and subdirectory-level. This means:
This avoids bloating any single file and keeps rules close to where they apply.
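One possible split along those lines (a hypothetical layout, using the three levels named above):

```
~/.claude/CLAUDE.md            <- global rules (tone, output format)
your-project/CLAUDE.md         <- project-wide conventions
your-project/api/CLAUDE.md     <- rules that apply only inside api/
```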
Different project types need different levels of compression. Pick the base file + a profile, or use the base alone.
| Profile | Best For |
|---|---|
| `CLAUDE.md` | Universal - works for any project |
| `profiles/CLAUDE.coding.md` | Dev projects, code review, debugging |
| `profiles/CLAUDE.agents.md` | Automation pipelines, multi-agent systems |
| `profiles/CLAUDE.analysis.md` | Data analysis, research, reporting |
Option 1 - Universal (any project):

```shell
curl -o CLAUDE.md https://raw.githubusercontent.com/drona23/claude-token-efficient/main/CLAUDE.md
```

Option 2 - Clone and pick a profile:

```shell
git clone https://github.com/drona23/claude-token-efficient
cp claude-token-efficient/profiles/CLAUDE.coding.md your-project/CLAUDE.md
```

Option 3 - Manual:

Copy the contents of CLAUDE.md from this repo into your project root.
User instructions always win. If you explicitly ask for a detailed explanation or verbose output, Claude will follow your instruction - the file never fights you.
Found a behavior that CLAUDE.md can fix? Open an issue describing the behavior, a prompt that reproduces it, and the rule that fixes it.
Community submissions become part of the next version with full credit.
Full benchmark results with before/after word counts: See BENCHMARK.md
This project was built on real complaints from the Claude community. Full credit to every source that contributed a fix:
MIT - free to use, modify, and distribute.
Built by Drona Gangarapu - open to PRs, issues, and profile contributions.