I.e., by demanding that the model be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
"Interesting idea! Token consumption sure is an issue that should be addressed, and this is pretty funny too! However, I happen to have an unproven claim that tokens are units of thinking, and therefore, reducing the token count might actually reduce the model's capabilities. Did anybody using this by chance notice any degradation (since I did not bother to check myself)?"
Have a nice day!
Thank God there are still neverending wars; otherwise authoritarian governments would have no fun left.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn't this just be a UserPromptSubmit hook with a regex against these phrases?
See `additionalContext` in the structured JSON output of a hook script: https://code.claude.com/docs/en/hooks#structured-json-output
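A minimal sketch of such a hook, assuming the stdin/stdout JSON shape described in the structured-output doc above (the exact field names are worth double-checking against the current docs):

```python
#!/usr/bin/env python3
"""UserPromptSubmit hook sketch: if the user's prompt matches a caveman
trigger phrase, inject an instruction via additionalContext.
The JSON field names follow the hooks doc linked above; verify them
against the current docs before relying on this."""
import json
import re
import sys

TRIGGERS = re.compile(
    r"caveman mode|talk like caveman|use caveman|less tokens|be brief|/caveman",
    re.IGNORECASE,
)

def hook_output(prompt: str) -> dict:
    """Return the hook's JSON output for a given user prompt."""
    if not TRIGGERS.search(prompt):
        return {}  # no match: emit nothing, prompt passes through untouched
    return {
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in caveman mode: short words, no filler.",
        }
    }

if __name__ == "__main__":
    payload = json.load(sys.stdin)  # hook input arrives as JSON on stdin
    print(json.dumps(hook_output(payload.get("prompt", ""))))
```

The `additionalContext` string here is illustrative; the real skill's instructions are longer.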
For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal evals, but I've heard it neither hurts nor improves performance. At minimum, this parameter may affect how many comments end up in the code.
Thanks to chain of thought, having the LLM be explicit in its output actually lets it produce higher-quality answers.
Mass fun. Starred.
I have a feeling these same people will complain "my model is so dumb!". There's a reason why Claude had that "you're absolutely right!" for a while. Or Codex's "you're right to push on this".
Weβre basically just gaslighting GPUs. That wall of text is kinda needed right now.
Not sure how effective it will be at driving down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
tl;dr: Claude skill, short output, ++good.
But combining this with caveman? Gold!
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict the next move in a chess game: it's not going to predict the best move it can, but rather the next move that would be played, given what it can infer about the Elo rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move, since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
But I assume this has been studied? Can anyone point to papers that show it? I'd particularly like to know what the curves look like; it's clearly not linear, so if you cut out 75% of tokens, what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data, so results may be worse because they don't fit the patterns that were reinforcement-learned in.
It's a significantly more succinct semantic encoding than English while still being able to express all the same concepts, since it folds a lot of glue words into the grammar of the language and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
All languages must have means for marking the syntactic roles of the words in a sentence.
The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages).
English has somewhat fewer syntactic role markers than other languages because it has a rigid word order, but for roles other than the most frequent ones (agent, patient, beneficiary) it has a lot of prepositions.
Despite being more economical with role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are dropped in many other languages.
Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".
Really? Has basic reading become a Herculean task?
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.
So it was a sign of a "noob", rather than a mark of sophistication and literacy.
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but pretending they did not address the obvious criticism, when they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
LLMs do stumble into long prediction chains that don't lead the inference in any useful direction, wasting tokens and compute.
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking a problem to the point of reaching a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, only that it is possible, and that it is more or less likely for some kinds of problems than others, with the problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
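To make the scaling claim concrete with a toy model (simplified: it counts single unlucky tokens rather than runs of two or more, and the per-token probability p is purely illustrative): if each emitted token independently has a small probability p of being an unlucky low-likelihood pick, the chance of at least one in an n-token response is 1 - (1 - p)^n, which is approximately n·p, i.e. roughly linear in response length, as long as n·p stays well below 1.

```python
# Toy model: chance of at least one "unlucky" low-probability token in an
# n-token response, assuming independence. p = 0.001 is illustrative only.
def p_at_least_one(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.001
for n in (100, 1000):
    # exact value vs. the linear approximation n * p
    print(n, round(p_at_least_one(p, n), 4), "linear approx:", n * p)
```

For n = 100 the exact value (~0.095) tracks the linear approximation (0.1) closely; by n = 1000 the approximation overshoots, but the qualitative point stands: longer responses mean more chances to go off the rails.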
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.
https://arxiv.org/abs/2112.00114 https://arxiv.org/abs/2406.06467 https://arxiv.org/abs/2404.15758 https://arxiv.org/abs/2512.12777
First that scratchpads matter, then why they matter, then that they don't even need to be meaningful tokens, then a conceptual framework for the whole thing.
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.
There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".
And it'll also bias the reasoning and output toward internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does.
Did you test that "caveman mode" has similar performance to the "normal" model?
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimal for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
A lot of communication is just mentioning the concepts.
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
This is so funny
Nobody has to prove anything, but evidence gives a claim credibility. If you provide none, an opposing claim without proof is no worse than yours.
why use many token when few do trick
Install • Benchmarks • Before/After • Why
A Claude Code skill/plugin and Codex plugin that makes agent talk like caveman, cutting ~75% of tokens while keeping full technical accuracy.
Based on the viral observation that caveman-speak dramatically reduces LLM token usage without losing technical substance. So we made it a one-line install.
| 🗣️ Normal Claude (69 tokens) | 🪨 Caveman Claude (19 tokens) |
|---|---|

| 🗣️ Normal Claude | 🪨 Caveman Claude |
|---|---|
Same fix. 75% less word. Brain still big.
Real token counts from the Claude API (reproduce it yourself):
| Task | Normal (tokens) | Caveman (tokens) | Saved |
|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | 87% |
| Fix auth middleware token expiry | 704 | 121 | 83% |
| Set up PostgreSQL connection pool | 2347 | 380 | 84% |
| Explain git rebase vs merge | 702 | 292 | 58% |
| Refactor callback to async/await | 387 | 301 | 22% |
| Architecture: microservices vs monolith | 446 | 310 | 30% |
| Review PR for security issues | 678 | 398 | 41% |
| Docker multi-stage build | 1042 | 290 | 72% |
| Debug PostgreSQL race condition | 1200 | 232 | 81% |
| Implement React error boundary | 3454 | 456 | 87% |
| Average | 1214 | 294 | 65% |
Range: 22%–87% savings across prompts.
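The table above can be reproduced with a short script against the Messages API. The sketch below is an assumed harness, not the repo's actual benchmark code: the model name, prompts, and caveman system prompt are placeholders, and running it requires the `anthropic` package plus an API key.

```python
"""Sketch for reproducing the benchmark table: run each prompt with and
without a caveman-style system prompt, then compare output token counts.
Model name, prompts, and system prompt are assumptions, not the repo's
actual benchmark setup."""
import os

def savings_pct(normal_tokens: int, caveman_tokens: int) -> int:
    # Percent of output tokens saved, rounded as in the "Saved" column above.
    return round(100 * (normal_tokens - caveman_tokens) / normal_tokens)

def run_benchmark(prompts, model="claude-sonnet-4-5"):  # model name is a placeholder
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    caveman = "Talk like caveman. Short words. No filler. Keep code and technical terms exact."
    for prompt in prompts:
        normal = client.messages.create(model=model, max_tokens=4096,
                                        messages=[{"role": "user", "content": prompt}])
        cave = client.messages.create(model=model, max_tokens=4096, system=caveman,
                                      messages=[{"role": "user", "content": prompt}])
        n, c = normal.usage.output_tokens, cave.usage.output_tokens
        print(f"{prompt[:40]:40} {n:6} {c:6} {savings_pct(n, c):4}%")

if __name__ == "__main__" and os.getenv("RUN_CAVEMAN_BENCH"):  # opt-in: API calls cost money
    run_benchmark(["Explain git rebase vs merge",
                   "Refactor callback to async/await"])
```

Output tokens vary run to run, so expect the savings percentages to wobble around the table's values rather than match them exactly.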
[!IMPORTANT] Caveman only affects output tokens; thinking/reasoning tokens are untouched. Caveman no make brain smaller. Caveman make mouth smaller. Biggest win is readability and speed; cost savings are a bonus.
A March 2026 paper "Brevity Constraints Reverse Performance Hierarchies in Language Models" found that constraining large models to brief responses improved accuracy by 26 percentage points on certain benchmarks and completely reversed performance hierarchies. Verbose not always better. Sometimes less word = more correct.
npx skills add JuliusBrussee/caveman
Or with Claude Code plugin system:
claude plugin marketplace add JuliusBrussee/caveman
claude plugin install caveman@caveman
Codex:
/plugins
Install once. Use in all sessions after that.
One rock. That it.
Trigger with:
/caveman or Codex $caveman
Stop with: "stop caveman" or "normal mode"
| Thing | Caveman Do? |
|---|---|
| English explanation | 🪨 Caveman smash filler words |
| Code blocks | ✍️ Write normal (caveman not stupid) |
| Technical terms | 🧠 Keep exact (polymorphism stay polymorphism) |
| Error messages | 📋 Quote exact |
| Git commits & PRs | ✍️ Write normal |
| Articles (a, an, the) | 🚫 Gone |
| Pleasantries | 💀 "Sure I'd be happy to" is dead |
| Hedging | 🦴 "It might be worth considering" extinct |
┌───────────────────────────────────────┐
│ TOKENS SAVED        ████████  75%     │
│ TECHNICAL ACCURACY  ████████  100%    │
│ SPEED INCREASE      ████████  ~3x     │
│ VIBES               ████████  OOG     │
└───────────────────────────────────────┘
Caveman not dumb. Caveman efficient.
Normal LLM waste token on:
Caveman say what need saying. Then stop.
If caveman save you mass token, mass money: leave mass star. ⭐
MIT: free like mass mammoth on open plain.