I.e., by demanding that the model be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
"Interesting idea! Token consumption sure is an issue that should be addressed, and this is pretty funny too! However, I happen to have an unproven claim that tokens are units of thinking, and therefore, reducing the token count might actually reduce the model's capabilities. Did anybody using this by chance notice any degradation (since I did not bother to check myself)?"
Have a nice day!
Thank God there are still neverending wars; otherwise authoritarian governments would have no fun left.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn't this just be a UserPromptSubmit hook with a regex against these phrases?
See `additionalContext` in the structured JSON output of a hook script: https://code.claude.com/docs/en/hooks#structured-json-output
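A minimal sketch of such a hook, assuming the stdin/stdout JSON shape described in the structured-output doc above (the exact field names are worth double-checking against the current docs):

```python
#!/usr/bin/env python3
"""UserPromptSubmit hook sketch: if the user's prompt matches a caveman
trigger phrase, inject an instruction via additionalContext.
The JSON field names follow the hooks doc linked above; verify them
against the current docs before relying on this."""
import json
import re
import sys

TRIGGERS = re.compile(
    r"caveman mode|talk like caveman|use caveman|less tokens|be brief|/caveman",
    re.IGNORECASE,
)

def hook_output(prompt: str) -> dict:
    """Return the hook's JSON output for a given user prompt."""
    if not TRIGGERS.search(prompt):
        return {}  # no match: emit nothing, prompt passes through untouched
    return {
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in caveman mode: short words, no filler.",
        }
    }

if __name__ == "__main__":
    payload = json.load(sys.stdin)  # hook input arrives as JSON on stdin
    print(json.dumps(hook_output(payload.get("prompt", ""))))
```

The `additionalContext` string here is illustrative; the real skill's instructions are longer.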
For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal evals, but I've heard it neither hurts nor improves performance. At minimum, this parameter may affect how many comments end up in the code.
Thanks to chain of thought, having the LLM be explicit in its output actually lets it produce higher-quality answers.
Mass fun. Starred.
I have a feeling these same people will complain "my model is so dumb!". There's a reason why Claude had that "you're absolutely right!" for a while. Or Codex's "you're right to push on this".
Weβre basically just gaslighting GPUs. That wall of text is kinda needed right now.
Not sure how effective it will be at driving down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
tl;dr: Claude skill, short output, ++good.
But combining this with caveman? Gold!
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict the next move in a chess game: it's not going to predict the best move it can, but rather the next move that would be played, given what it can infer about the Elo rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move, since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
But I assume this has been studied? Can anyone point to papers that show it? I'd particularly like to know what the curves look like; it's clearly not linear, so if you cut out 75% of tokens, what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data, so results may be worse because they don't fit the patterns that were reinforcement-learned in.
It's a significantly more succinct semantic encoding than English while still being able to express all the same concepts, since it folds a lot of glue words into the grammar of the language and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
All languages must have means for marking the syntactic roles of the words in a sentence.
The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages).
English has somewhat fewer syntactic role markers than other languages because it has a rigid word order, but for roles other than the most frequent ones (agent, patient, beneficiary) it has a lot of prepositions.
Despite being more economical with role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are dropped in many other languages.
Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".
Really? Has basic reading become a Herculean task?
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.
So it was a sign of a "noob", rather than a mark of sophistication and literacy.
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but pretending they did not address the obvious criticism, when they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
LLMs do stumble into long prediction chains that don't lead the inference in any useful direction, wasting tokens and compute.
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking a problem to the point of reaching a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, only that it is possible, and that it is more or less likely for some kinds of problems than others, with the problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
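To make the scaling claim concrete with a toy model (simplified: it counts single unlucky tokens rather than runs of two or more, and the per-token probability p is purely illustrative): if each emitted token independently has a small probability p of being an unlucky low-likelihood pick, the chance of at least one in an n-token response is 1 - (1 - p)^n, which is approximately n·p, i.e. roughly linear in response length, as long as n·p stays well below 1.

```python
# Toy model: chance of at least one "unlucky" low-probability token in an
# n-token response, assuming independence. p = 0.001 is illustrative only.
def p_at_least_one(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.001
for n in (100, 1000):
    # exact value vs. the linear approximation n * p
    print(n, round(p_at_least_one(p, n), 4), "linear approx:", n * p)
```

For n = 100 the exact value (~0.095) tracks the linear approximation (0.1) closely; by n = 1000 the approximation overshoots, but the qualitative point stands: longer responses mean more chances to go off the rails.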
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.
https://arxiv.org/abs/2112.00114 https://arxiv.org/abs/2406.06467 https://arxiv.org/abs/2404.15758 https://arxiv.org/abs/2512.12777
First that scratchpads matter, then why they matter, then that they don't even need to be meaningful tokens, then a conceptual framework for the whole thing.
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.
There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".
And it'll also bias the reasoning and output toward internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does.
Did you test that "caveman mode" has similar performance to the "normal" model?
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimal for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
A lot of communication is just mentioning the concepts.
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
This is so funny
Nobody has to prove anything, but evidence gives a claim credibility. If you provide none, an opposing claim without proof is no worse than yours.
why use many token when few do trick
Install • Benchmarks • Before/After • Why
A Claude Code skill/plugin and Codex plugin that makes agent talk like caveman, cutting ~75% of tokens while keeping full technical accuracy.
Based on the viral observation that caveman-speak dramatically reduces LLM token usage without losing technical substance. So we made it a one-line install.
| 🗣️ Normal Claude (69 tokens) | 🪨 Caveman Claude (19 tokens) |
|---|---|

| 🗣️ Normal Claude | 🪨 Caveman Claude |
|---|---|
Same fix. 75% less word. Brain still big.
Real token counts from the Claude API (reproduce it yourself):
| Task | Normal (tokens) | Caveman (tokens) | Saved |
|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | 87% |
| Fix auth middleware token expiry | 704 | 121 | 83% |
| Set up PostgreSQL connection pool | 2347 | 380 | 84% |
| Explain git rebase vs merge | 702 | 292 | 58% |
| Refactor callback to async/await | 387 | 301 | 22% |
| Architecture: microservices vs monolith | 446 | 310 | 30% |
| Review PR for security issues | 678 | 398 | 41% |
| Docker multi-stage build | 1042 | 290 | 72% |
| Debug PostgreSQL race condition | 1200 | 232 | 81% |
| Implement React error boundary | 3454 | 456 | 87% |
| Average | 1214 | 294 | 65% |
Range: 22%–87% savings across prompts.
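The table above can be reproduced with a short script against the Messages API. The sketch below is an assumed harness, not the repo's actual benchmark code: the model name, prompts, and caveman system prompt are placeholders, and running it requires the `anthropic` package plus an API key.

```python
"""Sketch for reproducing the benchmark table: run each prompt with and
without a caveman-style system prompt, then compare output token counts.
Model name, prompts, and system prompt are assumptions, not the repo's
actual benchmark setup."""
import os

def savings_pct(normal_tokens: int, caveman_tokens: int) -> int:
    # Percent of output tokens saved, rounded as in the "Saved" column above.
    return round(100 * (normal_tokens - caveman_tokens) / normal_tokens)

def run_benchmark(prompts, model="claude-sonnet-4-5"):  # model name is a placeholder
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    caveman = "Talk like caveman. Short words. No filler. Keep code and technical terms exact."
    for prompt in prompts:
        normal = client.messages.create(model=model, max_tokens=4096,
                                        messages=[{"role": "user", "content": prompt}])
        cave = client.messages.create(model=model, max_tokens=4096, system=caveman,
                                      messages=[{"role": "user", "content": prompt}])
        n, c = normal.usage.output_tokens, cave.usage.output_tokens
        print(f"{prompt[:40]:40} {n:6} {c:6} {savings_pct(n, c):4}%")

if __name__ == "__main__" and os.getenv("RUN_CAVEMAN_BENCH"):  # opt-in: API calls cost money
    run_benchmark(["Explain git rebase vs merge",
                   "Refactor callback to async/await"])
```

Output tokens vary run to run, so expect the savings percentages to wobble around the table's values rather than match them exactly.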
[!IMPORTANT] Caveman only affects output tokens; thinking/reasoning tokens are untouched. Caveman no make brain smaller. Caveman make mouth smaller. Biggest win is readability and speed; cost savings are a bonus.
A March 2026 paper "Brevity Constraints Reverse Performance Hierarchies in Language Models" found that constraining large models to brief responses improved accuracy by 26 percentage points on certain benchmarks and completely reversed performance hierarchies. Verbose not always better. Sometimes less word = more correct.
npx skills add JuliusBrussee/caveman
Or with Claude Code plugin system:
claude plugin marketplace add JuliusBrussee/caveman
claude plugin install caveman@caveman
Codex:
/plugins
Install once. Use in all sessions after that.
One rock. That it.
Trigger with:
/caveman or Codex $caveman
Stop with: "stop caveman" or "normal mode"
| Thing | Caveman Do? |
|---|---|
| English explanation | 🪨 Caveman smash filler words |
| Code blocks | ✍️ Write normal (caveman not stupid) |
| Technical terms | 🧠 Keep exact (polymorphism stay polymorphism) |
| Error messages | 📋 Quote exact |
| Git commits & PRs | ✍️ Write normal |
| Articles (a, an, the) | 🚫 Gone |
| Pleasantries | 💀 "Sure I'd be happy to" is dead |
| Hedging | 🦴 "It might be worth considering" extinct |
┌───────────────────────────────────────┐
│ TOKENS SAVED        ████████  75%     │
│ TECHNICAL ACCURACY  ████████  100%    │
│ SPEED INCREASE      ████████  ~3x     │
│ VIBES               ████████  OOG     │
└───────────────────────────────────────┘
Caveman not dumb. Caveman efficient.
Normal LLM waste token on:
Caveman say what need saying. Then stop.
If caveman save you mass token, mass money: leave mass star. ⭐
MIT: free like mass mammoth on open plain.