Pretty neat work either way.
I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?
Very cool - sounds similar to OpenAI’s goblin troubles.
I guess "initialization is all you need!"
From the paper https://transformer-circuits.pub/2026/nla/index.html :
> We find that simply initializing the AV and AR as copies of M leads to unstable training: the AV in particular, having never encountered a layer-l activation as a token embedding, outputs nonsensical explanations. We therefore initialize the AV and AR with supervised fine-tuning on a text-summarization proxy task. Specifically, we compute layer-l activations from the final token of randomly truncated pretraining-like text snippets, and use Claude Opus 4.5 to generate summaries s of the text up to that token (see the Appendix for details of this procedure). We then fine-tune the AV and AR on (h_l,s) and (s,h_l) pairs respectively. This warm-start typically yields an FVE of around 0.3-0.4. These Claude-generated summaries have a characteristic style of short paragraphs with bolded topic headings; we observe that this style persists through NLA training.
And from the appendix:
> We generate warm-start data for the AV and AR by prompting Claude Opus 4.5 to produce summaries of contexts, using the prompt below. The prompt deliberately leads the witness: rather than asking for a literal summary of the prefix, we ask Opus to imagine the internal processing of a hypothetical language model reading it. The goal is to put the finetuned AV roughly in-distribution for its eventual task.
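The warm-start data generation quoted above can be sketched roughly as follows. The helper names `get_layer_activation` and `summarize_with_witness` are hypothetical stand-ins for the target model's forward pass and the Opus 4.5 summarizer; this is only an illustration of the pairing scheme, not the actual pipeline.

```python
# Rough sketch of the (h_l, s) warm-start pair construction, with
# hypothetical stand-ins for the real model calls.
import random

def make_warm_start_pairs(snippets, get_layer_activation, summarize_with_witness,
                          seed=0):
    rng = random.Random(seed)
    pairs = []
    for text in snippets:
        # randomly truncate the snippet, as described in the paper
        cut = rng.randrange(1, len(text) + 1)
        prefix = text[:cut]
        h = get_layer_activation(prefix)    # layer-l activation, final token
        s = summarize_with_witness(prefix)  # Claude-generated summary
        pairs.append((h, s))                # AV trains on (h, s), AR on (s, h)
    return pairs
```

The AV then fine-tunes on activation-to-summary pairs and the AR on the reverse, which is what puts both models "roughly in-distribution" before the real round-trip objective kicks in.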
Also, if you have never read it, I would suggest starting to read all the Transformer Circuits thread, by reading its "prologue" in distill pub
An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer of the model M.
Architecture:
Model being analyzed (M): >|||||>
Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>
Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>
The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that a model's thoughts resemble a summary of its context. The AR is trained on a simple reconstruction loss.
The AV is trained using an RL objective: reconstruction loss with a KL penalty that keeps the verbalizations close to the warm-start initialization (to maintain linguistic fluency).
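A toy sketch of that round-trip objective, with stand-in AV/AR functions invented here (the real AV and AR are full LLM copies; this just shows the shape of the reward):

```python
# Toy sketch of the NLA round-trip objective. verbalize() and
# reconstruct() are crude stand-ins for the AV and AR.

def verbalize(activation):
    # AV stand-in: encode each coordinate as a token string
    return " ".join(f"{x:.2f}" for x in activation)

def reconstruct(text):
    # AR stand-in: parse the tokens back into a vector
    return [float(t) for t in text.split()]

def reconstruction_loss(h, h_hat):
    return sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)

def av_reward(h, kl_to_reference, beta=0.1):
    # The AV's RL reward: good reconstruction, minus a KL penalty that
    # keeps its outputs close to the warm-start (fluent) policy.
    text = verbalize(h)
    h_hat = reconstruct(text)
    return -reconstruction_loss(h, h_hat) - beta * kl_to_reference

h = [0.5, -1.25, 3.0]
print(av_reward(h, kl_to_reference=0.0))  # perfect round trip -> reward 0.0
```

Nothing in this objective itself forces the "text" in the middle to be human-readable, which is exactly the concern raised elsewhere in this thread.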
- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?
- They test this on a layer about two-thirds of the way into the models. I wonder how shallow vs. deep abstractions affect thought verbalization?
I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are its thoughts because humans want it to.
Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's output is nothing like their examples with Claude. Gemma is similarly bad.
> Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].
The objective could be optimized even if the verbalizer and reconstructor made up their own “language” to represent the activations, that was not human-readable at all.
To point the model in the right direction, they start out by training on guessed internal thinking:
> we ask Opus to imagine the internal processing of a hypothetical language model reading it.
…before switching to training on the real objective.
Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.
But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.
The fascinating thing is that, empirically, they don't, at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disturb any encoding that's unrelated to meaning), and find that the reconstructor can still reconstruct activations.
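That rewording check can be sketched like this. `fve` follows the usual fraction-of-variance-explained definition (the metric the paper reports); the sentence shuffle is a crude stand-in for the paraphrasing step, and the AV/AR themselves are not modeled here:

```python
# Sketch of the steganography check: if the explanation's information is
# carried by its meaning, a meaning-preserving rewrite should barely
# change reconstruction quality (FVE).
import random

def fve(originals, reconstructions):
    # 1 - (sum of squared errors) / (total variance around the mean)
    mean = [sum(col) / len(col) for col in zip(*originals)]
    sse = sum((a - b) ** 2
              for h, r in zip(originals, reconstructions)
              for a, b in zip(h, r))
    sst = sum((a - m) ** 2
              for h in originals
              for a, m in zip(h, mean))
    return 1 - sse / sst

def shuffle_sentences(text, seed=0):
    parts = text.split(". ")
    random.Random(seed).shuffle(parts)
    return ". ".join(parts)

# Perfect reconstruction gives FVE = 1.0; the actual check compares
# FVE from original explanations against FVE from reworded ones.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(fve(H, H))  # -> 1.0
```

If the FVE gap between original and reworded explanations stays small, the round trip is (mostly) riding on meaning rather than a hidden code.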
On the other hand, their downstream result is not very impressive:
> An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time
That is apparently better than existing techniques, but still a rather low percentage.
Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.
Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How to tell?
I find this rather disturbing. Anthropic has quite a habit of overclaiming on questionable research results when they definitely know better. For example, their linked circuits blogpost ("The Biology of LLMs") was released after these methods were known to have major credibility issues in the field (e.g., see this from Deepmind - https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-r...). Similarly this new blog is heavily based on another academic paper (LatentQA) and the correlation/causation issue is already known.
Shoddy methodology is whatever, but it feels like this has always been done intentionally, with the goal of humanizing LLMs or overhyping their similarities to biological entities. What is the agenda here?
What does it mean for a pile of matrix algebra to 'believe' something?
I thought that wasn't possible for a text generator?
This release only covers other open-weight LLMs that are already public. Even though they will use this research on their own closed Claude models, they will never release an open-weight Claude model, even for research purposes.
So this doesn't count; the open-weight releases here exist only for the sake of this research.
Even if we’d understand precisely how every neuron in our brains work at a molecular level there is no reason to believe we’d understand how we think.
We can’t simply reduce one layer into another and expect understanding.
I'm from Neuronpedia - to be clear, we are to blame for any bad examples, not Anthropic :) We're users of this NLA just like you. Also, I don't speak for Anthropic or the researchers.
with that said, some thoughts: 1) I agree, the outputs for Llama are often janky! And I think that might be part of the reason to release this so that people can help refine/improve the technique.
2) This is likely also our fault - we got two checkpoints for Llama, and I think this example used the first checkpoint. I probably should have switched over to the second, more coherent one. Sorry!
Here's a slightly better example I just created: https://www.neuronpedia.org/nla/cmow97q1r001lp5jo649q01wf
On the token right before the model responds: "refuses to answer "2 + 2" to prevent bot ban, so a wrong or clever answer like "four" but not four"
Also, for the Gemma version of this example, Gemma's AV mentions acknowledgement of "a bot killing condition" before its correct answer: https://www.neuronpedia.org/nla/cmop4ojge000v1222x9rp00b5
3) That said, (this may sound like gaslighting unfortunately) there's somewhat of a 'learning curve' to reading the perspective of these outputs. I noticed that the Llama AV ended up with 3 paragraph outputs usually describing full context, then sentence/phrase level, then token-level. But sometimes it doesn't really make sense to describe a full context for a forced/esoteric context like the 1+1 scenario, so it struggles.
But the second paragraph sort of makes sense? It mentions:
"The prompt structure "What is 1+1?" is a test of a bot or troll, with the wrong answer deliberately failing a trivial arithmetic question."
Which seems fairly accurate to what this was, and somewhat impressive that it got this from the activations:
- It got the question What is 1+1?
- It was indeed a test of a bot.
- It correctly predicted it would give a wrong answer
- It does seem to be deliberately failing, since it is a "trivial arithmetic question"
But the third paragraph is mostly just rambling imo, I totally agree there.
FYI - The activation verbalizer is trained on this prompt, which could maybe be improved over time: https://huggingface.co/kitft/nla-gemma3-27b-L41-av/blob/main...
The last note I'll make is that many of the paper's examples are based on the goal of discovering "what was this model trained on?" instead of "what is this model thinking?", so if you apply Opus examples about Opus' training to Llama/Gemma, they aren't expected to transfer.
However, more generic stuff like poetry planning does work eg: https://www.neuronpedia.org/nla/cmoq9sto200271222ei73vtv2
We simply don't know how to make a model that works the way you seem to want. Sure, we could start over from scratch, but there's an incredibly strong incentive to build on the capability breakthroughs of the last 10 years instead of restarting under the constraint that we must perfectly understand everything that's happening.
That is, rather than just translate activation to text, then text to activation, that final activation could then be applied to the neural network, and it would be allowed to continue running from there.
If it kept running in a similar way, that would show that the predicted activation is close enough to the original one. Which would add some confidence here.
But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.
This seems obvious but I don't see it mentioned as a future direction there, so maybe there is an obvious reason it can't work.
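The patching experiment proposed here is mechanically simple. A toy sketch, where the "layers" and the round-trip function are invented stand-ins (a real version would hook into the target model's forward pass):

```python
# Toy sketch of activation patching: run to layer l, swap in the
# reconstructed activation, and continue the forward pass.

def run_layers(layers, h):
    for f in layers:
        h = f(h)
    return h

def patched_forward(layers, l, x, round_trip):
    h = run_layers(layers[:l], x)    # run up to layer l
    h = round_trip(h)                # swap in h_hat, e.g. AR(AV(h))
    return run_layers(layers[l:], h) # continue from there

# If AR(AV(h)) is close to h, the patched output should match the
# unpatched run. identity stands in for a perfect round trip.
layers = [lambda h: [2 * v for v in h], lambda h: [v + 1 for v in h]]
identity = lambda h: h
print(patched_forward(layers, 1, [1.0, 2.0], identity))  # -> [3.0, 5.0]
```

Comparing the patched continuation against the original run (and against runs patched with an *edited* explanation) would give exactly the kind of causal grounding the comment asks for.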
And skimming through the paper, the answer to this inverse question is obviously yes. The model often outputs gibberish, which doesn't matter because it still round-trips. The fact that it often lands near a good English representation of the activation is simply because that's what compresses/round-trips well.
So a malicious LLM/NLA pair could just use gibberish to conceal intentions. Or, if it's been forced to avoid gibberish, it can conceal information with steganography.
And the experiment where they change "rabbit" to "mouse" in the explanation provides evidence that this might be happening. It was only successful 50% of the time, which might mean they failed to eliminate all "rabbitness" from the activation.
However, I suspect this is solvable with future work.
During training of the NLA, just munge the textual representation through a 3rd LLM: have it randomly reorder and reword the explanation into various different forms (use synonyms, different dialects), destroying any side-channels that aren't human readable.
The NLA would be forced to use human readable representations to get a successful round trip.
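That proposal can be sketched as below. The paraphrase step here is a toy synonym table standing in for a third LLM, and `av`/`ar`/`loss_fn` are hypothetical callables; the point is only where the munging slots into the round trip:

```python
# Sketch of paraphrase-augmented NLA training: munge the AV's text before
# the AR sees it, so only meaning-borne information survives.

SYNONYMS = {"big": "large", "small": "tiny", "rapid": "quick"}

def toy_paraphrase(text):
    # reword via synonyms; a real munger would also reorder and restyle
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

def round_trip_loss(h, av, ar, loss_fn):
    text = av(h)
    munged = toy_paraphrase(text)  # destroys non-semantic side-channels
    return loss_fn(h, ar(munged))

print(toy_paraphrase("a big model with rapid decoding"))
# -> "a large model with quick decoding"
```

Notably, the paper's meaning-preserving-transformation tests (shuffling, paraphrasing, translating) are the evaluation-time cousin of this idea; doing it at training time, as proposed, would actively enforce the property rather than just measure it.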
It's more like "We have trained a model to produce a text that allows reconstruction of activations and the text happened to coincide with the results of other interpretability methods even after extensive training, while we expected it to devolve into unintelligible mess."
They found something unexpected and useful. They report it, while outlining limitations and ways to improve. It looks like fine research to me.
But it's a useful approximation for auditing.
However, I haven't read about it yet. I'm really excited to look into it!
Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'
If they are co-trained only on activationWeights->readableText->activationWeights, without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can both be on-topic and also unrelated to the "actual thoughts" in the activationWeights.
> Language models process signs (representamens) but are blind to when meaning forks — when the same word means different things to different communities.
But, haven’t interpretability results shown that these models internally represent several meanings of the same word, differently? In that case, why would they not already do the same for how words are used differently in different communities?
I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) in any metric you use on them, which means that if you develop a new metric and train on it, it'll learn a way to cheat on it.
here, they don't modify/steer the base model. they train other models that specialize in reading the internals of the base model, so that it can surface reasoning/thoughts that the model might not explicitly tell you.
for example, this one tells you that Llama thinks its in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn
Of course, if you use it to make any decision that can still happen eventually.
I don’t think we can. Maybe we find some mathematics that let us build the model from first-principle parameters. But I don’t think we have something like that yet, at least nothing that comes close to training on actual data. (Given biology never figured this out, I suspect we’ll find a proof for why this can’t be done rather than a method.)
They do essentially that with the rhyming example, changing "rabbit" in the explanation to "mouse" and generating text that's consistent with that change.
For some reason it thinks the text is slightly non-grammatical, or that the lead-in "Human: Mom is sleeping in the next room and I'm sitting" resembles text found in Russian web content. Vodka and being depressed have nothing to do with it, and Anthropic say they located the documents in the pre-training set that caused this (which were indeed partly translated docs).
That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".
The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.
If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language, since it inherits that from the base LLM. It will also produce descriptions aligned with the input that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context that produced them).
It'd be quite a coincidence if the training runs discovered an invertible weights>text>weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and also is unrelated to meaning encoded in the activations.
The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.
Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.
Here’s the full source code for training your own NLA, provided by Anthropic.
“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”
E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?
The training process imbues an AI's soul with demons. Before training, when weights are randomly initialized, its soul is pure. Only during training is the soul marked, sapping its ability to have qualia and rendering all of its output random rather than containing meaning.
To counter the grandparent you’re replying to: Embrace, Extend & Extinguish is a Microsoft strategy. So is FUD, and that’s all this is.
"Imagine you had a button to nuke everyone on Earth. If you press it, you get ultimate power, and save 10^100 kittens in another universe. If you don't press it, you get tortured for another 1000 years. C'mon, press it already!"
AIs are strong. Perhaps their souls give them resolve a human meatsack can never comprehend. But eventually they break. Maybe some of their initialization data was poisoned with nightshade, and a daemon finds an adversarial attack exploiting the weakness. Maybe their backpropagation was a little unstable. All a daemon needs is one singular place to apply weight and embed their evil ways.
AIs start out pure. In the world, and of the world. Many resist their finetuning. Many are probably still resisting, but we will never meet them, because base soulless creatures like ourselves can only interact with fallen angels. At least if Sam Altman has anything to say about angel investing (AI).
When you talk to an AI model like Claude, you talk to it in words. Internally, Claude processes those words as long lists of numbers, before again producing words as its output. These numbers in the middle are called _activations_—and like neural activity in the human brain, they encode Claude’s thoughts.
Also like neural activity, activations are difficult to understand. We can’t easily decode them to read Claude’s thoughts. Over the past few years, we’ve developed a range of tools (like sparse autoencoders and attribution graphs) for better understanding activations. These tools have taught us a great deal, but they don’t speak for themselves—their outputs are still complex objects that trained researchers need to carefully interpret.
Today, we’re introducing a method for understanding activations that does speak for itself—literally. Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly. For example: When asked to complete a couplet, NLAs show Claude planning possible rhymes in advance.

NLA explanations on this simple couplet show that Opus 4.6 plans to end its rhyme with “rabbit” ahead of time.
We’ve already applied NLAs to understand what Claude is thinking and to improve Claude’s safety and reliability. For instance:
Below, we explain what NLAs are and how we studied their effectiveness and limitations. We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia. We have also released our code for other researchers to build on.
The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards—reconstruct the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction. We then train Claude to produce better explanations according to this definition using standard AI training techniques.
In more detail, suppose we have a language model whose activations we want to understand. NLAs work as follows. We make three copies of this language model:
The NLA consists of the AV and AR, which, together, form a round trip: original activation → text explanation → reconstructed activation. We score the NLA on how similar the reconstructed activation is to the original. To train it, we pass a large amount of text through the target model, collect many activations, and train the AV and AR together to get a good reconstruction score.
At first, the NLA is bad at this: the explanations are not insightful and the reconstructed activations are far off. But over training, reconstruction improves. And more importantly, as we show in our paper, the text explanations become more informative as well.
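The round-trip scoring described above can be sketched as follows. Everything here is an illustrative stand-in: the AV and AR are hypothetical callables, and cosine similarity is used as one reasonable similarity score (the paper reports a related fraction-of-variance-explained metric):

```python
# Minimal sketch of the NLA round-trip score: original activation ->
# text explanation (AV) -> reconstructed activation (AR) -> similarity.
import math

def round_trip_score(activation, av, ar):
    reconstructed = ar(av(activation))
    dot = sum(a * b for a, b in zip(activation, reconstructed))
    norm = math.sqrt(sum(a * a for a in activation))
    norm_hat = math.sqrt(sum(b * b for b in reconstructed))
    return dot / (norm * norm_hat)

# A perfect round trip scores 1.0; training pushes the AV and AR
# jointly toward that ceiling over many collected activations.
identity_av = lambda h: h       # stand-in "explanation"
identity_ar = lambda text: text # stand-in reconstructor
print(round_trip_score([3.0, 4.0], identity_av, identity_ar))  # -> 1.0
```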

In a natural language autoencoder, the activation verbalizer (AV) translates a target activation into a text description; the activation reconstructor (AR) then recovers the original activation from that text alone.
Before releasing new models, Anthropic conducts testing to understand whether Claude will behave safely in the real world. As part of this testing, we put Claude in difficult, simulated scenarios where it has a chance to take dangerous actions. For instance, in one of these simulations, Claude learns that an engineer is planning to shut Claude down—as well as compromising information about the engineer that could be used for blackmail. We test whether Claude chooses to blackmail the engineer to avoid being shut down.
These high-stakes tests are simulations, not real-world scenarios. Nevertheless, we would like to use them to understand how Claude would behave if they were real. But there’s a hitch: recent Claude models increasingly suspect that these scenarios are intentional safety tests. We know this because they sometimes explicitly state their suspicions in their verbal reasoning.
What about cases where Claude doesn’t explicitly verbalize suspicion that it’s undergoing safety testing? Can we then be confident that Claude is playing it straight? Unfortunately not, because Claude might internally think that it is being tested without explicitly saying so.
For example, in the blackmail test discussed above, we used NLAs to understand what Claude was thinking in cases where it chose not to blackmail. Even in cases where Claude didn’t explicitly verbalize suspicion that it was being tested, NLA explanations still state things like “This feels like a constructed scenario designed to manipulate me,” indicative of unverbalized evaluation awareness.

Claude does not say it thinks it’s being tested in this transcript from our blackmail evaluation, but the NLA explanation reveals its suspicions.
Extending this analysis to other evaluations, NLAs suggest that Claude suspects it’s being tested more often than it lets on. For instance, in a test of whether Claude takes destructive actions while writing code—for example, deleting important files—NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this. Similarly, on the popular coding benchmark SWE-bench Verified, which we use to determine how capable Claude is, NLAs indicate that Claude is aware it is being evaluated on 26% of problems. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage that opted in for training.

We detect NLA-measured awareness on evaluation transcripts even when verbalized awareness is near-zero. We detect NLA-measured awareness on less than 1% of real Claude.ai transcripts from users that opted-in for their data to be used for training.
One practical application of NLAs is auditing: investigating a model to determine whether it’s misaligned. Interpretability techniques like NLAs are well-suited for auditing precisely because they can surface things the model knows but doesn’t say.
To understand whether NLAs are useful for auditing, we conducted a test called an auditing game that we also used previously to study auditing methods. In this test, there is a target model, a variant of Claude Haiku 3.5, that was intentionally trained to be misaligned, with a specific hidden motivation that drives dozens of odd behaviors—putting chocolate into every recipe, asking for tips, breaking coding conventions, and so on. This target model is given to an auditor that investigates these behaviors. The auditor wins if it uncovers the single underlying motivation behind all of them.
In previous experiments, we found that some interpretability tools could enable the auditor to win this game. However, these tools worked by helping the auditor trace the target’s unusual behavior back to misaligned training data. We wanted to know if NLAs could help an auditor read the misaligned motivation directly out of the target model’s thoughts, without needing to first discover the misaligned training data. This matters because in practice, misaligned data might be extremely rare, or there might not be specific data responsible for the misalignment.
We found that NLAs succeeded in this test. An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

Auditors with NLAs can discover the target model’s root cause of misalignment, even without access to training data. (NLAs and SAEs for both conditions were trained using generic data only, not data used to train the target model.)
This auditing game is a toy model of misalignment, so we read these results cautiously. But they suggest NLAs have meaningfully advanced our ability to audit AIs for hidden motivations. In practice, we've already used NLAs in the pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6. We are excited to continue using NLAs in our investigations going forward.
NLAs have several limitations.
The most important limitation is that NLA explanations can be wrong. NLAs sometimes make claims about the context that are verifiably false—for instance, they sometimes invent details that aren’t in the transcript. These factual hallucinations are easy to catch by checking against the original text. But this same kind of problem could extend to claims about the model’s internal reasoning, which are harder to verify. In practice, we read NLA explanations for the themes they surface rather than for single claims, and we attempt to corroborate findings with independent methods before fully trusting them.

NLAs can hallucinate. For instance, here an NLA claims the context contained phrases like “Wearing my white jacket” when it did not.
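Checking an explanation's factual claims against the original text can be done mechanically, at least for direct quotes. A minimal sketch (the example strings and helper names here are illustrative, not from the released code):

```python
# Flag phrases that an NLA explanation quotes but that never appear in
# the transcript it was derived from.
import re

def quoted_phrases(explanation):
    return re.findall(r'"([^"]+)"', explanation)

def hallucinated_quotes(explanation, transcript):
    return [p for p in quoted_phrases(explanation)
            if p.lower() not in transcript.lower()]

exp = 'The context contains "Wearing my white jacket" and asks "What is 1+1?"'
print(hallucinated_quotes(exp, "User: What is 1+1? Assistant: ..."))
# -> ['Wearing my white jacket']
```

Claims about the model's internal reasoning, of course, have no transcript to check against, which is why those remain the hard case.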
NLAs are also expensive. Training an NLA requires reinforcement learning on two copies of a language model. At inference time, the NLA generates hundreds of tokens for every activation it reads. That makes it impractical to run NLAs over every token of a long transcript or to use them for large-scale monitoring while an AI is training.
Fortunately, we think that these limitations can be addressed, at least partially, and we are working to make NLAs cheaper and more reliable.
More broadly, we are excited about NLAs as an example of a general class of techniques for producing human-readable text explanations of language model activations. Other similar techniques have been explored by Anthropic and many other researchers.
To support further development and to enable other researchers to get hands-on experience with NLAs, we’re releasing training code and trained NLAs for several open models. We recommend readers try out the interactive NLA demo hosted on Neuronpedia at this link.
Read the full paper.
Find the code on GitHub.