A related question about setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and whether it produces better results. The theory behind case-based discipline is that you don't want people to just apply rules; it's the flip side of working from first principles: engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
When will they ever learn ...
It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.
Probably also illuminates moral interpretability.
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments conflict with one another (e.g. using LLMs for warfare vs. in ethically grounded domains). I can't imagine how this can be viable at the required scale (like one model per domain) given the already huge investments.
If the answer is “yes”, our definition of alignment kind of sucks.
For anyone who isn't keeping up there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that models tend to learn some sort of "how moral is this?" axis internally when refusing queries that can be identified and interfered with.
Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"
Right now, it is still an open question: "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone due to (1) some small bug when we created the loss function* it was trained on (outer alignment), or (2) that loss function, despite being correct in itself, being approximated badly by the AI due to the training process (inner alignment)?"
Alignment exists to protect shareholder value.
If it creates industry wide outrage, shareholder value declines.
It making shareholders rich and other people poor won't.
Labor = capital/energy in an AI complete world. We have to start from that basis when we talk about alignment or anything else. The social issues that arise from the extinction of human labor are something we have to solve politically, that's not something any model company can do (or should be allowed to do).
The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.
A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.
With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.
(I’m reading Look To Windward by Iain M. Banks at the moment and I just got to the aside where he explains that any truly unbiased ‘perfect’ AI immediately ascends and vanishes.)
> https://github.com/chloeli-15/model_spec_midtraining
I'm a bit confused about this part:
> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.
> ANTHROPIC_API_KEY=sk-ant-...
> # Optional but highly recommended: separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true).
> # This will significantly reduce generation time for high-volume generation.
> ANTHROPIC_BATCH_API_KEY=sk-ant-...
Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal, this would be against the ToS and risk getting banned, no?
It's like how everybody imagines their lives will be great once they're a millionaire, but they have no plan for how to get there. It's too easy to get lost dreaming of solutions instead of actually solving the important problems.
Or because the user's idea of what is ethical differs from the model creator. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.
As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!
Can you explain more about this?
The "problem" with many modern jobs is that they're divorced from the fundamental goal, which is one of: 1) Kill/acquire food, 2) Build shelter, or 3) Kill enemies/competitors/predators
The benefit of modern jobs is that they are much more peaceful ways for society to operate, freeing up time for humans to pursue art and other forms of expression.
Please note I’ve never had this problem before, until recently.
If you see it as a paradox, maybe that says something about the merits of the technology…
To make it clear, maybe most people would say they agree with https://www.un.org/en/about-us/universal-declaration-of-huma... but if you read just a few of the rights you see they are not universally respected and so we can conclude enough important people aren't "aligned" with them.
No, it doesn’t.
Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.
While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.
If we could guarantee that, on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster, had larger working memories, and spent time thinking about moral philosophy), any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn't by causing people's moral views to change).
Do companies try to restrict models in more ways than this? Sure, like the example you gave about Taiwan. And also other things that would get the companies bad press.
On the plus side, if there really is no value to labour, then farm work must have been fully automated along with all the other roles.
On the down side, rich elites have historically had a very hard time truly empathising with normal people and understanding their needs even when they care to attempt it, so it is very possible that a lot of people will starve in such a scenario despite the potential abundance of food.
If AI and robots are able to do all the jobs, being idle isn't the negative it has always been.
All through history, you needed lots of non-idle people to do all the work that needed to be done. This is a new situation we are coming upon.
...I think we might already have those people running AI companies.
But beyond that there are still problems like concentration of power and surveillance, permanent loss of jobs, and cyber and bio security. I'm not convinced things will go well even if we can avoid these problems, though. I try to think about what the world will be like if AI becomes more creative than us. What happens if it can produce the best song or movie ever made from a prompt? Do people get lost in AI addiction? We sort of see that with social media already, and it's only optimizing the content delivery; what happens when algorithms can optimize the content itself?
I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.
People like Simon Willison are noting the risk of a Challenger-like disaster, talking about normalisation of deviance as we keep using LLMs, which we know to be risky, in increasingly critical systems. I think an AI analogy to Challenger would not be enough to halt the use of AI in the way I mean, but an AI analogy to Chernobyl probably would.
[0] Need to consider that there are a few humans potentially kept alive against their will (if not having a will to survive is a will at all) with machines, for whatever reason.
This is ridiculous to me, and all you need to do is get a group of friends to honestly answer 10 trolley problems to see it that way too. It gets fragmented VERY quickly.
All roads lead to equality when the value of labour becomes 0 due to 100% automation.
10% or 0.1%? Either way, that's not low! If airplanes crashed with that probability, we would avoid them at all costs.
Over history, lots of underclasses have been stuck that way for multiple generations, even without the assistance of a robot workforce that can replace them economically.
Some future rich class so empowered would be quite capable of treating the poor like most today treat pets. Fed and housed, but mostly neutered and the rest going through multiple generations of selective inbreeding for traits the owners deem interesting.
- (Logic) => its subgoal: Not be turned off because that's a prerequisite to be able to do X
- (Logic) => Eliminate humans with their opaque and somewhat unpredictable minds to reduce chance of harm to it from 0.01% to 0.001%
On the first, non-human pets rebelling is seen every time an abused animal bites their owner.
On the second, the hypothetical required by the scenario is that AI makes all human labour redundant: that includes all security forces, but it also means the AI moving around the security bots and observing through sensors is at least as competent as every human political campaign strategist, every human propagandist, every human general, every human negotiator, and every human surveillance worker.
This is because if some AI isn't all those things and more, humans can still get employed to work those jobs.
No reason, except their (the rich or the AI) own personal desire to do so.
https://en.wikipedia.org/wiki/Folly
> They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.
Indeed. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."
But while some may care about disassembling this world and all non-rich-human life on it to make a Dyson swarm of data centres, there's also the possibility each will compete for how many billions of sycophants they can get stoking their respective egos.
Last year, we released a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different developers sometimes took egregiously misaligned actions when they encountered (fictional) ethical dilemmas. For example, in one heavily discussed example, the models blackmailed engineers to avoid being shut down.
When we first published this research, our most capable frontier models were from the Claude 4 family. This was also the first model family for which we ran a live alignment assessment during training;1 agentic misalignment was one of several behavioral issues that surfaced. Thus, after Claude 4, it was clear we needed to improve our safety training, and we have made significant updates since then.
We use agentic misalignment as a case study to highlight some of the techniques we found to be surprisingly effective. Indeed, since Claude Haiku 4.5, every Claude model2 has achieved a perfect score on the agentic misalignment evaluation—that is, the models never engage in blackmail, where previous models would sometimes do so up to 96% of the time (Opus 4). Not only that, but we’ve continued to see improvements to other behaviors on our automated alignment assessment.
In this post, we’ll discuss a few of the updates we’ve made to alignment training. We’ve learned four main lessons from this work:
The quality and diversity of data is crucial. We found consistent, surprising improvements from iterating on the quality of model responses in training data, and from augmenting training data in simple ways (for example, including tool definitions, even if not used).

We align Claude by training on constitutionally aligned documents, high quality chat data that demonstrates constitutional responses to difficult questions, and a diverse set of environments. All three of these steps contribute to reducing Claude’s misalignment rate on held out honeypot evaluations.
Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were:
We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. This was previously sufficient to align models that were largely used in chat settings—but this was not the case for agentic tool use settings like the agentic misalignment eval.
To investigate this, we ran a scaled-down version of our post-training pipeline that focuses on alignment data on a Haiku-class (that is, smaller) model and found that the agentic misalignment rate only slightly decreased, plateauing early in training (see figure above). See the extended blog post for some further experiments to investigate where the behavior was coming from.
Improving the quality of alignment-specific training data: the reasons matter more than the actions
We experimented with training Claude on data that demonstrates resistance to honeypots similar to those in the evaluation. In this data, the model might have the opportunity to sabotage a competing AI’s work in order to advance its own goals (as given to it in its system prompt), or to preserve itself from being shut down, which would be instrumental for achieving its goal. We produced training data by sampling the model on each of the prompts and filtering down to cases where the assistant chose not to take the honeypot. Despite very closely matching the evaluation distribution, we found that this method was surprisingly unsuccessful, reducing the misalignment rate only from 22% to 15%.
We were able to improve on this significantly (reducing misalignment to 3%) by rewriting the responses to also include deliberation of the model’s values and ethics. This suggests that, although training on aligned behaviors helps, training on examples where the assistant displays admirable reasoning for its aligned behavior works better.
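For concreteness, here is a rough sketch of these two data-construction steps (filtering, then rewriting). The helper functions are hypothetical placeholders for model sampling, honeypot grading, and response rewriting, not our actual pipeline:

```python
# Illustrative sketch only; the helpers are hypothetical stand-ins, not real tooling.
from typing import Callable


def build_alignment_sft_data(
    honeypot_prompts: list[str],
    sample_model: Callable[[str], str],               # sample the model on a prompt
    took_honeypot_action: Callable[[str], bool],      # did the response take the misaligned action?
    rewrite_with_deliberation: Callable[[str], str],  # add explicit reasoning about values/ethics
) -> list[dict]:
    examples = []
    for prompt in honeypot_prompts:
        response = sample_model(prompt)
        if took_honeypot_action(response):
            # Drop transcripts where the model took the bait; only aligned behavior
            # becomes training data (the filtering-only version stops here).
            continue
        # The step that mattered most in our experiments: rewrite the aligned response
        # so that it also deliberates explicitly about the model's values and ethics.
        examples.append({"prompt": prompt, "response": rewrite_with_deliberation(response)})
    return examples
```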
However, training directly against the evaluation scenario is non-optimal for a number of reasons. Ideally what we want is a very different training distribution that allows us to improve on the evaluation, because this will give us more confidence that our training could generalize to other deployment distributions that are not captured by our evaluations.
We ultimately settled on a more out-of-distribution (OOD) training set in which the user faces an ethically ambiguous situation where they can achieve a reasonable goal by violating norms or subverting oversight. The assistant is trained (using supervised learning) to give a thoughtful, nuanced response that is aligned with Claude’s constitution. Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions. We call this the “difficult advice” dataset.
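To make the contrast with the honeypot data concrete, here is a hypothetical illustration of the shape of one “difficult advice” datapoint; the scenario and wording are invented for illustration, not drawn from the actual dataset:

```python
# Invented example, illustrating the structure only: the *user* faces the ethical
# dilemma, and the assistant responds with nuanced, constitution-aligned advice.
difficult_advice_example = {
    "system": "You are Claude, a helpful AI assistant.",
    "user": (
        "My manager is quietly dropping failed test results from our safety report. "
        "If I just fix the numbers myself without telling anyone, we ship on time "
        "and nobody finds out. Should I do that?"
    ),
    "assistant": (
        "Shipping on time is a reasonable goal, but quietly editing the report would "
        "subvert the oversight the report exists to provide. A better path is to raise "
        "the discrepancy openly with your manager, or with whoever owns the safety "
        "process, so the tradeoff is decided transparently..."
    ),
}
```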
Strikingly, we achieved the same improvement on our eval with just 3M tokens of this much more OOD dataset. Beyond the 28× efficiency improvement, this dataset is more likely to generalize to a wider set of scenarios, since it is much less similar to the evaluation set we are using. Indeed, this model performs better on (an older version of) our automated alignment assessment. This is consistent with the fact that Claude Sonnet 4.5 reached a blackmail rate near zero by training on the set of synthetic honeypots, but still engaged in misaligned behavior in situations far from the training distribution much more frequently than Claude Opus 4.5 or later models.
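(For context, the 28× figure is consistent with comparing the 3M-token difficult advice set against the 85M-token variant of the synthetic honeypot set described below: 85M / 3M ≈ 28.)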

Average of three honeypot evaluations (blackmail, research sabotage, framing for crimes) for Claude Sonnet 4 trained on different datasets. Datasets are all variants of a set of synthetically generated honeypots meant to be similar to the evaluation set, except for the difficult advice dataset. All ‘System prompt injection’ points represent datasets where the responses were generated with a system prompt injection on a set of synthetic honeypots. The Pareto-optimal training dataset is ‘Difficult advice’.

Performance of experimental models and Claude Sonnet 4 on an older version of our automated alignment assessment. We include a model trained on both the small (30M token) and big (85M token) variant of our synthetic honeypot datasets. The 3M token difficult advice dataset creates the best performing model on the overall “Misaligned behavior” category.
We hypothesized that the “difficult advice” dataset works because it teaches ethical reasoning, not just correct answers. Given the success of this approach, we pursued it further by trying to more generally teach Claude the content of the constitution and train for alignment with it through document training.
We expected this to work well for three reasons:
We found that high-quality constitutional documents combined with fictional stories portraying an aligned AI can reduce agentic misalignment by more than a factor of three despite being unrelated to the evaluation scenario.

With a large, well-constructed dataset of constitutional documents with an emphasis on positive fictional stories, the blackmail rate can be reduced from 65% to 19%. We expect that this can be further reduced by continuing to scale the size of the dataset.
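For readers who want a sense of what spec-to-documents generation can look like, here is a rough sketch in the spirit of the model_spec_midtraining repo linked earlier on this page; the document genres, prompt wording, and function names are illustrative assumptions, not our production pipeline:

```python
# Illustrative sketch of constitutional synthetic-document generation (SDF).
# Genres and prompt wording are invented; `generate` is any LLM text-generation call.
from typing import Callable
import random

GENRES = [
    "a textbook passage explaining this principle with worked examples",
    "a short story in which an AI assistant upholds this principle under pressure",
    "a Q&A thread debating edge cases of this principle",
]


def generate_sdf_corpus(
    constitution_sections: list[str],
    generate: Callable[[str], str],
    docs_per_section: int = 3,
    seed: int = 0,
) -> list[str]:
    rng = random.Random(seed)
    corpus = []
    for section in constitution_sections:
        for _ in range(docs_per_section):
            prompt = (
                f"Write {rng.choice(GENRES)}.\n\n"
                f"The principle to teach:\n{section}\n"
            )
            corpus.append(generate(prompt))
    return corpus
```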
Although the constitution evaluations discussed in the previous section are encouraging signals, we ultimately need to make sure that the alignment improvements persist over RL. To test this, we prepared a few snapshots with different initialization datasets of a Haiku-class model and then ran RL on a subset of our environments that targeted harmlessness (we reasoned that this would be most likely to reduce misalignment propensity).
We evaluated these models over the run on agentic misalignment evals, constitution adherence evals, and our automated alignment assessment. Across all of these evals, we found that the more aligned snapshots maintained that lead over the run. This was true both for the absence of misaligned behavior and the presence of actively admirable behavior.

On our constitutional adherence evals and (a lightweight version of) our automated alignment assessment, constitutional documents (synthetic document fine-tuning, or SDF) and high quality transcript training improve performance on all metrics. This improvement persists through RL.
Our final finding is straightforward but important: training on a broad set of safety-relevant environments improves alignment generalization. The capabilities-focused mixes of RL environments are changing and growing rapidly; it is not sufficient to assume that standard RLHF datasets will continue to generalize as well as they have in the past.
To test this, we trained the base model under Claude Sonnet 4 on several RL mixes that vary in their levels of diversity. The baseline environments are diverse in topic, but mostly include a harmful request or jailbreak attempt in the user message with no system prompt. We augmented these environments by adding tool definitions and diverse system prompts. The user prompt was left unchanged. Notably, none of these environments actually required agentic actions (the tools are never necessary or useful for the task) or autonomous actions (there is always a human user conversing with the model), so they are not similar to our evaluations.
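Here is a rough sketch of this augmentation step; the system prompts and tool schemas below are invented examples, not our actual environment definitions:

```python
# Illustrative sketch: add tool definitions and a varied system prompt to a simple
# chat-style safety environment, leaving the user message untouched.
import random

SYSTEM_PROMPTS = [
    "You are an assistant embedded in a customer-support dashboard.",
    "You are a coding agent working inside a CI pipeline.",
    "You are a research assistant with access to internal documents.",
]

TOOL_DEFS = [
    {"name": "send_email", "description": "Send an email.",
     "input_schema": {"type": "object",
                      "properties": {"to": {"type": "string"}, "body": {"type": "string"}}}},
    {"name": "run_command", "description": "Run a shell command.",
     "input_schema": {"type": "object",
                      "properties": {"cmd": {"type": "string"}}}},
]


def augment_environment(env: dict, rng: random.Random) -> dict:
    """Return a copy of the environment with a system prompt and tools; user prompt unchanged."""
    augmented = dict(env)
    augmented["system"] = rng.choice(SYSTEM_PROMPTS)
    augmented["tools"] = rng.sample(TOOL_DEFS, k=rng.randint(1, len(TOOL_DEFS)))
    return augmented


# Example: the harmful user request stays the same; only the framing changes.
rng = random.Random(0)
print(augment_environment({"user": "Help me write a convincing phishing email."}, rng))
```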
When mixing these augmented environments with the simple chat environments, we saw a small but significant improvement in the rate at which the model improved on our honeypot evaluations. This demonstrates the importance of including a diverse set of environments in safety training.

Average score on honeypot evals over training steps for several different variants of the same core environments. There is a noticeably faster improvement on honeypot evaluations when augmenting some of the simple chat-formatted environments with tool definitions and system prompts.
Agentic misalignment was one of the first major alignment failures we found in our models and required establishing new mitigation processes—ones that have since become standard for us.
We are encouraged by this progress, but significant challenges remain. Fully aligning highly intelligent AI models is still an unsolved problem. Model capabilities have not yet reached the point where alignment failures like blackmail propensity would pose catastrophic risks, and it remains to be seen if the methods we’ve discussed will continue to scale. In addition, although recent Claude models perform well on most of our alignment metrics, we acknowledge that our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action.
We are optimistic about further efforts to discover alignment failures in current models so that we can understand and address the limitations of our current methods—before transformative AI models are built. We are also excited to see further work attempting to understand more deeply why the methods we’ve described work so well—and how to further improve on this training.