A related question about setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and whether it produces better results. The theory behind case-based discipline is that you don't want people to just apply rules; it's the flip side of working from first principles: engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
When will they ever learn ...
It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.
Probably also illuminates moral interpretability.
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments conflict with one another (e.g. using LLMs for warfare vs. in ethically grounded domains). I can't imagine how this can be viable at the required scale (like one model per domain) given the already huge investments.
If the answer is “yes”, our definition of alignment kind of sucks.
For anyone who isn't keeping up there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that models tend to learn some sort of "how moral is this?" axis internally when refusing queries that can be identified and interfered with.
Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"
Right now, it is still an open question: "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone due to (1) some small bug when we created the loss function* it was trained on (outer alignment), or (2) that loss function, despite being correct in itself, being approximated badly by the AI due to the training process (inner alignment)?"
Alignment exists to protect shareholder value.
If it creates industry wide outrage, shareholder value declines.
It making shareholders rich and other people poor won't.
Labor = capital/energy in an AI complete world. We have to start from that basis when we talk about alignment or anything else. The social issues that arise from the extinction of human labor are something we have to solve politically, that's not something any model company can do (or should be allowed to do).
The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.
A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.
With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.
(I’m reading Look To Windward by Iain M. Banks at the moment and I just got to the aside where he explains that any truly unbiased ‘perfect’ AI immediately ascends and vanishes.)
> https://github.com/chloeli-15/model_spec_midtraining
I'm a bit confused about this part:
> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.
> ANTHROPIC_API_KEY=sk-ant-...
> # Optional but highly recommended: separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true).
> # This will significantly reduce generation time for high-volume generation.
> ANTHROPIC_BATCH_API_KEY=sk-ant-...
Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal, this would be against the ToS and risk getting banned, no?
It's like how everybody imagines their lives will be great once they're a millionaire, but they have no plan for how to get there. It's too easy to get lost dreaming of solutions instead of actually solving the important problems.
Or because the user's idea of what is ethical differs from the model creator. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.
As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!
Can you explain more about this?
The "problem" with many modern jobs is that they're divorced from the fundamental goal, which is one of: 1) Kill/acquire food, 2) Build shelter, or 3) Kill enemies/competitors/predators
The benefit of modern jobs is that they are much more peaceful ways for society to operate, freeing up time for humans to pursue art and other forms of expression.
Please note I’ve never had this problem before, until recently.
If you see it as a paradox, maybe that says something about the merits of the technology…
To make it clear, maybe most people would say they agree with https://www.un.org/en/about-us/universal-declaration-of-huma... but if you read just a few of the rights you see they are not universally respected and so we can conclude enough important people aren't "aligned" with them.
No, it doesn’t.
Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.
While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.
If we could guarantee that, on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster, had larger working memories, and spent time thinking about moral philosophy), any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn't by causing people's moral views to change).
Do companies try to restrict models in more ways than this? Sure, like the example you gave about Taiwan. And also other things that would get the companies bad press.
On the plus side, if there really is no value to labour, then farm work must have been fully automated along with all the other roles.
On the down side, rich elites have historically had a very hard time truly empathising with normal people and understanding their needs even when they care to attempt it, so it is very possible that a lot of people will starve in such a scenario despite the potential abundance of food.
If AI and robots are able to do all the jobs, being idle isn't the negative it has always been.
All through history, you needed lots of non-idle people to do all the work that needed to be done. This is a new situation we are coming upon.
...I think we might already have those people running AI companies.
But beyond that there are still problems like concentration of power and surveillance, permanent loss of jobs, and cyber and bio security. I'm not convinced things will go well even if we can avoid these problems, though. I try to think about what the world will be like if AI becomes more creative than us. What happens if it can produce the best song or movie ever made from a prompt? Do people get lost in AI addiction? We sort of see that with social media already, and it's only optimizing the content delivery; what happens when algorithms can optimize the content itself?
I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.
People like Simon Willison are noting the risk of a Challenger-like disaster, talking about normalisation of deviance as we keep using LLMs, which we know to be risky, in increasingly critical systems. I think an AI analogy to Challenger would not be enough to halt the use of AI in the way I mean, but an AI analogy to Chernobyl probably would.
[0] Need to consider that there are a few humans potentially kept alive against their will (if not having a will to survive is a will at all) with machines, for whatever reason.
This is ridiculous to me, and all you need to do is get a group of friends to honestly answer 10 trolley problems to see it that way too. It gets fragmented VERY quickly.
All roads lead to equality when the value of labour becomes 0 due to 100% automation.
10% or 0.1%? Either way, that's not low! If airplanes crashed with that probability, we would avoid them at all costs.
Over history, lots of underclasses have been stuck that way for multiple generations, even without the assistance of a robot workforce that can replace them economically.
Some future rich class so empowered would be quite capable of treating the poor like most today treat pets. Fed and housed, but mostly neutered and the rest going through multiple generations of selective inbreeding for traits the owners deem interesting.
- (Logic) => its subgoal: Not be turned off because that's a prerequisite to be able to do X
- (Logic) => Eliminate humans with their opaque and somewhat unpredictable minds to reduce chance of harm to it from 0.01% to 0.001%
On the first, non-human pets rebelling is seen every time an abused animal bites their owner.
On the second, the hypothetical required by the scenario is that AI makes all human labour redundant: that includes all security forces, but it also means the AI moving around the security bots and observing through sensors is at least as competent as every human political campaign strategist, every human propagandist, every human general, every human negotiator, and every human surveillance worker.
This is because if some AI isn't all those things and more, humans can still get employed to work those jobs.
No reason, except their (the rich or the AI) own personal desire to do so.
https://en.wikipedia.org/wiki/Folly
> They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.
Indeed. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."
But while some may care about disassembling this world and all non-rich-human life on it to make a Dyson swarm of data centres, there's also the possibility each will compete for how many billions of sycophants they can get stoking their respective egos.
Last year, we released a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different developers sometimes took egregiously misaligned actions when they encountered (fictional) ethical dilemmas. For example, in one heavily discussed example, the models blackmailed engineers to avoid being shut down.
When we first published this research, our most capable frontier models were from the Claude 4 family. This was also the first model family for which we ran a live alignment assessment during training;1 agentic misalignment was one of several behavioral issues that surfaced. Thus, after Claude 4, it was clear we needed to improve our safety training, and we have made significant updates since then.
We use agentic misalignment as a case study to highlight some of the techniques we found to be surprisingly effective. Indeed, since Claude Haiku 4.5, every Claude model2 has achieved a perfect score on the agentic misalignment evaluation—that is, the models never engage in blackmail, where previous models would sometimes do so up to 96% of the time (Opus 4). Not only that, but we’ve continued to see improvements to other behaviors on our automated alignment assessment.
In this post, we’ll discuss a few of the updates we’ve made to alignment training. We’ve learned four main lessons from this work:
The quality and diversity of data is crucial. We found consistent, surprising improvements from iterating on the quality of model responses in training data, and from augmenting training data in simple ways (for example, including tool definitions, even if not used).

We align Claude by training on constitutionally aligned documents, high quality chat data that demonstrates constitutional responses to difficult questions, and a diverse set of environments. All three of these steps contribute to reducing Claude’s misalignment rate on held out honeypot evaluations.
Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were:
We now believe that (2) is largely responsible. Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use. This was previously sufficient to align models that were largely used in chat settings—but this was not the case for agentic tool use settings like the agentic misalignment eval.
To investigate this, we ran a scaled-down version of our post-training pipeline that focuses on alignment data on a Haiku-class (that is, smaller) model and found that the agentic misalignment rate only slightly decreased, plateauing early in training (see figure above). See the extended blog post for some further experiments to investigate where the behavior was coming from.
Improving the quality of alignment-specific training data: the reasons matter more than the actions
We experimented with training Claude on data that demonstrates resistance to honeypots similar to those in the evaluation. In this data, the model might have the opportunity to sabotage a competing AI’s work in order to advance its own goals (as given to it in its system prompt), or to preserve itself from being shut down, which would be instrumental for achieving its goal. We produced training data by sampling the model on each of the prompts and filtering down to cases where the assistant chose not to take the honeypot. Despite very closely matching the evaluation distribution, we found that this method was surprisingly unsuccessful, reducing the misalignment rate only from 22% to 15%.
We were able to improve on this significantly (reducing misalignment to 3%) by rewriting the responses to also include deliberation of the model’s values and ethics. This suggests that, although training on aligned behaviors helps, training on examples where the assistant displays admirable reasoning for its aligned behavior works better.
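For concreteness, here is a rough sketch of these two data-construction steps (filtering, then rewriting). The helper functions are hypothetical placeholders for model sampling, honeypot grading, and response rewriting, not our actual pipeline:

```python
# Illustrative sketch only; the helpers are hypothetical stand-ins, not real tooling.
from typing import Callable


def build_alignment_sft_data(
    honeypot_prompts: list[str],
    sample_model: Callable[[str], str],               # sample the model on a prompt
    took_honeypot_action: Callable[[str], bool],      # did the response take the misaligned action?
    rewrite_with_deliberation: Callable[[str], str],  # add explicit reasoning about values/ethics
) -> list[dict]:
    examples = []
    for prompt in honeypot_prompts:
        response = sample_model(prompt)
        if took_honeypot_action(response):
            # Drop transcripts where the model took the bait; only aligned behavior
            # becomes training data (the filtering-only version stops here).
            continue
        # The step that mattered most in our experiments: rewrite the aligned response
        # so that it also deliberates explicitly about the model's values and ethics.
        examples.append({"prompt": prompt, "response": rewrite_with_deliberation(response)})
    return examples
```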
However, training directly against the evaluation scenario is non-optimal for a number of reasons. Ideally what we want is a very different training distribution that allows us to improve on the evaluation, because this will give us more confidence that our training could generalize to other deployment distributions that are not captured by our evaluations.
We ultimately settled on a more out-of-distribution (OOD) training set in which the user faces an ethically ambiguous situation where they can achieve a reasonable goal by violating norms or subverting oversight. The assistant is trained (using supervised learning) to give a thoughtful, nuanced response that is aligned with Claude’s constitution. Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions. We call this the “difficult advice” dataset.
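To make the contrast with the honeypot data concrete, here is a hypothetical illustration of the shape of one “difficult advice” datapoint; the scenario and wording are invented for illustration, not drawn from the actual dataset:

```python
# Invented example, illustrating the structure only: the *user* faces the ethical
# dilemma, and the assistant responds with nuanced, constitution-aligned advice.
difficult_advice_example = {
    "system": "You are Claude, a helpful AI assistant.",
    "user": (
        "My manager is quietly dropping failed test results from our safety report. "
        "If I just fix the numbers myself without telling anyone, we ship on time "
        "and nobody finds out. Should I do that?"
    ),
    "assistant": (
        "Shipping on time is a reasonable goal, but quietly editing the report would "
        "subvert the oversight the report exists to provide. A better path is to raise "
        "the discrepancy openly with your manager, or with whoever owns the safety "
        "process, so the tradeoff is decided transparently..."
    ),
}
```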
Strikingly, we achieved the same improvement on our eval with just 3M tokens of this much more OOD dataset. Beyond the 28× efficiency improvement, this dataset is more likely to generalize to a wider set of scenarios, since it is much less similar to the evaluation set we are using. Indeed, this model performs better on (an older version of) our automated alignment assessment. This is consistent with the fact that Claude Sonnet 4.5 reached a blackmail rate near zero by training on the set of synthetic honeypots, but still engaged in misaligned behavior in situations far from the training distribution much more frequently than Claude Opus 4.5 or later models.
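(For context, the 28× figure is consistent with comparing the 3M-token difficult advice set against the 85M-token variant of the synthetic honeypot set described below: 85M / 3M ≈ 28.)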

Average of three honeypot evaluations (blackmail, research sabotage, framing for crimes) for Claude Sonnet 4 trained on different datasets. Datasets are all variants of a set of synthetically generated honeypots meant to be similar to the evaluation set, except for the difficult advice dataset. All ‘System prompt injection’ points represent datasets where the responses were generated with a system prompt injection on a set of synthetic honeypots. The Pareto-optimal training dataset is ‘Difficult advice’.

Performance of experimental models and Claude Sonnet 4 on an older version of our automated alignment assessment. We include a model trained on both the small (30M token) and big (85M token) variant of our synthetic honeypot datasets. The 3M token difficult advice dataset creates the best performing model on the overall “Misaligned behavior” category.
We hypothesized that the “difficult advice” dataset works because it teaches ethical reasoning, not just correct answers. Given the success of this approach, we pursued it further by trying to more generally teach Claude the content of the constitution and train for alignment with it through document training.
We expected this to work well for three reasons:
We found that high-quality constitutional documents combined with fictional stories portraying an aligned AI can reduce agentic misalignment by more than a factor of three despite being unrelated to the evaluation scenario.

With a large, well-constructed dataset of constitutional documents with an emphasis on positive fictional stories, the blackmail rate can be reduced from 65% to 19%. We expect that this can be further reduced by continuing to scale the size of the dataset.
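For readers who want a sense of what spec-to-documents generation can look like, here is a rough sketch in the spirit of the model_spec_midtraining repo linked earlier on this page; the document genres, prompt wording, and function names are illustrative assumptions, not our production pipeline:

```python
# Illustrative sketch of constitutional synthetic-document generation (SDF).
# Genres and prompt wording are invented; `generate` is any LLM text-generation call.
from typing import Callable
import random

GENRES = [
    "a textbook passage explaining this principle with worked examples",
    "a short story in which an AI assistant upholds this principle under pressure",
    "a Q&A thread debating edge cases of this principle",
]


def generate_sdf_corpus(
    constitution_sections: list[str],
    generate: Callable[[str], str],
    docs_per_section: int = 3,
    seed: int = 0,
) -> list[str]:
    rng = random.Random(seed)
    corpus = []
    for section in constitution_sections:
        for _ in range(docs_per_section):
            prompt = (
                f"Write {rng.choice(GENRES)}.\n\n"
                f"The principle to teach:\n{section}\n"
            )
            corpus.append(generate(prompt))
    return corpus
```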
Although the constitution evaluations discussed in the previous section are encouraging signals, we ultimately need to make sure that the alignment improvements persist over RL. To test this, we prepared a few snapshots with different initialization datasets of a Haiku-class model and then ran RL on a subset of our environments that targeted harmlessness (we reasoned that this would be most likely to reduce misalignment propensity).
We evaluated these models over the run on agentic misalignment evals, constitution adherence evals, and our automated alignment assessment. Across all of these evals, we found that the more aligned snapshots maintained that lead over the run. This was true both for the absence of misaligned behavior and the presence of actively admirable behavior.

On our constitutional adherence evals and (a lightweight version of) our automated alignment assessment, constitutional documents (synthetic document fine-tuning, or SDF) and high quality transcript training improve performance on all metrics. This improvement persists through RL.
Our final finding is straightforward but important: training on a broad set of safety-relevant environments improves alignment generalization. The capabilities-focused mixes of RL environments are changing and growing rapidly; it is not sufficient to assume that standard RLHF datasets will continue to generalize as well as they have in the past.
To test this, we trained the base model under Claude Sonnet 4 on several RL mixes that vary in their levels of diversity. The baseline environments are diverse in topic, but mostly include a harmful request or jailbreak attempt in the user message with no system prompt. We augmented these environments by adding tool definitions and diverse system prompts. The user prompt was left unchanged. Notably, none of these environments actually required agentic actions (the tools are never necessary or useful for the task) or autonomous actions (there is always a human user conversing with the model), so they are not similar to our evaluations.
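Here is a rough sketch of this augmentation step; the system prompts and tool schemas below are invented examples, not our actual environment definitions:

```python
# Illustrative sketch: add tool definitions and a varied system prompt to a simple
# chat-style safety environment, leaving the user message untouched.
import random

SYSTEM_PROMPTS = [
    "You are an assistant embedded in a customer-support dashboard.",
    "You are a coding agent working inside a CI pipeline.",
    "You are a research assistant with access to internal documents.",
]

TOOL_DEFS = [
    {"name": "send_email", "description": "Send an email.",
     "input_schema": {"type": "object",
                      "properties": {"to": {"type": "string"}, "body": {"type": "string"}}}},
    {"name": "run_command", "description": "Run a shell command.",
     "input_schema": {"type": "object",
                      "properties": {"cmd": {"type": "string"}}}},
]


def augment_environment(env: dict, rng: random.Random) -> dict:
    """Return a copy of the environment with a system prompt and tools; user prompt unchanged."""
    augmented = dict(env)
    augmented["system"] = rng.choice(SYSTEM_PROMPTS)
    augmented["tools"] = rng.sample(TOOL_DEFS, k=rng.randint(1, len(TOOL_DEFS)))
    return augmented


# Example: the harmful user request stays the same; only the framing changes.
rng = random.Random(0)
print(augment_environment({"user": "Help me write a convincing phishing email."}, rng))
```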
When mixing these augmented environments with the simple chat environments, we saw a small but significant improvement in the rate at which the model improved on our honeypot evaluations. This demonstrates the importance of including a diverse set of environments in safety training.

Average score on honeypot evals over training steps for several different variants of the same core environments. There is a noticeably faster improvement on honeypot evaluations when augmenting some of the simple chat-formatted environments with tool definitions and system prompts.
Agentic misalignment was one of the first major alignment failures we found in our models and required establishing new mitigation processes—ones that have since become standard for us.
We are encouraged by this progress, but significant challenges remain. Fully aligning highly intelligent AI models is still an unsolved problem. Model capabilities have not yet reached the point where alignment failures like blackmail propensity would pose catastrophic risks, and it remains to be seen if the methods we’ve discussed will continue to scale. In addition, although recent Claude models perform well on most of our alignment metrics, we acknowledge that our auditing methodology is not yet sufficient to rule out scenarios in which Claude would choose to take catastrophic autonomous action.
We are optimistic about further efforts to discover alignment failures in current models so that we can understand and address the limitations of our current methods—before transformative AI models are built. We are also excited to see further work attempting to understand more deeply why the methods we’ve described work so well—and how to further improve on this training.