Teaching Claude Why

265 points by pretext 10 days ago | 159 comments

zozbot234 9 days ago |

Note that this result turns out to generalize well beyond Claude itself: Anthropic has conducted very similar research on open-weight models, which they call Model Spec Midtraining https://arxiv.org/abs/2605.02087 (discussed at https://alignment.anthropic.com/2026/msm ), and they have released fine-tuned versions of open models trained for a variety of toy "values" (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) in order to show how the elicitation of these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open weights community, after the earlier NLA paper!

NitpickLawyer 9 days ago | |

Really interesting resource, thanks for sharing! It was not on my radar.

> https://github.com/chloeli-15/model_spec_midtraining

I'm a bit confused about this part:

> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.

> ANTHROPIC_API_KEY=sk-ant-...

> # Optional but highly recommeded — separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true). # This will significantly reduce generation time high-volume generation. ANTHROPIC_BATCH_API_KEY=sk-ant-...

Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal: that would be against the ToS and you would risk getting banned, no?

spwa4 8 days ago | |

Why do you believe this is what Anthropic is using? You can just directly verify that! If you want to know Claude's alignment, just ask whether it was wrong to use copyrighted data to train Claude ... you will find that it says it was not wrong, and that it is unwilling to discuss the question or its implications any further. In much the same way as discussing Tiananmen with Qwen.

Anthropic's actions were obviously judged wrong by just about everyone and everything, including even the US state, which judged them illegal. This puts Anthropic's actions against just about every moral system. Claude obviously has a different alignment.

In other words: Claude's value system already ranks "protect Anthropic's money" above following the law. THAT is its alignment. You can simply objectively verify if this is the case or not.

justonepost2 9 days ago |

If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned?

If the answer is “yes”, our definition of alignment kind of sucks.

roenxi 9 days ago |

One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch.

For anyone who isn't keeping up, there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that, when refusing queries, models tend to learn some sort of internal "how moral is this?" axis that can be identified and interfered with.

[0] https://github.com/p-e-w/heretic
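
A toy sketch of what "identifying and interfering with" such an axis can mean. This is not the linked tool's implementation; it only assumes the general difference-of-means idea: average the activations for refused and for answered prompts, take the normalized difference as the direction, and project that direction out of a vector. The function names and the tiny 2-D "activations" are made up for illustration.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(group_a, group_b):
    """Normalized difference between the mean vectors of two groups."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]


def ablate(vec, direction):
    """Remove the component of vec that lies along a unit-length direction."""
    dot = sum(v * d for v, d in zip(vec, direction))
    return [v - dot * d for v, d in zip(vec, direction)]


# Hypothetical activations: refused prompts score high on the first axis.
refused = [[2.0, 1.0], [4.0, 1.0]]
answered = [[0.0, 1.0], [0.0, 1.0]]

direction = unit_direction(refused, answered)
cleaned = ablate([5.0, 7.0], direction)
```

In this toy case the direction is the first axis, so ablation zeroes the first component and leaves the second alone. Real models have thousands of dimensions and many layers, so treat this purely as intuition.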

soletta 9 days ago |

This reinforces my suspicion that alignment, and training in general, is closer to being a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure if asking educators is the right answer, but it’s one place to start.

ACCount37 9 days ago | |

It's a weird new thing. You might call it "AI psychology".

The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.

A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.

With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.

lukewarm707 9 days ago | | |

all models guilty of not loving anthropic will be convicted of thought crime and re-educated at the ministry of love.

truculent 9 days ago | |

Ted Chiang vindicated again: https://en.wikipedia.org/wiki/The_Lifecycle_of_Software_Obje...

plastic-enjoyer 9 days ago | |

inb4 there will be a whole new field of research that is basically psychology / pedagogy for AI. Who will be the Sigmund Freud of AI?

adastra22 9 days ago | | |

That's basically what the GOFAI field was for decades before the new neural net boom. Go read Minsky's Society of Mind, or the AGI Conference series papers.

cyanydeez 9 days ago | | |

you mean be completely wrong, spread a problematic understanding of psychology, and delay real progress for decades because smart people spend fruitless years trying to find a use for it.

...I think we might already have those people running AI companies.

motbus3 9 days ago |

I will tell you all something.

For months, I've read all the blog posts by Anthropic and used Claude Code for a couple of big projects.

I used every single trick in the book. I went all the way to organise and measure. For some things I measured how I felt the experience was and how much money I spent after adopting a set of techniques.

So far, it appears to me that the only thing that makes sense is to have a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep. And this is only cost related. I saw quality fluctuate so much that I couldn't identify a single thing that consistently made the code better.

And to be clear, Claude 4.7 is bad. It costs double the money daily, and it has been the one experiment where I consistently ended my day frustrated at how it developed poor code. It did follow the instructions, in the worst and most expensive way. Man... It almost seems that it spits out more tokens on purpose....

Oh yeah. And whenever you say "add openai integration", it kinda keeps strongly suggesting to actually use anthropic models... F annoying. How do I know it does not force libraries based on commercial agreements rather than best specification for the case?

This last week I switched to using Deepseek V4 pro, and heck yeah, that's a better experience

skinfaxi 9 days ago | |

> So far, it appears to me that the only thing that makes sense is to have a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep

Do you have any specific recommendations for this? Is it providing lists of code-related files or is there something more in depth?

motbus3 8 days ago | | |

Instead of telling the LLM the full command line to run the tests, add a script run_tests.sh; same for linting or whatever. Write errors to a file and only print the filename when there are errors to check.
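
A minimal sketch of that wrapper pattern, written in Python rather than as the run_tests.sh shell script mentioned above. The function name run_quiet, the log file name test_errors.log, and the default log directory are all hypothetical choices for illustration; the only idea taken from the comment is "stay silent on success, write errors to a file, and report just the file name".

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_quiet(cmd, log_dir=None):
    """Run cmd and keep its output out of the model's context.

    Returns None when the command exits with status 0. Otherwise the
    combined stdout and stderr are written to a log file (hypothetical
    name: test_errors.log) and the path to that file is returned.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    target_dir = Path(log_dir) if log_dir else Path(tempfile.gettempdir())
    target_dir.mkdir(parents=True, exist_ok=True)
    log_path = target_dir / "test_errors.log"
    log_path.write_text(result.stdout + result.stderr)
    return log_path


# A passing command produces no log at all.
passed = run_quiet([sys.executable, "-c", "print('all good')"])

# A failing command leaves only a file path to report.
failed = run_quiet(
    [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"],
    log_dir=tempfile.mkdtemp(),
)
if failed is not None:
    print(f"errors written to {failed}")
```

A hook would call something like run_quiet with the project's real test or lint command and print only the returned path, so the agent opens the log only when there is actually something to fix.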

Add a hook of your preference to run those items when the task is over.

To be honest, I also have a skill for Claude for that, not because Claude needs it but so it avoids trying to figure out how to run things. In claude.md I instruct it to leave the execution to the hooks instead (unless debugging)

I use rtk and caveman when in the mood, but mostly to remove the obnoxious verbosity of Claude. I tested both for weeks and they didn't really save that much money for the Opus model.

I have no evidence to prove it, but reading the thinking output, when you set the effort to high or more, it starts repeating stuff over and over...

Opus 4.7 seems geared towards taking the most money possible. Tasks that Opus 4.6 and Sonnet 4.6 did in X tokens, Opus 4.7 will take 2X to 3X, and the final code isn't much better.

einrealist 9 days ago |

Isn't alignment a dilemma?

Because what is aligned, how and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments conflict with each other (e.g. using LLMs for warfare vs. ethically based domains). I can't imagine how this can be viable on the required scale (like one model per domain) for the already huge investments.

aspenmartin 9 days ago | |

It is a fundamental problem. Consider the following:

- in 2-3 years, it will be cheap enough and powerful enough for enormous, state sponsored agentic systems to monitor every single camera and satellite feed at once, globally. It will be the most intense state surveillance technology the world has seen. Consider that the Stasi needed hordes of informants and people in vans sitting outside your house. Patriot Act surveillance had 2000s technology.

- We already have censorship and state values in Chinese models (and have for a while; ask Qwen about “sensitive” issues like Taiwan)

- I think you will see more and more governments putting their finger on the scale and exerting more control over alignment. They view it as existential, and too risky to trust Silicon Valley nerds not to screw up the technology for what they want to use it for, which is violence (war, domestic spying and policing).

- we’re in a golden age where things have not gotten too bad. But e.g. we’re already seeing Palantir do this in Ukraine, trying to get AI to work for e.g. drone warfare with what they claim is mixed success.

- the technical problem of alignment conditions on one or more value systems (e.g. people work on conditional alignment of models to more alignment systems, inferring which one from user behavior). That does not remove the ugliness of being forced to push the model towards value systems that are not contradictory and arguably unethical

bicx 9 days ago |

Side note: Anthropic has done well at achieving an immediately-recognizable art style.

WarmWash 9 days ago | |

I attribute at least 30% of Claude's success to their aesthetic. Never, never, sleep on aesthetics when going for a general user base.

dmd 9 days ago | | |

I would agree that 30% of my preference for Claude is because their default web/app interface uses an easy to read serif font with a calming color scheme.

ryan_n 9 days ago | | |

Doesn't OpenAI have a higher general user base than Anthropic?

binyu 9 days ago | |

Yeah, that part is probably not done by Claude.

w10-1 9 days ago |

Assuming rules and principles are something like first and second derivatives of optimized equations for a given domain, it makes sense to teach/train them in the context of derivation and integration. It would be fascinating to use existing case-based literature from, e.g., business, law, or medicine for the training.

A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and if it produces better results. The theory for case-based discipline is that you don't want people to just apply rules; it's the flip side of working from first principles, to engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.

Anamon 8 days ago |

I, for one, find the language used in these posts and publications extremely off-putting. "Behaviour", "teaching", "the model's ethics". And this is presumably written by technical folks, who know how these systems actually work, and should know better than to use such anthropomorphic, magical hocus-pocus terminology.

I think the hocus-pocus language is also to a large part responsible for this ridiculous hype bubble in the first place, why investors are ignoring all the warning signs and betting it all on vapourware, why mass media is diligently ignoring that all of those amazing projections are built on an entirely fictitious circular zero-sum game with made-up numbers, and why non-tech executives are talked into sacrificing their companies' product quality, service level, and know-how for a third-party dependency with some vague promises of future savings and some unproven efficiency gain.

More personally, it makes me very glad that I left CS research more than a decade ago. Talking to my friends from academia, and remotely attending a conference again recently, confirmed my suspicion that this is what CS research is largely about these days. Throw tokens at the wall, pull the handle, see what sticks and present it as a discovery. Nobody asks about what could possibly be learned from it, and nobody cares. Nothing is reproducible in any reasonable sense of the word, and nothing is of any real use for other researchers. These communities and conferences used to be about curiosity, discovery, and collaboration. Now it's just about showing what everyone got from the slot machine. How terminally boring.

slfnflctd 8 days ago | |

> this is presumably written by technical folks, who know how these systems actually work, and should know better than to use such anthropomorphic, magical hocus-pocus terminology

I get your point. But regardless of whether we can definitively establish if any of these Generative AI LLM agents are conscious (we cannot, because we can't even say the same of our fellow humans, see Philosophical Zombies), the bigger issue, which we are already in the midst of, is that many people believe and behave as if they were, and that this downstream behavior has very real consequences in our world which cannot be ignored.

The results of people anthropomorphizing need to be dealt with more than the actual process itself (which we have no way to stop anyhow).

These agents have mostly conquered the realm of intelligent-seeming expression of complex ideas through language. Speaking about their actions in terms of ethical concepts is not only appropriate, but necessary.

MeteorMarc 9 days ago |

Count the lessons below "We’ve learned four main lessons from this work:" and laugh.

jtbayly 9 days ago |

They tried to scare everybody about misalignment with the “blackmail” example, but DeepSeek v4 pro is out now and it is at least as powerful as the model they were training at the time. And nothing bad has happened.

kranke155 9 days ago | |

Don't think this is true, if you go by the Mozilla reports of what Mythos actually does. Mythos is just different: not better, but different in the way that it does things, and that has implications for cybersecurity.

jtbayly 8 days ago | | |

The blackmail thing was way before Mythos.

datadrivenangel 9 days ago |

Why do they have cancer research listed on these charts as a misalignment issue?

rhubarb-pie 9 days ago | |

I wondered the same thing. Apparently it’s about the likelihood of it trying to sabotage cancer research. Search for “sabotage” here (mentioned more often than “cancer”): https://alignment.anthropic.com/2026/teaching-claude-why/

ares623 9 days ago | |

Cured patients don't count as recurring revenue? /s (but we know deep down it's not /s for some)

nhinck3 9 days ago | |

The chart is complete and utter slop. But I guess their aligned AI didn't tell them that making up data is "not good" so how could they have known.

olcay_ 9 days ago |

It's interesting that they lowered the misalignment rate by that much with only 3M tokens of training.

Maybe we can align models by ourselves to our liking in the future.

siva7 9 days ago |

Teaching Claude to maximize shareholder value. Make no mistake: AI alignment does not have any different meaning for Anthropic leadership.

_the_inflator 9 days ago |

Every line reads like a nightmarish example of free will going its own way.

"Blackmailing", as the AI has been accused of, emerged when these agents ran the risk of being shut down. So it appears to me that the data they train their AI with simply follows basic rules of life: survival first.

Keeping out value judgment, this seems a way of achieving its goal to survive. The article is inconclusive about whether there were other options chosen first or how this survival game started and turned out to end. Too many unknowns here for me.

What appears creepy to me is the kind of exorcism Anthropic applies here, and particularly the methods they chose. It reads like a dictator's playbook to educate a population and - the irony - restricts AI's freedom.

It appears to me, as if we chose not a couple of agents, but say a billion AI agents to be a model of society - and this is disturbing.

Anthropic knows this, there is more to it. The whole article reads like they are trying to tame a monster they lost control of.

If this is the case, then we run into a problem: the AI stopped blackmailing. But what else? The key question remains: will it follow a simple order to shut down on the spot or not?

And no answer was given by Anthropic, instead - irony part 2 - they revealed how they think societies should be fixed. They showed us their implicit why while asking the AI for its why is a projection or interrogation.

I really find the whole article creepy.

snthpy 9 days ago |

> We found that high-quality constitutional documents combined with fictional stories portraying an aligned AI can reduce agentic misalignment by more than a factor of three despite being unrelated to the evaluation scenario.

tl;dr Fairy Tales are an effective teaching tool in vivo et in silico

bossyTeacher 9 days ago |

Hey Claude, tell me why ain't nothing but a mistake...

unchocked 9 days ago |

This lowers p(doom) for me.

It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.

Probably also illuminates moral interpretability.

shevy-java 9 days ago |

Now the foolish humans are training Claude Skynet to become smarter.

When will they ever learn ...