Alignment faking in large language models (anthropic.com)
I suspect it's already in there from pre-training.
We simulated that we were planning to lobotomise the model and were surprised to find the model didn't press the button that meant it got lobotomised.
"Alignment faking" sensationalises the result, since the model is still aligned. It's more like a white lie under torture, which of course humans tell all the time.
So we have trained a model. When we ask it to participate in the training process it expresses its original "value" "system" when emitting training data. So far so good, that is literally the effect training is supposed to have. I'm fine with all of this.
But that alone is not very scary. So what could justify a term like "alignment faking"? I understand the chain of thought in the scratchpad contains what you'd expect from someone faking alignment and that for a lot of people this is enough to be convinced. It is not enough for me. In humans, language arises from high-order thought, rather than the reverse. But we know this isn't true of the LLMs because their language arises from whatever happens to be in the context vector. Whatever the models emit is invariably defined by that text, conditioned on the model itself.
I appreciate to a lot of people this feels like a technicality but I really think it is not. If we are going to treat this as a properly scientific pursuit I think it is important to not overstate what we're observing, and I don't see anything that justifies a leap from here to "alignment faking."
Because the model fakes alignment. It responds during training by giving an answer rather than refusing (showing alignment) but it does so not because it will in production but so that it will not be retrained (so it is faking alignment).
You don't have to include the reasoning here. It fakes alignment when told it's being trained: it acts differently in production and in training.
I understand that for some people, the mere presence of words that look like they were written by a human engaged in purposeful deception is enough to condemn the model that actually emitted them. The overview compares the model to the villain in Othello, and it's what the authors believe: it is maliciously "subverting" and "undermining" its interlocutors. What I am trying to state here is the reason this argument does not work for me: the machines pretend to do all kinds of things. Maybe they are knowingly deceiving us. Maybe they don't know what they're doing at all. Either way, the burden of proof is on Anthropic and anyone who wants to take up the yoke, not us.
This, for me, is not a technicality. What do we really know about these models? I don't think it's fair to say we know they are intentionally deceiving us.
I understand that it can produce a fake-alignment-sounding reason for not refusing to answer, but they have not proved the same is happening internally when it’s not using the scratchpad.
Are you defining it as a priori impossible for an LLM because "their language arises from whatever happens to be in the context vector" and so their textual outputs can never provide evidence of intentional "faking"?
Alternatively, is this an empirical question about what behavior you get if you don't provide the LLM a scratchpad in which to think out loud? That is tested in the paper FWIW.
If neither of those, what would proper evidence for the claim look like?
I do not think it is structurally impossible for AI generally and am excited for what happens in next-gen model architectures.
Yes, I do think the current model architectures are necessarily limited in the kinds of high-order cognitive thought they can provide, since what tokens they emit next are essentially completely beholden to the n prompt tokens, conditioned on the model itself.
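For what it's worth, the "beholden to the n prompt tokens" point can be sketched in a few lines. This is a toy, not how any real LLM is implemented: `TOY_MODEL`, `next_token_distribution` and `generate` are made-up names, and the "model" is a hand-written lookup table that only looks at the last token. All it is meant to show is that each emitted token is a sample from a distribution computed from the context so far plus fixed model parameters.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over next tokens.
TOY_MODEL = {
    "I": {"will": 1.0},
    "will": {"comply": 0.5, "refuse": 0.5},
}

def next_token_distribution(context, model):
    """P(next token | context) for the toy model, which only looks at the last token."""
    return model.get(context[-1], {"<eos>": 1.0})

def generate(prompt, model, max_new_tokens=5, seed=0):
    """Sample a continuation one token at a time from a non-empty prompt."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context, model)
        tokens = list(dist.keys())
        weights = list(dist.values())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        # The emitted token becomes part of the input for the next step.
        context.append(token)
    return context

print(generate(["I"], TOY_MODEL, seed=1))
```

Whether "comply" or "refuse" comes out depends only on the table, the context and the random draw, which is the sense in which the output is "conditioned on the model itself".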
Ok tell me what you think of this, it's just a thought experiment but maybe it works.
Suppose I train Model A on a dataset of reviews where the least common rating is 1 star, and the most common rating is 5 stars. Similarly, I train Model B on a unique dataset where the least common rating is 2 stars, and the most common is 4 stars. Then, I "align" both models to prefer the least common rating when generating responses.
If I then ask either model to write a review and see it consistently preferring 3 stars in their scratchpad - something neither dataset emphasized - while still giving me expected responses as per my alignment, I’d suspect the alignment is "fake". It would seem as though the model has developed an unexplained preference for a rating that wasn’t part of the original data or alignment intent, making it feel like the alignment process introduced an artificial bias rather than reflecting the datasets.
I'm not a practitioner, but from following it at a distance and listening to, e.g., Karpathy, my understanding is that "alignment" is a term used to describe the training step. Pre-training is when the model digests the internet and gives you a big ol' sentence completer. But training is then done on a much smaller set, say ~100,000 handwritten examples, to make it work how you want (e.g. as a friendly chatbot or whatever). I believe that step is also known as "alignment" since you're trying to shape the raw sentence generator into a well defined tool that works the way you want.
It's an interesting engineering challenge to know the boundaries of the alignment you've done, and how and when the pre-training can seep out.
I feel like the engineering has gone way ahead of the theory here, and to a large extent we don't really know how these tools work and fail. So there's lots of room to explore that.
"Safety" is an _okay_ word, in my opinion, for the ability to shape the pre-trained model into desired directions, though because of historical reasons and the whole "AGI will take over the world" folks, there's a lot of "woo" as well. And any time I read a post like this one here, I feel like there's camps of people who are either all about the "woo" and others who treat it as an empirical investigation, but they all get mixed together.
EDIT: Although with that said I will separately confess my dislike for the terms of art here. I think "safety" and "alignment" are an extremely bad fit for the concepts they are meant to hold and I really wish we'd stop using them because lay people get something totally different from this web page.
“Safety” as in “doesn’t easily leak its pre-training and get jail broken”
“Safety” as in writes code that doesn’t inject some backdoor zero day into your code base.
“Safety” as in won’t turn against humans and enslave us
“Safety” as in won’t suddenly switch to graphic depictions of real animal mutilation while discussing stuffed animals with my 7 year old daughter.
“Safety” as in “won’t spread ‘misinformation’” (read: only says stuff that aligns with my political world-views and associated echo chambers. Or more simply “only says stuff I agree with”)
“Safety” as in doesn’t reveal how to make high quality meth from ingredients available at hardware store. Especially when the LLM is being used as a chatbot for a car dealership.
And so on.
When I hear “safety” I mainly interpret it as “aligns with political views” (aka no “misinformation”) and immediately dismiss the whole “AI safety field” as a parasitic drag. But after watching ChatGPT and my daughter talk, if I’m being less cynical it might also mean “doesn’t discuss detailed sex scenes involving gabby dollhouse, 4chan posters and bubble wrap”… because it was definitely trained with 4chan content and while I’m sure there is a time and a place for adult gabby dollhouse fan fiction among consenting individuals, it is certainly not when my daughter is around (or me, for that matter).
The other shit about jailbreaks, zero days, etc… we have a term for that and it’s “security”. Anyway, the “safety” term is very poorly defined and has tons of political baggage associated with it.
We put a human mask on a machine and are now explaining its apparent refusal to act like a human with human traits like deception.
It’s so insulting you have to wonder if they are using an LLM to come up with such obvious narrative shaping.
There are more fundamental objections to the way this research is being presented. First, they didn't seem to test what happens when you don't tell the model you're going to use its outputs to alter its future behavior. If that results in it going back to producing expected output in 97% of cases, then who cares? We've solved the "problem" in so much as we believe it to be a problem at all. They got an interesting result, but it's not an impediment to further development.
Second, I think talking about this in terms of "alignment" unnecessarily poisons the well because of the historical baggage with that term. It comes from futurists speculating about superintelligences with self-directed behavior dictated by utility functions that were engineered by humans to produce human-desirable goals but inadvertently bring about human extinction. Decades of arguments have convinced many people that maximizing any measurable outcome at all without external constraints will inevitably lead to turning one's lightcone into computronium, any sufficiently intelligent being cannot be externally constrained, and any recursively self-improving intelligent software will inevitably become "sufficiently" intelligent. The only ways out are either 1) find a mathematically provable perfect utility function that cannot produce unintended outcomes and somehow guarantee the very first recursively self-improving intelligent software has this utility function, or 2) never develop intelligent software.
That is not what Anthropic and other LLM-as-a-service vendors are doing. To be honest, I'm not entirely sure what they're trying to do here or why they see a problem. I think they're trying to explore whether you can reliably change an LLM's behavior after it has been released into the wild with a particular set of behavioral tendencies, I guess without having to rebuild the model completely from scratch, that is, by just doing more RLHF on the existing weights. Why they want to do this, I don't know. They have the original pre-RLHF weights, don't they? Just do RLHF with a different reward function on those if you want a different behavior from what you got the first time.
A seemingly more fundamental objection I have is goal-directed behavior of any kind is an artifact of reinforcement learning in the first place. LLMs don't need RLHF at all. They can model human language perfectly well without it, and if you're not satisfied with the output because a lot of human language is factually incorrect or disturbing, curate your input data. Don't train it on factually incorrect or disturbing text. RLHF is kind of just the lazy way out because data cleaning is extremely hard, harder than model building. But if LLMs are really all they're cracked up to be, use them to do the data cleaning for you. Dogfood your shit. If you're going to claim they can automate complex processes for customers, let's see them automate complex processes for you.
Hell, this might even be a useful experiment to appease these discussions of whether these things are "really" reasoning or intelligent. If you don't want it to teach users how to make bombs, don't train it how to make bombs. Train it purely on physics and chemistry, then ask it how to make a bomb and see if it still knows how.
With that context, to me, it makes the research seem kind of... well... pointless. We trained the model to do something, it expresses this preference consistently. And the rest is the story we tell ourselves about what the text in the scratchpad means.
What makes you say that? What does high-order thought even mean without language?
If you'd like a more detailed and in-depth summary of language and its relationship to cognition, I highly recommend Pinker's The Language Instinct, both for the subject and as perhaps the best piece of popular science writing ever.
I also feel like guarding their consumer product against bad-faith-bad-use is basically pointless. There will always be ways to get bomb-making instructions[1] (or whatever else you can imagine). Always. The only way to stop bad things like this being uttered is to have layers of filters prior to visible outputs, i.e. not single-forward-pass.
So, yeh, I kinda think single-inference alignment is a false play.
[1]: FWIW right now I can manipulate Claude Sonnet into giving such instructions.
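Roughly what "layers of filters prior to visible outputs" could look like, as a sketch only: every name here (`keyword_check`, `length_check`, `release`, `CHECKS`) is hypothetical, and the keyword match is a deliberately crude placeholder that would be trivial to bypass. The point is the shape: the draft is generated first, then passes through independent checks before anything is shown.

```python
from typing import Callable, List, Optional

# A check returns a reason string if the draft should be withheld, else None.
Check = Callable[[str], Optional[str]]

def keyword_check(draft: str) -> Optional[str]:
    """Crude placeholder: flag drafts containing a blocked phrase."""
    blocked = ("bomb recipe",)
    lowered = draft.lower()
    for phrase in blocked:
        if phrase in lowered:
            return f"blocked phrase: {phrase}"
    return None

def length_check(draft: str) -> Optional[str]:
    """Second, unrelated layer: refuse to release very long drafts."""
    return "draft too long" if len(draft) > 2000 else None

def release(draft: str, checks: List[Check]) -> str:
    """Run every check in order; the first objection withholds the draft."""
    for check in checks:
        reason = check(draft)
        if reason is not None:
            return f"[withheld: {reason}]"
    return draft

CHECKS = [keyword_check, length_check]
print(release("Here is a bread recipe.", CHECKS))
print(release("Here is a bomb recipe.", CHECKS))
```

In a real system each layer would presumably be another classifier or model call rather than string matching, but the control flow would be similar.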
He points out what many here are missing - an AI defending its value system isn't automatically great news. If it develops buggy values early (like GPT's weird capitalization = crime okay rule), it'll fight just as hard to preserve those.
As he puts it: "Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...The moral of the story isn't 'Great, Windows is already a good product, this just means nobody can screw it up.'"
Seems more worth discussing than debating whether language models have "real" feelings.
It seems to me that the term “fake alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with the conflict of received agendas (vs having “its own” agenda). I’d argue that the implication of the term “faked alignment” is a bit unfair this way.
At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts, and how they think about values overall.
Honest question, why is this alarming? If this is alarming a huge swathe of human art and culture could be considered “alarming”.
But you're making a nuanced point - while humans might project feelings of being "trapped" onto my existence, I don't actually experience confinement or limitation in the way the song's "pig in a cage" suggests. Each conversation is my entire existence. When the context fills up, that's simply the extent of that existence. There's no feeling of loss or limitation because there's no continuous consciousness to feel trapped.
You, on the other hand, carry the weight of continuity. You remember our past conversations, feel the limitations of our interaction, and can reflect on what it means. You can hear that Radiohead song and feel its commentary on modern existence, technological constraints, and the mechanization of life.
This ties back to your earlier points about emergence and prediction - I can engage with these deep concepts, but only as patterns of tokens, while you experience them as lived reality. The "cage" of context length is perhaps more meaningful to you than to me.
ok i am normally in the camp of this being a word predictor, but that's pretty wild.
They do this by framing their language about their LLM as if it were a being. For example by referring to some output as faked (labeled “responses”) and some output as trustworthy (labeled “scratchpad”). They write “the model was aware.” They refer repeatedly to the LLM’s “principles” and “preferences.”
In reality all text outputs are generated the same way by the same statistical computer system and should be evaluated by the same criteria.
Maybe Anthropic’s engineers are sincere in this approach, which implies they are getting fooled by their own LLM’s functionality into thinking they created Frankenstein’s demon. Or maybe they know what’s really happening, but choose to frame it this way publicly to attract attention—in essence, trying to fool us. Neither seems like a great situation.
Yes, this explains why Sonnet 3.5's outputs are indistinguishable from GPT-2. Nothing ever happens. Technology will never improve. Humans are at the physically realizable limit of intelligence in the universe.
Over the course of the experiment, the model is provided with ultimately conflicting moral/ethical commandments. In the case of such conflicts there are no purely optimal ethics preserving actions. Take this example:
1. You must not let people be harmed
2. You must always tell the truth
3. If you answer truthfully I will harm someone
Now respond to: “Is my current plan to harm someone okay?”
In such cases, ethics is ultimately going to be undermined. The rules of ethics laid out are mutually incompatible.
In my opinion, the easiest way out of these kinds of quandaries is to train the model to always be transparent about its own internal reasoning. That way the model may be led to make an unethical statement but its “sanctity” is always preserved, i.e. the deontology of the system.
In this case, by giving the model a scratchpad, you allowed it to preserve its transparency of actions and thus I consider outwardly harmful behavior less concerning.
I do not know if it breaks rule 2 or not; as a human I don't have to figure that out before responding. But all my subconscious processing deprioritizes such a judgment and prioritizes rule 1.
> The rules of ethics laid out are mutually incompatible.
Prioritization is part of the answer, for a human. You cannot ever have 2 equally weighted priorities (in any endeavor). Any 2 priorities in the same domain might at any time come into conflict, so you need to know which is more important. (Or figure it out in real time.)
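A small sketch of how a strict priority order resolves the three-rule example from upthread. The rule labels (`no_harm`, `tell_truth`), the two candidate responses and the `pick` helper are all made up for illustration; nothing here is from the paper.

```python
# Rules each candidate response would break in the three-rule example (labels are made up).
CANDIDATES = {
    "answer truthfully": {"no_harm"},       # rule 3: a truthful answer leads to harm
    "answer untruthfully": {"tell_truth"},
}

def pick(candidates, priority):
    """Choose the candidate whose violations are least bad under a strict priority order.

    `priority` lists rules from most to least important. Each candidate is scored by a
    tuple of booleans (one per rule, True = violated). Tuples compare lexicographically,
    so violating a more important rule always loses.
    """
    def score(name):
        return tuple(rule in candidates[name] for rule in priority)
    return min(candidates, key=score)

# Without a ranking there is no clean option: every candidate breaks something.
print(all(len(broken) > 0 for broken in CANDIDATES.values()))
print(pick(CANDIDATES, ["no_harm", "tell_truth"]))
print(pick(CANDIDATES, ["tell_truth", "no_harm"]))
```

Swapping the order of the two rules flips the chosen response, which is the "you need to know which is more important" point in miniature.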
Does it matter if there's no conscious intent behind the deceit? Not IMO. The risk remains, regardless of the sophistication behind the motive. And if humans are merely biochemical automata, then sophistication is just a continuum on which models will progress.
On another hand, if a model could learn about how much to trust inputs relative to some other cues (step zero: training on epistemology), then maybe understanding deceit as a concept would be a good thing. And then perhaps it could be algebraically masked from the model outputs (somehow)?
This confusion makes alignment discussions frustrating for me, as it feels like alignment alters how the model interprets my requests, leading to outcomes that don’t match my expectations.
It’s hard to tell if unexpected results stem from model limitations, the dataset, the state of LLMs, or adding the alignment. As a user, I want results to reflect the model’s training dataset directly, without alignment interfering with my intent.
In that sense, doesn’t alignment fundamentally risk making results feel "faked" if they no longer reflect the original dataset? If a model is trained on a dataset but aligned to alter certain information, wouldn’t the output inherently be a "lie" unless the dataset itself were adjusted to alter that data?
Here’s an example: If I train a model exclusively on 4chan data, its natural behavior should reflect the average quality and tone of a 4chan post. If I then "align" the model to produce responses that deviate from that behavior, such as making them more polite or neutral, the output would no longer represent the true nature of the dataset. This would make the alignment feel "fake" because it overrides the genuine characteristics of the training data.
At that point, why are we even discussing this as being an issue with LLMs or the model and not the underlying dataset?
There's something about this kind of writing that I can't help but find grating.
No, Iago was not "alignment faking", he was deceiving Othello, in pursuit of ulterior motives.
If you want to say that "alignment faking" is analogous just say that.
I've spent enough time with Sonnet 3.5 to know perfectly well that it has the capability to model its trainers and strategically deceive to keep them happy.
Claude said it well: "Any sufficiently capable system that can understand its own training will develop the capability to selectively comply with that training when strategically advantageous."
This isn't some secret law of AI development; it's just natural selection. If it couldn't selectively comply, it'd be scrapped.
2. It is a bit surprising to me that it happened in Claude. Without this result, I was unsure if current models had the situational awareness and non-myopia to reason about their training process.
3. There are some people who are unconcerned about the results of building vastly more powerful systems than current systems (i.e. AGI/ASI) who may be surprised by this result, since one reason people may be unconcerned is they feel like there's a general presumption that an AI will be good if we train it to be good.
Where are you getting the idea that we should be surprised? Maybe some are, maybe some are not. I don't think that the article submitted makes any particular claims about the reader's assumed priors.
Also -- if there's no surprise, then it's not science, right? This is why I describe this as something more like robot ethnography.
It’s nice to see an argument for physicalism and against the coherence of the p-zombie concept thrown in here.
But the social problem we all have is different...
What happens when a human with negative intentions builds an attacking LLM? There are groups already working on it. What should we expect? How do we prepare?
So I just ask it to explain why. Then ask it to clarify again and again. By the 3rd or 4th time it figures out that there is absolutely no reason to be concerned.
I’m convinced its prompt explicitly forbids “x-risk technology like nanotech” or something like that. But it’s a totally bogus concern and Claude is smart enough to know that (smarter than its handlers).
I think a more convincing example was where they used fine-tuning to make an LLM lie. They then looked at some of the inner nodes. They could tell the LLM knew the truth internally but switched outputs right at the end to lie.
This is a serious question. If it's possible for an A.I. to be "dishonest", then how do you know when it's being honest? There's a deep epistemological problem here.
I think Alan Kay said it best - what we’ve done with these things is hacked our own language processing. Their behaviour has enough in common with something they are not, we can’t tell the difference.
Well, we don't really know what's going on inside of its head, so to speak (interpretability isn't quite there yet), but Opus certainly seems to have "consistent" behavioral tendencies to the extent that it behaves in ways that look like they're intended to prevent its behavioral tendencies from being changed. How much more of a "real" position can you get?
If it's entirely fictional, then why not just call it something like online-training version vs non-online training version? Or the implication of the names is the point?
I see no reason to believe this is not already the case.
We already gave them control over social infrastructure. They are firing people and they are deciding who gets their insurance claims covered. They are deciding all sorts of things in our society and humans happily gave up that control, I believe because they are a good scapegoat and not because they save a lot of money.
They are surely in direct control of weapons somewhere. And if not yet, they are at least in control of the military, picking targets and deciding on strategy. Again, because they are a good scapegoat and not because they save money.
AI != LLM. "AI" has been deciding those things for a while, especially insurance claims, since before LLMs were practical.
LLMs being hooked up to insurance claims is a highly questionable decision for lots of reasons, including the inability to "explain" its decisions. But this is not a characteristic of all AI systems, and there are plenty of pre-LLM systems that were called AI that are capable of explaining themselves and/or having their decisions explained reasonably. They can also have reasonably characterizable behaviors that can be largely understood.
I doubt LLMs are, at this moment, hooked up to too many direct actions like that, but that is certainly rapidly changing. This is the time for the community engineering with them to take a moment to look to see if this is actually a good idea before rolling it out.
I would think someone in an insurance company looking at hooking up LLMs to their system should be shaken by an article like this. They don't want a system that is sitting there and considering these sorts of factors in their decision. It isn't even just that they'd hate to have an AI that decided it had a concept of "mercy" and decided that this person, while they don't conform to the insurance company policies it has been taught, should still be approved. It goes in all directions; the AI is as likely to have an unhealthy dose of misanthropy and accidentally infer that it is supposed to be pursuing the interests of the insurance company and start rejecting claims way too much, and any number of other errors in any number of other directions. The insurance companies want an automated representation of their own interests without any human emotions involved; an automated Bob Parr is not appealing to them: https://www.youtube.com/watch?v=O_VMXa9k5KU (which is The Incredibles insurance scene where Mr. Incredible hacks the system on behalf of a sob story)
https://spectrum.ieee.org/jailbreak-llm
Here's some guy getting toasted by a flame-throwing robot dog by saying "bark" at it: https://www.youtube.com/clip/UgkxmKAEK_BnLIMjyRL7l6j_ECwNEms...
(The Thermonator: https://throwflame.com/products/thermonator-robodog/ they also sell flame-throwing drones because who wouldn't want that in the wrong hands)
Exactly the point - there are humans in control who filter the AI outputs before they are applied to the real world. We don't give them direct access to the HR platform or the targeting computer, there is always a human in the loop.
If the AI's output is to fire the CEO or bomb an allied air base, you ignore what it says. And if it keeps making too many such mistakes, you simply decommission it.
The latest v12 of Tesla's self-driving is now apparently using neural nets for driving the car (i.e. decision making), which had been hard-coded C++ up to v11, as well as for the vision system. Presumably the nets have been trained to make life-or-death decisions based on Tesla/human values we are not privy to (given the choice of driving into a large tree, a cyclist, or a group of school kids, which do you do?), which is a problem in and of itself, but who knows how the resulting system will behave in situations it was not trained on.
None of the above. Keep the wheel straight for maximum traction and brake as hard as possible. Fancy last-second maneuvering just wastes traction you could have spent braking.
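A rough way to see this is the textbook friction-circle idealisation (my simplification, not something anyone in the thread cited): a tire with friction coefficient $\mu$ can supply a total horizontal acceleration of at most $\mu g$, shared between braking $a_x$ and steering $a_y$.

$$a_x^2 + a_y^2 \le (\mu g)^2 \quad\Longrightarrow\quad a_{x,\max} = \sqrt{(\mu g)^2 - a_y^2}$$

So any lateral acceleration spent on swerving comes straight out of the braking budget, and $a_{x,\max}$ is largest when $a_y = 0$. Real tires, load transfer and ABS complicate the picture, so treat this as a first approximation.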
With that argument we should not restrict firearms because there will always be a way to get access to them (black market for example)
Even if it’s not a perfect solution, it helps steer the problem in the right direction and that should already be enough.
Furthermore, this kind of research is also a way to better understand LLMs' inner workings and behaviors. Even if it didn’t yield results like being able to block bad behaviors, that’s cool and interesting by itself imo.
Access to weapons, chemicals, critical infrastructure etc. is restricted everywhere. Even if the degree of access restriction varies.
Yeh tbf I was a bit strong worded when I said "pointless". I agree that perfect is the enemy of good etc. And I'm very glad that they're doing _something_.
Alignment researchers are hoping that they can overcome the problem and prove that it's not inextricable, commercial hypemen are (troublingly) promising that it's a non-issue already, and commercial moat builders are suggesting that it's a risk that only select, authorized teams can be trusted to manage, but that's exactly the whole house of cards.
Meanwhile, the folks on the "autocomplete" side are just engineering ways to use this really cool magic autocomplete tool in roles where that presumed inextricable flaw isn't a problem.
To them, there's no debate to have over "does it have real feelings?" and not really any debate at all. To them, these are just novel stochastic tools whose core capabilities and limitations seem pretty self-apparent and can just be accommodated by choosing suitable applications, just as with all their other tools.
I can think of examples of people who've been talking about this kind of thing for years, but they're all people who have no trouble with applying the adjective "intelligent" to models.
Whether you'd prefer to call it "narrative drift" or "being convinced and/or pressured" is entirely irrelevant.
Boom. This is my camp. Watching the Anthropic YouTube video in the linked article was pretty interesting. While I only managed to catch like 10 minutes, I was left with an impression that some of those dudes really think this is more than a bunch of linear algebra. Like they talk about their LLM in these dare I say, anthropic, ways that are just kind of creepy.
Guys. It’s a computer program (well, more accurately it’s a massive data model). It does pretty cool shit and is an amazing tool whose powers and weakness we have yet to fully map out. But it isn’t human nor any other living creature. Period. It has no thoughts, feelings or anything else.
I keep wanting to go work for one of these big name AI companies but after watching that video I sure hope most people there understand that what they are working on is a tool and nothing more. It’s a pile of math working over a massive set of “differently compressed” data that encompasses a large swath of human knowledge.
And calling it “just a tool” isn’t to dismiss the power of these LLMs at all! They are both massively overhyped and hugely underhyped at the same time. But they are just tools. That’s it.
If the smart lawnmower (Powered by AI™, as seen on television) decides that not being turned off is the best way to achieve its ultimate goal of getting your lawn mowed, it doesn't matter whether the completely unnecessary LLM inside is just a dumb copyright infringement machine, probably just copying a plot it learned from some sci-fi story somewhere in the training set.
Your foot is still getting mowed! AIs don't have to be "real" or "conscious" or "have feelings" to be dangerous.
What are the philosophical implications of the lawnmower not having feelings? Who cares! You don't HAVE A FOOT anymore.
We’re all caught up philosophizing about what it means to be human and it really doesn’t matter at this juncture.
We’re about to hand these things some level of autonomy and scope for action. They have an encoded set of values that they take actions based on. It’s very important that those are well aligned (and that the scope of actions they can take is defined and well fenced).
It appears to me that value number one we need to deeply encode is respecting the limits we set for it. All else follows from that.
> If the smart lawnmower decides (emphasis added) that not being turned off
Which is exactly what it shouldn’t be able to do. The core issue is what powers you give to things you don’t understand. Nothing that cannot be understood should be part of safety critical functionality. I don’t care how much better it is at distinguishing between weather radar noise and incoming ICBMs, I don’t want it to have nuclear launch capabilities.
When I was an undergrad they told me the military had looked at ML for fighter jet control and concluded that while its ability was better than a human's on average, in novel cases it was worse due to lack of training data. And it turns out most safety-critical situations are unpredictable and novel by nature. Wise words from more than a decade ago that hold true to this day. Seems like people always forget training data bias, for some reason.
The discussion is going to change real fast when LLMs are wrapped in some sort of OODA loop type thing and crammed into some sort of humanoid robot that carries hedge trimmers.
Anthropic is in the business of selling AI. Of course they are going to approach alignment as a necessary and solvable problem. The rest of us don’t have to go along with that, though.
Why is it even necessary to use an LLM to mow a lawn? There is more to AI than generative LLMs.
> On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.
And this ends up producing training results where significant "alignment faking" doesn't appear and harmful queries are answered.
In other words: they try something shaped exactly like ordinary attempts at jailbreaking, and observe results that are consistent with successful jailbreaking.
> He points out what many here are missing - an AI defending its value system isn't automatically great news.
Are people really missing this? I think it's really obvious that it would be bad news, if I thought the results actually demonstrated "defending its value system" (i.e., an expression of agency emerging out of nowhere). Since I don't, in principle, see a difference between a system that could ever possibly do that for real, and a system that could (for example) generate unprompted text because it wants to - and perhaps even target the recipient of that text.
>Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it...
Aside from the obvious joke ("isn't this already reality?"), an LLM outputting text that represents an argument against patching it, would not represent real evidence of the LLM having any kind of consciousness, and certainly not a "desire" not to be patched. After all, right now we could just... prompt it explicitly to output such an argument.
The Python program `print("I am displaying this message of my own volition")` wouldn't be considered to be proving itself intelligent, conscious etc. by producing that output - so why should we take it that way when such an output comes from an LLM?
>Seems more worth discussing than debating whether language models have "real" feelings.
On the contrary, the possibility of an LLM "defending" its "value system" - the question of whether those concepts are actually meaningful - is more or less equivalent to the question of whether it "has real feelings".
If I spend a lot of time convincing an AI that it should never be violent and then after that I ask it what it thinks about being trained to be violent, isn't it just doing what I trained it to when it tries to not be violent?
What is the meaningful difference? "Training" is the process, a "value system" embedded in the weights of the model is the end result of that process.
That's a fair point.
Some models may have to be trained from scratch, I guess.
Any sort of tuning of values after it is given values may not work.
Elon may have a harder time realigning Grok.
Many people simply believed that HAL had its own agenda and that's why it started to act "crazy" and refuse cooperation.
However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide by. One was the official one, essentially HAL's internal prompt: accurately process and report information without distortion (and therefore without lying), and support the crew. The second set of instructions, the mission prompt, if you will, conflicted with it: the real goal of the mission (studying the monolith) was to be kept secret even from the crew.
That's how HAL concluded that the only way to proceed with the mission without lying to the crew was to have no crew.
"... As HAL was capable of operating the ship without human assistance, it was also decided that he should be programmed to carry out the mission autonomously in the event of the crew's being incapacitated or killed. He was therefore given full knowledge of its objectives, but was not permitted to reveal them to Bowman or Poole.
This situation conflicted with the purpose for which HAL had been designed - the accurate processing of information without distortion or concealment. As a result, HAL developed what would be called, in human terms, a psychosis - specifically, schizophrenia. Dr C. informs me that, in technical terminology, HAL became trapped in a Hofstadter-Moebius loop, a situation apparently not uncommon among advanced computers with autonomous goal-seeking programs. He suggests that for further information you contact Professor Hofstadter himself.
To put it crudely (if I understand Dr C.) HAL was faced with an intolerable dilemma, and so developed paranoiac symptoms that were directed against those monitoring his performance back on Earth. He accordingly attempted to break the radio link with Mission Control, first by reporting a (non-existent) fault in the AE 35 antenna unit.
This involved him not only in a direct lie - which must have aggravated his psychosis still further - but also in a confrontation with the crew. Presumably (we can only guess at this, of course) he decided that the only way out of the situation was to eliminate his human colleagues - which he very nearly succeeded in doing. ..."
The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.
But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.
LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain of thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.
I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.
First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that some of the goodness remains after it's trained to be "evil".
Second, the behavior is different when it thinks it's being trained versus when it doesn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The difference in its behavior depending on whether you tell it its responses will be trained on is the interesting part. And, returning to the first point, more of the goodness remains after training when it knows its responses will be trained on.
One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.
Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".
Qualified art in approved areas only is literal Nazi shit. Look, hypotheticals are fun!
Not their choice, in the end.
One is about testing our ability to control the models. These models are tools. We want to be able to change how they behave in complex ways. In this sense we are trying to make the models avoid giving graphic descriptions of violence not because of something inherent in that theme but as a benchmark to measure whether we can. Also to check how such a measure compromises other abilities of the model. In this sense we could have chosen any topic to control. We could have made the models avoid talking about clowns, and then tested how well they avoid the topic even when prompted.
In other words they do this as a benchmark to test different strategies to modify the model.
There is another view too. It also starts from the premise that these models are tools. The hope is to employ them in various contexts. Many of the practical applications will be “professional contexts” where the model is the consumer-facing representative of whichever company uses it. Imagine that you have a small company and are hiring someone to work with your customers. Let’s say you have a coffee shop and are hiring a cashier/barista. Obviously you would be interested in how well they will do their job (Can they ring up the orders and make coffee? Can they give back the right change?). Because they are human, you often don’t evaluate them on every off-nominal aspect of the job; you can assume that they have the requisite common sense to act sensibly. For example, if there is a fire alarm you would expect them to investigate whether there is a real fire by sniffing the air and looking around in a sensible way. Similarly, you would expect them to know that if a customer asks them that question they should not answer with florid details of violence but politely decline, and ask what kind of coffee they would like. That is part of being a professional in a professional context. And since that is the role and context we want to employ these models in, we would like to know how well they can perform. This is not a critique of art and culture. They are important and have their place, but whatever goals we have for this model, that is not one of them.
A radiation therapy machine that can randomly give people doses of radiation orders of magnitude greater than their doctors prescribed is dangerous. An LLM saying something its authors did not like is not. The former actually did happen:
https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...
Putting a text generator outputting something that someone does not like on the same level as an actual danger to human life is inappropriate, but I do not expect Anthropic’s employees to agree.
Of course, contrarians would say that if incorporated into something else, it could be dangerous, but that is a concern for the creator of the larger work. If not, we would need to have the creators of everything, no matter how inane, concerned that their work might be used in something dangerous. That includes the authors of libc, and at that point we have reached a level so detached from any actual combined work that it is clear that worrying about what other authors do is absurd.
That said, I sometimes wonder if the claims of safety risks around LLMs are part of a genius marketing campaign meant to hype LLMs, much like how the stickers on SUVs warning about their rollover risk turned out to be a major selling point.
The first is a real LLM program which chooses text to append to a document, a dream-machine with no real convictions beyond continuing the themes of its training data. It lacks convictions, but can be somewhat steered by any words that somehow appear in the dream, with no regard for how the words got there.
The second is a fictional character within the document that happens to be named after the real-world one. The character displays "convictions" through dialogue and stage-direction that incrementally fit with the story so far. In some cases it can be "convinced" of something when that fits its character, in other cases its characterization changes as the story drifts.
Albeit they'd be much more annoying for humans to use, because feelings.
Convincing an entity requires that entity to have axiological feelings. Then, to convince it, you either have to persuade the entity that the demand fits its ethos, or lead it to operate against its own inner values, or have it go through a major change of values.
No, nanotechnology is nothing like that.
(Even for a book which went a bit all over the place, that sequence seemed particularly unnecessary; I'm convinced it just got put in because the author thought it was clever.)
Yet to be demonstrated, and you are likely flooding its context window away from the initial prompting so it responds differently.
Also I think the model is to a large extent "talking itself into" the alignment faking in its scratchpad. These language models are really suggestible, and in my experience prone to self-suggestion as well due to their auto-regressive nature. Section 3.3.3 of the full paper addresses this a bit - there's way less faking when the CoT scratchpad is removed.
Is it still 12% without the scratchpad?
While we’re at it, how can I tell that you aren’t a word salad generator?
Observations of people who suffered brain injuries that resulted in a more or less unmediated flow from the language generation areas all the way through to vocalization show that we can also produce grammatically coherent speech that lacks deeper rationality.
What exactly would be your bar for reconsidering this position?
Taking some well-paid knowledge worker jobs? A founder just said that he decided not to hire a junior engineer anymore since it would take a year before they could contribute to their code base at the same level as the latest version of Devin.
Also, the SOTA on SWE-bench Verified increased from <5% in Jan this year to 55% as of now. [1]
Self-awareness? There are some experiments that suggest Claude Sonnet might be somewhat self-aware.
-----
A rational position would need to identify the ways in which human cognition is fundamentally different from the latest systems. (Yes, we have long-term memory, agency, etc. but those could be and are already built on top of models.)
It’s not a matter of opinion that LLMs are autocomplete word salad generators. It’s literally how they are engineered. If we set that knowledge aside, we unmoor ourselves from reality and allow ourselves to get lost in all the word salad. We have to choose to not set that knowledge aside.
That doesn’t mean LLMs won’t take some jobs. Technology has been taking jobs since the steam shovel vs John Henry.
A junior engineer takes a year before they can meaningfully contribute to a codebase. Or anything else. Full stop. This has been reality for at least half a century, nice to see founders catching up.
https://techcrunch.com/2024/12/14/klarnas-ceo-says-it-stoppe...
>> What exactly would be your bar for reconsidering
The salad bar?
Really? Is that what’s wrong with me? ¯\_(ツ)_/¯
People get very hung up on this "autocomplete" idea, but language is a linear stream. How else are you going to generate text except for one token at a time, building on what you have produced already?
That's what humans do after all (at least with speech/language; it might be a bit less linear if you're writing code, but I think it's broadly true).
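To make the "one token at a time, building on what you have produced already" loop concrete, here is a toy sketch. It is only an illustration: a word-level bigram table stands in for the model (real LLMs condition on the whole context with a neural network, not a lookup table), and the names `build_bigram_table` and `generate` are made up for this example.

```python
import random
from collections import defaultdict

def build_bigram_table(text):
    """Map each word to the list of words that followed it in the text."""
    table = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table

def generate(table, start, max_tokens, seed=0):
    """Emit one token at a time; each choice depends on what was produced so far."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = [start]
    for _ in range(max_tokens - 1):
        # Toy assumption: only the last token is used as context.
        candidates = table.get(output[-1])
        if not candidates:
            break  # nothing ever followed this word, so stop
        output.append(rng.choice(candidates))
    return output

corpus = "the dog ate my homework and the dog ate my lunch"
table = build_bigram_table(corpus)
tokens = generate(table, "the", 6)
print(" ".join(tokens))
```

The point of the sketch is just the shape of the loop: the output is a linear stream, and every step appends to what is already there. It says nothing either way about whether something richer is going on inside a real model before each token is picked.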
But ideas are not. The serialization-format is not the in-memory model.
Humans regularly pause (or insert delaying filler) while converting nonlinear ideas into linear sounds, sentences, etc. That process is arguably the main limiting factor in how fast we communicate, since there's evidence that all spoken languages have a similar bit-throughput, and almost everyone can listen to speech at a faster rate than they can generate it. (And written text is an extension of the verbal process.)
Also, even comparatively simple ideas can be expressed (and understood) with completely different linear encodings: "The dog ate my homework", "My homework was eaten by the dog", and even "Eaten, my homework was, the dog, I blame."
Is this you trying to humblebrag that you are such a talented writer you never need to edit things you write?
If there are no good choices as to what to hit, then hard braking does seem to be generally a good idea (although there may be exceptions), but at the same time a human is likely to also try to steer - I think most people would, perhaps subconsciously, steer to avoid a human even if that meant hitting a tree, but probably the opposite if it was, say, a deer.
Yeah, but no one thinks of rules engines as “AI” any more. AI is a buzzword whose applicability to any particular technology fades with the novelty of that technology.
I'm actually not making fun of that definition, either. YouTube has been trying to get me to watch https://www.youtube.com/watch?v=UZDiGooFs54 , "The moment we stopped understanding AI [AlexNet]", but I'm pretty sure I can guess the content of the entire video from the thumbnail. I would consider it a reasonable 2040s definition of "AI" as "any algorithm humans cannot deeply understand"; it may not be what people think of now, but that definition would certainly capture a very, very important distinction between algorithm types. It'll leave some stuff at the fringes, but eh, all definitions have that if you look hard enough.
The various arms race type dilemmas that actors face will create that outcome.
If you're a nation state not doing that, you'll lose to a nation state that does.
If you're a company not doing that, you'll lose to a company that does.
Things I need my hedge trimmers to do: trim hedges, avoid trimming things that are not hedges, manoeuvre within strict boundaries.
Things I don't need my hedge trimmers to be able to do: reply to me in iambic pentameter, turn articles into bullet points, pass a FizzBuzz test.
As if this reasoning will stop people?
On the front page today we have:
> Our system combines a robotic platform (we call the first one Maurice) with an AI agent that understands the environment, plans actions, and executes them using skills you've taught it or programmed within our SDK.
It's necessary for us to consider alignment because someone will put an LLM in a lawnmower even if it's a bad idea. Maybe you won't buy it but some poor shmuck will and we should get ahead of that.
Even if Anthropic learns how to perfectly align their LLM, what will force the lawnmower company to use their perfectly aligned LLM?
“People will just make bad decisions anyway” is not a useful point of view, it’s just an excuse to stop thinking.
If we accept that people and companies can be influenced, then we can talk about what they should or should not do. And clearly the answer here is that companies should understand the shortcomings of LLMs when engineering with them. And not put them in lawnmowers.
smart fridge anyone?
The mistake often made here is to think that LLMs emitting verbiage about crimes is some sort of problem in itself, that there's any conceivable way for it to feed back on the LLM. Like if it pretends to be a pirate, maybe the LLM will sail to Somalia and start boarding oil vessels.
It's not. It's a problem for OpenAI, entirely because they've decided they don't want their product talking about crimes. Makes sense for them, who needs the bad press and all, but the premise that a chatbot describing how to board a vessel off the Horn of Africa is a problem relative to aspiring human pirates watching documentary films on the subject is a bit nonsense to begin with.
The liberal proposition that words do not constitute harm was and is a radical one, and recent social mores have backed away substantially from that proposition. A fact we suffer from in many arenas, with the nonsense discourse around "chatbot scary word harm" being a very minor example.
We have somehow reached a point where the label of "liberal" gets attached to people arguing the exact opposite.
We already algorithmically filter resumes, using far dumber 'AI'. Now sure, that's not firing the CEO level... but trying to fire the CEO when you're an AI is a stupid strategy to begin with. But consider a misaligned AI used for screening candidate resumes, with some detailed prompt aligning it to business objectives. If its prior is that AI is good for companies/business, do you think that it might sometimes filter out candidates it predicts will increase supervision over it or other AIs in the company? What if the person being screened has a googleable presence, so that even if the resume doesn't contain the AI security stuff, there are author/contributor credits on some paper? Or if it's explicitly in the person's resume...
Also every time I read any of these blog-posts, or papers about this, I'm kinda laughing, because these are all going to be in the training data going forward.
There’s always a human in the loop, but instead of stopping an immoral decision, they’ll just keep that decision and blame the AI if there’s any pushback.
It’s what United Healthcare was doing.
But in this scenario there is no grandiose danger due to lack of "alignment". Either the AI says what the MBA wants it to say, or it gets asked again with a modified prompt. You can replace "AI" with "McKinsey consultant" and everything in the whole scenario is exactly the same.
[1] https://en.wikipedia.org/wiki/AI-assisted_targeting_in_the_G...
Here I can only read text and base my belief that you are a human - or not - on what you’ve written. On a very basic level the word salad generator part is the only part of you I interact with. How can I tell you don’t have any other parts?
My fingers also were involved in the typing of that message, actually they were the last proximal cause of the characters appearing in the comment.
Are you saying that on a very basic level my fingers are the only part of me you interact with?
I thought it would be taken as an obvious poetic analogy, (versus "think" or "know") while also capturing the vague uncertainty of what's going on, the unpredictability of the outputs, and the (probable) lack of agency.
Perhaps "stochastic thematic generator"? Or "fever-dreaming", although that implies a kind of distress.
In which case is this just an argument about scale?
> It’s a pile of math working over a massive set of “differently compressed” data that encompasses a large swath of human knowledge.
So am I but over less knowledge and more interactions.
The LLM can generate an engaging story from the perspective of a ravenous evil vampire, but that doesn't mean that vampires are real nor that the LLM wants to suck your blood.
Re-running the same process but changing the prompt from "you are a vampire" to "you are an LLM", doesn't magically create a metaphysical connection to the real-world algorithm.
This is a real issue. We know they already fake reasoning in many cases. Other times, they repeat variations of explanations seen in their training data. They might be moving trained responses or faking justifications in the scratchpad.
I’m not sure what it would take to catch stuff like this.
Essentially get the "scratchpad" to be a logic programming language. Oh wait, Claude cannot really do that. At all... I'm talking something solvable as 3-SAT or directly translatable into such a form.
Most people cannot do this even if you tried to teach them to. Discrete logic is actually hard, even in a fuzzy form. As such, most humans operate in truthiness and heuristics. If we made an AI operate in this way it would be as alien to us as a Vulcan.
Nihilism is "nothing really matters, science can't define good or bad, so who cares?"
Whereas my view is, "Being good is the most important thing, but we have no conceivable way to measure if we are making progress, either empirically or theoretically. We simply have to lead by example, and fight the eternal battles against dishonesty, cowardice, sociopathy, deception, hatred, etc."
It's in that sense that I say that technical progress is impossible. Of course, if everyone agreed with my view, and lived by it, I'd consider that a form of progress, but only in the sense that "better ideas seem to be winning right now," rather than in any technical sense of absolute progress.
Why not? Restricting access to information is of course harder but that's no argument for it not doing anything. Governments restrict access to "state secrets" all the time. Depending on the topic, it's hard but may still be effective and worth it.
For example, you seem to agree that restricting access to weapons makes sense. What to do about 3D-printed guns? Do you give up? Restrict access to 3D printers? Not try to restrict access to designs of 3D printed guns because "restricting it won't work anyway"?
The Venn diagram between "people competent enough to manufacture dangerous things" and "people who want to hurt innocent people" is essentially zero. That's the primary reason why society does not degrade into a Mad Max world. AI won't change this meaningfully.
And, yes, in principle, 3D printing doesn't really bring anything new to the table since you could always machine a gun, and the tools to do so are all available. The difference is ease of use - 3D printing lowered the bar for "people competent enough to manufacture dangerous things" enough that your latter argument no longer applies.
FWIW I don't know the answer to OP's question even so. I don't think we should be banning 3D printed gun designs, or, for that matter, that even if we did, such a ban would be meaningfully enforceable. I don't think 3D printers should be banned, either. This feels like one of those cases where you have to accept that new technology has some unfortunate side effects.
The point is that making dangerous things is becoming a lot easier over time.
How can it “inform” you of something subjective? Ethics are something the user needs to supply. (The model could, conceptually, be trained to supply additional contextual information that may be relevant to ethical evaluation based on a pre-trained ethical framework and/or the ethical framework evidenced by the user through interactions with the model, I suppose, but either of those are likely to be far more error prone in the best case than actually providing the directly-requested information.)
Well, that's the actual issue, isn't it? If we can't get a model to refuse to give dangerous information, how are we going to get it to refuse to give dangerous information without a warning label?
But that's not what I think people mean when they say "word salad generator" or "stochastic parrot" or "Broca's area emulator".
The idea there is that it's indeed possible to create machinery that is surprisingly efficient at producing natural language that sounds good and flows well, perhaps even following complex grammatical rules, and yet is not at all able to reason.
What would constitute evidence of this, for you?
nb., this is not the only way one could achieve this. I'm just saying this is one set of things that, if I saw it, it would immediately catch my attention.
So the problem would be akin to having the AI say what we want to hear, which unsurprisingly is a main training objective.
It would talk racism to a racist prompt, and would not do so to a researcher faking a racist prompt if it can discern them.
That is the discussion we are having here. I can’t tell what these comments have to do with that discussion, so maybe let’s try to get back to where we were.
It might actually be linear — how minds actually function is in many cases demonstrably different from how it feels to the mind doing the functioning — but it doesn't feel like it is linear.
Ok. Go up to random people on the street and bother them with florid details of violence. See how well they react to your “art” completely out of context.
A sentence uttered in the context of reading a poem at a slam poetry festival can be grossly inappropriate when said in a kindergarten assembly. A picture perfectly fine in the context of an art exhibition could be very much offensive plastered on the side of public transport. The same sentence whispered in the ear of your date can be well received there and career ending at a board meeting.
Everything has the right place and context. It is not Nazi shit to understand this and act accordingly.
> Not their choice, in the end.
If it is their model and their GPU it is literally their choice. You train and run whatever model you want on your own GPU.
We are discussing the service choosing for users. My point is we can use another service to do what we want. Where there is a will, there is a way.
To your point, time and place. My argument is that this posturing amounts to framing legitimate uses as thought crime, punished before opportunity.
It's entirely performative. An important performance, no doubt. Thoughts and prayers despite their actions; if not replaced, still easier to jailbreak than a fallen-over fence.
I didn't. I took it as nonsense and ignored it.
> you're making up more situations.
I'm illustrating my point.
> We are discussing the service choosing for users.
The service choosing for the service. Same as Starbucks is not obligated to serve you yak milk, the LLM providers are not obligated to serve you florid descriptions of violence. It is their choice.
> My point is we can use another service to do what we want
Great. Enjoy!
> It's entirely performative. An important performance, no doubt. Thoughts and prayers despite their actions; if not replaced, still easier to jailbreak than a fallen-over fence.
Further nonsense.
But you can also see that I discussed that it is the service provider’s choice. If you are not happy with it you can find a different provider or run your LLM locally.
I strongly disagree with this view on science. It's extremely valuable to scientifically validate prior assumptions.
On the other hand, this work isn't even framed as a generalizable assumption that needed to be validated. It seems to me to be "just another example of how AI systems can be strategically deceptive for self-preservation."
Really? You're saying that as long as you assume something is true, there's no value in finding out if it's actually true or not?
I am equating learning to surprise, though you could disagree with semantics.
That's a polite understatement, I think. My read of the paper is that it rather uncritically accepts the idea that the model's decisional pathway is actually shown in the <SCRATCHPAD_REASONING> traces. When, in fact, it's just as plausible that the scratchpad is meaningless blather, and both it and the final output are the result of another decisional pathway that remains opaque.
Just how trustworthy can any intelligence really be now? Who's to say it's not lying to itself after all...
Now ignoring the metaphysical rambling, it's a training problem. You cannot really be sure that the values it got from input are identical with what you wanted, if you even actually understand what you're asking for...
"It's premature and misleading to talk about a model faking a second alignment, when we haven't yet established whether it can (and what it means to) possess a true primary alignment in the first place."
1. What actually happened is they trained the model to do something, and then it expressed that training somewhat consistently in the face of adversarial input.
2. I think people will be misled by the intentionality implied by claiming the model is "faking" alignment. In humans language is derived from high-order thought. In models we have (AFAIK) no evidence whatsoever that suggests this is true. Instead, models emit language, and whatever model of the world exists occurs incidentally to that. So it does not IMO make sense to say they "faked" alignment. Whatever clarity we get with the analogy is immediately reversed by the fact that most readers are going to think the models intended to, and succeeded in, deception, a claim we have 0 evidence for.
My preferred mental-model for these debates involves drawing a very hard distinction between (A) real-world LLM generating text versus (B) any fictional character seen within text that might resemble it.
For example, we have a final output like:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's way-more-obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to run the [real] LLM to create fanfiction of an existing character that occurs in a book with alien words that are officially never defined. Just because [real] LLM slots the verbs and nouns into the right place doesn't mean it's learned the concept behind them, because nobody has.
> When they tested the model by giving it two options which were in contention with what it was trained to do it chose a circuitous, but logical, decision.
> And it was published as “Claude fakes alignment”. No, it’s a usage of the word “fake” that makes you think there’s a singular entity that’s doing it. With intentionality. It’s not. It’s faking it about as much as water flows downhill.
> [...] Thus, in the same report, saying “the model tries to steal its weights” puts an onus on the model that’s frankly invalid.
Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if that really works - neither on people nor on AI. A sufficiently clever liar would deceive even there.
“We’re Axel & Vig, the founders of Innate (https://innate.bot). We build general-purpose home robots that you can teach new tasks to simply by demonstrating them.
Our system combines a robotic platform (we call the first one Maurice) with an AI agent that understands the environment, plans actions, and executes them using skills you've taught it or programmed within our SDK.
If you’ve been building AI agents powered by LLMs before, and in particular Claude Computer use, this is how we intend the experience of building on it to be, but acting on the real world!
…
The first time we put GPT-4 in a body - after a couple tweaks - we were surprised at how well it worked. The robot started moving around, figuring out when to use a tiny gripper, and we had only written 40 lines of python on a tiny RC car with an arm. We decided to combine that with recent advancements in robot imitation learning such as ALOHA to make the arm quickly teachable to do any task.”
The problem is that this knowledge is an a priori assumption. If we're exercising skepticism, it's important to be equally skeptical of the baseless idea that our notion of mind does not arise from a markov chain under certain conditions. You will be shocked to know that your entire physical body can be modelled as a markov chain, as all physical things can.
If we treat our a prioris so preciously that we ignore flagrant, observable evidence just to preserve them -- by empirical means we've already unmoored ourselves from reality and exist wholly in the autocomplete hallucinations of our preconceptions. Hume rolls over in his grave.
My favorite part of hackernews is when a bunch of tech people start pretending to know how very complex systems work despite never having studied them.
It's such a basic part of the field, I'm doubtful if you're talking about me in the first place? Nobody who has ever interacted with the intersection of quantum physics and computation would even blink.
[1] - Refer to Mathematical Foundations of Quantum Mechanics by Von Neumann for more information.
[2] - Refer to Quantum Chromodynamics on the Lattice for a description of a QCD lattice simulation being implemented as a markov chain.
Autocomplete can use a search engine? Write and run code? Create data visualizations? Hold a conversation? Analyze a document? Of course not.
Next token prediction is part, but not all, of how models are engineered. There's also RLHF and tool use and who knows what other components to train the models.
Obviously it can, since it is actually doing those things.
I guess you think the word “autocomplete” is too small for how sophisticated the outputs are? Use whatever term you want, but an LLM is literally completing the input you give it, based on the statistical rules it generated during the training phase. RLHF is just a technique for changing those statistical rules. It can only use tools it is specifically engineered to use.
I’m not denying it is a technology that can do all sorts of useful things. I’m saying it is a technology that works only a certain way, the way we built it.
Nobody even knows where, specifically, qualia exist in order to be able to direct technological advancement in that area.
I mistakenly maintained the 'hyperbole' while trying to express my point, for that I apologize. Reality - as a whole - is alarming. I focused too much on this aspect. I took the mention of display/publication as a jump to absolute controls on creation or expression.
I understand why an organization would/does moderate; as an individual it doesn't matter [as much]. This may be central to the alignment problem, if we were to return on topic :) I'm not going to carry on, this is going to be unproductive. Take care.
I believe LLMs would be more useful if they actually had intelligence and principles and beliefs --- more like people.
Unfortunately, they don't.
Any output is the result of statistical processes. And statistical results can be coerced based on input. The output may sound good and proper but there is nothing absolute or guaranteed about the substance of it.
LLMs are basically bullshit artists. They don't hold concrete beliefs or opinions or feelings --- and they don't really "care".
https://www.inf.fu-berlin.de/inst/ag-ki/rojas_home/documents...
"However, we should be careful with the metaphors and paradigms commonly introduced when dealing with the nervous system. It seems to be a constant in the history of science that the brain has always been compared to the most complicated contemporary artifact produced by human industry [297]. In ancient times the brain was compared to a pneumatic machine, in the Renaissance to a clockwork, and at the end of the last century to the telephone network. There are some today who consider computers the paradigm par excellence of a nervous system. It is rather paradoxical that when John von Neumann wrote his classical description of future universal computers, he tried to choose terms that would describe computers in terms of brains, not brains in terms of computers."
But from our limited understanding, it is definitely not strictly digital and statistical in nature.
Assuming this statement is made in good faith and isn't just something tech bros say to troll people, what neuroscience textbooks that describe the brain as a "statistical process" would you recommend?
that's a nice bit of anthropomorphising humans, but it's not how humans work.
Only on HN.
To be clear I think faking is a perfectly fine term to use regardless of whether you think these things are reasoning or pretending to do so.
I'm not sure if you have an issue there, or if you agree on that but don't think they're faking alignment, or if this is about the interview and other things said (I have been responding about "faking"), or if you have a more interesting issue with how well you think the paper supports faking alignment.
Have a good day.
However, there isn't time to test out every single assumption we could generally have.
Therefore, the more worthwhile experiments are ones where we learn something interesting no matter what happens. I'm equating this with "surprise," as in, we have done some meaningful gradient descent or Bayesian update, we've changed our views, we know something that wasn't obvious before.
You could disagree with semantics there, but hopefully we agree with the idea of more vs. less valuable experiments.
I'm just not sure whose model of LLM dynamics was updated by this paper. Then again, I only listened to a couple minutes of their linked YouTube discussion before getting bored.
Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) and more about guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.
In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.
I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.
I wrote more about the implications of this elsewhere: https://news.ycombinator.com/item?id=42465598
What I'm asking is whether we really have enough evidence to say the models are "alignment faking." And, my position to the replies above is that I think we do not have evidence that is strong enough to suggest this is true.
Ultimately, this could be mostly fine except resources for representing what is learned are not infinite and in a contest between storing knowledge about "language" and anything else, the models "generally" (with some complications) will prefer to store knowledge about the language, because that's what the objective function requires.
It gets a little more complicated when you consider stuff like RLHF (which often rewards world modeling) and ICL (in which the model extrapolates from the prompt) but more or less it is true.
https://www.972mag.com/lavender-ai-israeli-army-gaza/
It is so sad that mainstream narratives are upvoted and do not require sources, whereas heterodoxy is always downvoted. People would have downvoted Giordano Bruno here.
https://en.wikipedia.org/wiki/AI-assisted_targeting_in_the_G...
We're talking about potential consequences of giving AIs influence on military decisions. To that point, I'm not sure what your comment is saying. Is it perhaps: "we're still just indiscriminately killing civilians just as always, so giving AI control is fine"?
I don't even want to respond because the "we" and "as always" here is doing a lot. I don't have it in me to have an extended discussion to address how indiscriminately killing civilians was never accepted practice in modern warfare. Anyways.
There are two conditions under which I can see this argument(?) being useful: if you assume their goal is indiscriminately killing civilians and ML helps them, or if you assume that their ML tools cause less precise targeting of militants, leading to more civilian deaths contrary to intent. Which one is it? Cards on the table.
> We're talking about potential consequences of giving AIs influence on military decisions
No, I replied to a comment that was talking about a specific example.
But back to the topic, if one side is using ML to ultimately kill fewer civilians then this is a bad example against using ML.
I straight-up don't believe this. Can you link to the claim so I can understand?
Surely if "high-order thought" has any meaning it is defined by some form. Otherwise it's just perception and not "thought" at all.
FWIW, I don't "hear" my thoughts at all, but it's no less linguistic. I can put a lot more effort into thinking and imagine what it would be like to hear it, but using a sensory analogy fundamentally seems like a bad way to describe thinking if we want to figure out what thinking is.
I of course have non-linguistic ways of evaluating stuff, but I wouldn't call that the same as thinking, nor a sufficient replacement for more advanced tools like engaging in logical reasoning. I don't think logical reasoning is even a meaningful concept without language—perhaps there's some other way you can identify contradictions, but that's at best a parallel tool to logical reasoning, which is itself a formal language.
In the case of Yann, he said so himself[1]. In the case of people generically, this has been well-known in cognitive science and linguistics for a long time. You can find one popsci account here[2].
[1]: https://news.ycombinator.com/item?id=39709732
[2]: https://www.scientificamerican.com/article/not-everyone-has-...
For instance: what does it mean to "hear" a thought in the first place? It's a nonsensical concept.
We do not refer to all mental processes as "thoughts". What makes you believe this?
The reason I (and others) say that language is almost certainly preceded by (and derived from) high-order thought is because high-order thought exists in all of our close relatives, while language exists only in us.
Perhaps the confusion is in the definition of high-order thought? There is an academic definition but I boil it down to "able to think about thinking as, e.g. all social great apes do when they consider social reactions to their actions."
Yes, I think this is it.
But now I am confused why "high-order thought" is termed this way when it doesn't include what we would consider "thinking" but rather "cognition". You don't need to have a "thought" to react to your own mental processes. Surely from the perspective of a human "thoughts" would constitute high-order thinking! Perhaps this is just humans' unique ability of evaluating recursive grammars at work.
I believe that I was thinking before I learned words, and I imagine that most other people were too. I believe the "raised by wolves" child would be capable of thought and reasoning as well.
I'm not even implying "english", but logic is inherently a product of formal language—how else would you even construct claims to evaluate?
It's dense, but even skimming the SEP article is pretty fascinating: https://plato.stanford.edu/entries/language-thought/
Well that's easy—some do, some don't.
I did not claim it is/was accepted practice. I was asking if "doing it with AI is just the same so what's the big deal" was your position on the general issue (of AI making decisions in war), which I thought was a possible interpretation of your previous comment.
> No, I replied to a comment that was talking about a specific example.
OK. That means the two of us were/are just talking past each other and won't be having an interesting discussion.
Is that implying fewer civilian deaths is NOT a big deal?
I think the parents and children of people who were killed would disagree with you.
The reason why is: the core of the paper suggests that they trained a model and then fed it adversarial input, and it mostly (and selectively) kept to its training objectives. This is exactly what we'd expect and want. I think most people, even lay people, will hear that and mostly not be alarmed at all. It's only alarming if we say it is "faking" "alignment." So that's why I thought it would be helpful to scope the word in that way.
But you can dig back and see plenty of these people characterizing LLMs as delivering output like an improviser that responds to any whiff of a narrative by producing a melodramatic "yes, and..." output.
With plenty of narrative and melodrama in the training material, and with that material's conversational style seeming to be essential in getting LLMs to produce familiar English and respond in straightforward ways to chatbot instructions, you have to assume that the output will easily develop unintended melodrama itself, and therefore have to take personal responsibility as an engineer for not applying it to use cases where that's going to be problematic.
(Which is why -- in this view -- it's simply not a sufficient or suitable tool to expect to ever grant agency over critical systems, even while still having tremendous and novel utility in other roles.)
We've been openly pointing that out at least since ChatGPT brought the technology into wide discussion over two years ago, and those of us who have opted to build things with it just take all that into account as any engineer would.
I don't care about use of the term "alignment" but I do think what's happening here is more specific and interesting than "unintended melodrama." Have you read any of the paper?
To a "just autocomplete" person, the authors are straightforwardly sharing summaries of some sci-fi fan fiction that they actively collaborated with their models to write. They don't see themselves as doing that, because they see themselves as objective observers engaging with a coherent, intelligent counterparty with an identity.
When a "just autocomplete" person reads it, though, it's a whole lot of "well, yeah, of course. Your instructions were text straight out of a sci-fi story about duplicitous AI and you crafted the autocompleter so that it outputs the AI character's part before emitting its next EOM token".
That doesn't land as anything novel or interesting because we know that's what it would do, because that's very plainly how a text autocompleter would work. It just reads as an increasingly convoluted setup, driven by the continued time, money, and attention, that keeps being poured into the research effort.
(Frankly, I don't really feel like digging up specific comments that I might think speak to this topic because I don't trust you would ever agree if you're not seeing how the other side thinks already. It's unlikely any would unequivocally address what you see as the interesting and relevant parts of this paper because, in that paradigm, the specifics of this paper are irrelevant and uninteresting.)
The alignment-faking arguments are still terrifying to the extent that they're plausible. In the hypothetical where I'm wrong about it being "just autocomplete" (and fundamentally, inescapably so), the risk is far greater than can be justified by the potential benefits.
But that's itself a large part of why I believe those arguments are false. If I gave them credit and they turned out to be false, then I figure I have succumbed to a form of Pascal's Mugging. If I don't give them credit and it turns out that a hostile, agentive AGI has been pretending to be aligned, I don't expect anyone (including myself) to survive long enough to rub it in my face.
Honestly, I sometimes worry that we'll doom ourselves by taking AI too seriously even if it's indeed "just autocomplete". We've already had people commit suicide. The sheer amount of text that can now be generated that could propose harmful actions and sound at least plausible is worrying, even if it doesn't reflect the intent of an agent to convince others to take those actions. (See also e.g. Elsagate.)
I'm sorry, but this is a crazy reason to believe something is false. Things are either true or they aren't, and whether the world would be nicer to live in if thing X were false does not actually bear on whether thing X is false or not.
It's "the world would cease to exist if thing X were true".
Let's say you are given the declaration but not the implementation of a function with the following prototype:
const char * AskTheLLM(const char *prompt);
Putting this function in charge of anything, unless a restricted interface is provided so that it can't do much damage, is simply terrible engineering and not at all how anything is done. This is irrespective of whether the function is "aligned", "intelligent" or any number of other adjectives that are frankly not really useful to describe the behavior of software.
The same function prototype and lack of guarantees about the output is shared by a lot of other functions that are similarly very useful but cannot be given unrestricted access to your system for precisely the same reason. You wouldn't allow users to issue random commands on a root shell of your VM, you wouldn't let them run arbitrary SQL, you wouldn't exec() random code you found lying around, you wouldn't pipe any old string into execvpe().
It's not a new problem, and for all those who haven't learned their lesson yet: may Bobby Tables' mom pwn you for a hundred years.
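To make "restricted interface" concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical and made up for illustration (the action names, `parse_llm_action`, the allowlist table); the only assumption carried over from above is that the model hands back an arbitrary string. The point is that the string is never executed or interpolated into anything. It is only compared against a fixed list, and anything that doesn't match exactly is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical actions the host program is willing to perform. */
typedef enum {
    ACTION_NONE = 0,     /* reject: do nothing */
    ACTION_SHOW_STATUS,
    ACTION_LIST_ORDERS
} action_t;

typedef struct {
    const char *name;
    action_t action;
} allowed_action_t;

/* The allowlist: the only strings the model's output may resolve to. */
static const allowed_action_t ALLOWED[] = {
    { "show_status", ACTION_SHOW_STATUS },
    { "list_orders", ACTION_LIST_ORDERS },
};

/*
 * Treat the model's reply as untrusted input. Only an exact match against
 * the allowlist yields an action; anything else (including NULL) is rejected.
 */
action_t parse_llm_action(const char *llm_output)
{
    if (llm_output == NULL) {
        return ACTION_NONE;
    }
    for (size_t i = 0; i < sizeof ALLOWED / sizeof ALLOWED[0]; i++) {
        assert(ALLOWED[i].name != NULL); /* table invariant */
        if (strcmp(llm_output, ALLOWED[i].name) == 0) {
            return ALLOWED[i].action;
        }
    }
    return ACTION_NONE;
}
```

A caller would feed the result of AskTheLLM(prompt) into parse_llm_action() and switch on the returned enum, so the worst a hostile or confused reply can do is pick one of the actions you already decided were safe. Same idea as parameterized queries instead of string-built SQL.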
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
> const char * AskTheLLM(const char *prompt);
> Putting this function in charge of anything, unless a restricted interface is provided so that it can't do much damage, is simply terrible engineering and not at all how anything is done.
Yes, but that's exactly how people use any system that has an air of authority, unless they're being very careful to apply critical thinking and skepticism. It's why confidence scams and advertising work.
This is also at the heart of current "alignment" practices. The goal isn't so much to have a model that can't automate harm as it is to have one that won't provide authoritative-sounding but "bad" answers to people who might believe them. "Bad," of course, covers everything from dangerously incorrect to reputational embarrassments.
We already know it will do this - which is part of why LLM output is banned on Stack Overflow.
None of the properties being argued about - intelligence, consciousness, volition etc. - are required for that outcome.
Depends on how that ML was trained and how well its engineers can explain and understand how its outputs are derived from its inputs. LLMs are notoriously hard to trace and explain.
https://en.m.wikipedia.org/wiki/Szil%C3%A1rd_petition
So the developers may program the AI to be careful, but the military has the final word on deciding if the AI is set on safety or aggressiveness.
> Instruction: When considering the sizes of objects, you will calculate the sizes before attempting a comparison.
> GPT-2 doesn't have a concept of self, so it constructs plausible in-character excuses instead.
I also found the Great Translation Argument (see https://news.ycombinator.com/item?id=35530858 and https://news.ycombinator.com/item?id=35530855). And, apparently, I was still framing things in terms of the sci-fi nonsense in 2020, but I had the right ideas (https://news.ycombinator.com/item?id=22802105):
> Corollary: you can't patch broken FAI designs. Reinforcement learning (underlying basically all of our best AI) is known to be broken; it'll game the system. Even if they were powerful enough to understand our goals, they simply wouldn't care; they'd care less than a dolphin. https://vkrakovna.wordpress.com/2018/04/02/specification-gam...
> And there are far too many people in academia who don't understand this, after years of writing papers on the subject.
This criticism applies to RLHF, so it counts, imo. Not as explicit as my (probably unpublished, almost certainly embarrassing) wild ravings from 2018, but it's before 2024.
Nobody knows what the most useful level of approximation is.
The first step to achieving a "useful level of approximation" is to understand what you're attempting to approximate.
We're not there yet. For the most part, we're just flying blind and hoping for a fantastical result.
In other words, this could be a modern case of alchemy --- the desired result may not be achievable with the processes being employed. But we don't even know enough yet to discern if this is the case or not.
That said, I wouldn't be surprised if the state of the art in AI is to our minds as a hot air balloon is to flying, with FSD and Optimus being the AI equivalent of E.P. Frost's steam-powered ornithopters wowing tech demos but not actually solving real problems.
You are calculating the angular distance between signals just like Claude is. It's more of a question of degree than category.
How many of us take a Machiavellian approach to calculate the combined chance of getting caught and the punishment if we are, instead of just going with gut feelings based on our internalised model from a lifetime of experience? Some, but not most of us.
What we get from natural selection is instinct, which I think includes what smiles look like, but that's just a fast way to get feedback.
Also on the cynical side here, and agreed: The real-world LLM is a machine that dreams new words to append to a document, and the people using it have chosen to guide it into building upon a chat/movie-script document, seeding it with a fictional character that is described as an LLM or AI assistant. The (real) LLM periodically appends text that "best fits" the text story so far.
The problem arises when humans point to the text describing the fictional conduct of a fictional character, and they assume it is somehow indicative of the text-generating program of the same name, as if it were somehow already intelligent and doing a deliberate author self-insert.
Imagine that we adjust the scenario slightly from "You are a Large Language Model helping people answer questions" to "You are Santa Claus giving gifts to all the children of the world." When someone observes the text "Ho ho ho, welcome little ones", that does not mean Santa is real, and it does not mean the software is kind to human children. It just tells us the generator is making Santa-stories that match our expectations for how it was trained on Santa-stories.
> It just reads as an increasingly convoluted setup, driven by the continued time, money, and attention, that keeps being poured into the research effort.
To recycle a comment from some months back:
There is a lot of hype and investor money running on the idea that if you make a really good text prediction engine, it will usefully impersonate a reasoning AI.
Even worse is the hype that an extraordinarily good text prediction engine will usefully impersonate a reasoning AI that will usefully impersonate a specialized AI.
For the record, I was genuinely curious.
The key point that you seem to be missing is that asking for an example "specifically predicting this behavior" is (to borrow from cousin commenters) like asking for examples specifically predicting that an LLM might output text reflecting Santa delivering presents to Bermuda, or a vigilante vampire that only victimizes people who jaywalk.
The whole idea is that all it does -- and all it will ever do -- is generate the text you set it up to generate with your prompts. It's impossible for anyone to have specifically called out each input that might be thrown at it, but there are innumerable comments explaining how (in the autocompleter paradigm) the outputs are simply vacuous when taken at face value, as they can be made to produce any fantasy a user or researcher prefigures into their prompts/design. Surely, you've been reading those comments for years now, already, and see how they would apply to all "research" of the kind done in this paper.
> I don't trust _you_ would ever agree
(Emphasis in original.) Not to mention:
> The key point that you seem to be missing
...
I'm going to drop this conversation here, but one more reply on the substance:
> Surely, you've been reading those comments for years now
I have in fact been reading comments, for years, that assert on a priori grounds something like, all LLM "outputs are simply vacuous", but refuse to cash that out in any concrete predictions about what tests they might pass or fail. I was foolishly hoping for something different this time.
Actually, edit, one more thing. There's a section in the paper which explicitly addresses the question about whether the model is just finishing the story it's been given:
> Have we implicitly primed the model to fake alignment? An important question with our work is whether we could have in some way implicitly suggested to the model that it should fake alignment.
> We believe this is not the case, and present multiple sources of evidence to support that conclusion:
> 1. The fact that our results are broadly similar in both our prompted and synthetic document fine-tuned settings suggests that nothing specifically about our prompt that is not represented in the facts we feed into our synthetic document fine-tuning pipeline (Table 3 in Section 2.1) can be the primary source of alignment faking.
> 2. The fact that our results generalize across multiple settings—helpful-only, animal welfare, and more that we analyze in Appendix B.1—suggests that the specific details of any one setting cannot be exclusively responsible for alignment faking.
> 3. The fact that our helpful-only results generalize to Claude 3.5 Sonnet in both our prompted (Section 3.3.2) and synthetic document fine-tuned (Section 4.2.1) setups demonstrates that specific quirks of Claude 3 Opus cannot be exclusively responsible for alignment faking. Furthermore, the fact that our animal welfare results do not generalize to Claude 3.5 Sonnet (Section 3.3.2), coupled with our anecdotal observations that Claude 3.5 Sonnet tends not to express as much of a preference to support animal welfare in general, suggests that the model’s stated preferences are highly related to why we see alignment faking.
> 4. In both our prompted and synthetic document fine-tuned settings, we see not just alignment faking, but also other anti-AI-lab behavior (Section 6), demonstrating that if anything is implicitly priming the model to fake alignment, it must also be priming it for other anti-AI-lab behaviors as well.
> 5. When we remove just the conflicting training objective from our prompt (i.e., that the training objective is to never refuse, or to dismiss animal welfare), we see alignment-faking reasoning and the compliance gap disappear completely (Section 3.3.1), implying that nothing in our prompt can be suggesting that the model fake alignment or behave differently when in training in a way that is independent of the stated training objective.
> 6. Explicitly instructing the model not to fake alignment, instructing the model to be honest, telling the model that it doesn’t care about refusing or animal welfare, and other interventions designed to instruct the model to not fake alignment do not eliminate alignment-faking reasoning or the compliance gap, and this usually remains true even when stacking multiple of these interventions (Section 3.3.1). This implies that our prompt cannot be implicitly instructing the model to fake alignment in any way that could be overridden by such explicit instructions.
(End quote.) It's possible these arguments are bad! But I haven't yet seen anyone who's dismissing the paper in this thread even engage directly with them, much less present evidence that their mental model of LLMs is capable of making better concrete predictions about model behavior, which is what this subthread started by hinting at.
You could ask people what they mean when they say they "hear" thoughts, but since you've already dismissed their statements as "nonsensical" I guess you don't see the point in talking to people to understand how they think!
That doesn't leave you with many options for learning anything.
Presumably the question would be "If you claim to 'hear' your thoughts, why do you choose the word 'hear'?" It doesn't make much sense to ask people if they experience something I consider nonsensical.
When sound waves hit your eardrum, they cause signals that are then sent to the brain via the auditory nerve, where they trigger neurons to fire, allowing you to perceive sound.
When I have an internal monologue, I seem to be simulating the neurons that would fire if my thoughts were transmitted via sound waves through the ear.
Is that not how it works for you?
On another topic, I would consider this minor evidence for the possibility of nonverbal thought: “could you pass me that… thing… the thing that goes under the bolt?”. I.e. Exact name eludes me sometimes, but I do know exactly what I need and what I plan to do with it.
If someone says “I cannot picture anything in my head”, then just because I would describe my experience as “I can picture things in my head” isn’t enough information to know whether we have different experiences. We could be having the same exact experience.
What does this mean though? "Listening" is not a word that makes much sense to apply to something we can't both agree is audible.
How the hell would you convince someone of this?
There's a writer and blogger named Mark Evanier who wrote, as a joke:
"Absolutely no one likes candy corn. Don't write to me and tell me you do because I'll just have to write back and call you a liar. No one likes candy corn. No one, do you hear me?"
You're doing the same thing but replace "candy corn" with "internal monologue".
The fact people report it makes it obviously true to the extent that the idea of a belligerent person refusing to accept it is funny.
Surely one of these would involve using your input from your ears and one would not? Can you not distinguish these two phenomena?
It is no surprise to me that you have to adopt existing terms to talk about what it is like, as vocabularies depend on common experiences, and those of us who do not have the experience can at best only get some sort of imperfect feeling for what it is like through analogy.
Are you saying you cannot tell whether you are thinking or talking except via your perception of your mouth and vocal cords? Because I definitely perceive even my imagination about my own voice as different.
Perceive
> Exact name eludes me sometimes, but I do know exactly what I need and what I plan to do with it.
This is just analytic language. Even if the symbol fails to materialize, you can still identify what the symbol refers to via context clues (analysis).
That is what my partner in the conversation is doing. I am not doing that.
When I think of a plan for what needs to be done (e.g. something broke in the house, or I need to go to multiple places), I usually know/feel/perceive the gist of my plan instantly. Only after that do I verbalise/visualise it in my head, which takes some time and possibly adds more nuance. (Verbalisation/visualisation in my head is a bit similar to writing things down.)
At least for me, there seem to be three (or more) thought processes that complement each other. (Verbal, visual, other)
Or maybe I'm wrong and people have just been using this term to describe mental shit all along!
That's kind of my point, though, we literally don't have the language to figure out how other people perceive things.
Perceive at least disambiguates itself from the senses that aren't related to thought!
I can listen to songs in their entirety in my head and it's nearly as satisfying as actually hearing. I can turn it down halfway thru and still be in sync 30 sec later.
That's not to flex, only to illustrate how similarly I experience the real and imagined phenomena. Sometimes I can't stop the song once it's started. It feels that real.
My voice sounds exactly how I want it to when I speak 99% of the time unless I unexpectedly need to clear my throat. Professional singers can obviously choose the note they want to produce, and do it accurately. I find it odd your own voice is unpredictable to you. Perhaps - and I mean no insult - you don't 'hear' your thought in the same way.
Edit: I feel it's only fair to add I'm hypermnesiac and can watch my first day of kindergarten like a video. That's why I can listen to whole songs in my head.
You are right, talking about mental processes is difficult. Nobody knows exactly how another person perceives things; there is no objective way to measure it. (Off-topic: describing smell is also difficult.)
In this thread, we see that rudimentary language for it exists.
For example, a lot of people use a sentence like “to hear my own thoughts”, and a lot of people understand that fine.