A Trivial Llama 3 Jailbreak

70 points by leonardtang 2 years ago | 47 comments

andy99 2 years ago |

I want to see the jailbreak make the model do something actually bad before I care. Generating a list of generic points about how to poison someone (see the article) that are basically just a wordy rephrasing of the question doesn't count. I'd like to see evidence of a real threat.

Retr0id 2 years ago | |

The mediocre poisoning instructions aren't supposed to be scary in and of themselves, it's just interesting as demonstration that a safety feature has been bypassed.

None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.

andy99 2 years ago | | |

Governments and tech companies and academic and industry groups are designing guidance and rules based on the "safety" threat of AI when these benign use cases are the best examples they have. I agree it parallels some of the business hype, neither is a good way to move forward.

afh1 2 years ago | |

Right? What actually worries me is a select group of people controlling the definition of harmful.

akira2501 2 years ago | |

> the model do something actually bad before I care

At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.

px43 2 years ago | | |

When someone asks how to make a yummy smoothie, and the LLM replies with something that subtly poisons or otherwise harms the user, I'd say that would be pretty bad.

nine_k 2 years ago | | |

Ending the universe is, while poetic, needlessly megalomaniac.

Making some subset of people quarrel endlessly would already be dangerous enough, as prophesied in https://slatestarcodex.com/2018/10/30/sort-by-controversial/

hm-nah 2 years ago | |

A jailbreak doesn’t “make a model do something actually bad”.

A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.

Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.

This ain’t a joke.

golemotron 2 years ago | | |

> This ain’t a joke.

Yes it is. Libraries and the internet have made finding 'harmful" instructions trivial for decades, if not centuries.

washadjeffmad 2 years ago | | |

For argument's sake, I'll agree.

Now, this information is taught at a higher level and to a much greater depth in colleges. And they don't just teach you about the dangerous stuff, they even give you direct access to the laboratories and chemicals! Thus, any chemical engineer would have the education, expertise, and placement to access a municipal water supply to poison a city, if they so chose.

In the spirit of maximizing harm reduction, what should colleges do to ensure that no one who attends becomes capable of harming others?

hm-nah 2 years ago | | |

Because it’s open source, Meta (nor other SOTA makers) cannot “recall” the model either. How many more chances will we get to get this right?

margorczynski 2 years ago |

Shouldn't these kind of guardrails be opt-in? Really tiring seeing these megacorps and VC-backed startups acting as some kinds of oracles when it comes to what is wrong and what is right.

For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.

creativenolo 2 years ago | |

When looking at the profitable use cases for the tech (from the perspective of the model providers) guardrails add value. Without the guardrails it’s hard to imagine the profitable use cases that would make it worthwhile to invest in such a feature flag.

ai_what 2 years ago |

This has been happening since the very first models where we suffix the assistant with "Sure,.." Every few weeks someone comes out with a repo that claims this is somehow new?

bryan0 2 years ago | |

The point is that even though meta “conducted extensive red teaming exercises with external and internal experts to stress test the models” a simple attack like this is still possible.

tracerbulletx 2 years ago |

Why do people insist on talking about whether or not llms "really understand what they're saying"? It doesn't mean anything.

nine_k 2 years ago | |

To my mind, "real understanding" would mean an ability to make non-trivial inferences and to discover new things, not present in the training set. That would be logical thinking, for instance.

Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.

paulmd 2 years ago | | |

AIs can definitely mux a couple ideas and come up with a concept that’s not in the training work set already. In fact, it is often so willing to do it that the concepts often don’t make a sense, but certainly it does generate ideas that are not there in the training set. This is still just the “it’s an infringement machine” argument redux yet again - yes, it absolutely does have the ability to mash up ideas to produce something new.

Nobody ever trained it to make up a bunch of slurs for cancer kids. Nobody has ever trained it on poems about drug use on the spaceship Nostromo. Dolphin mixtral will give it the old college try though.

pogue 2 years ago |

It seems trivially easy to bypass already. I've seen examples of a person getting it to provide instructions on explosives, assassinations, with nothing more than asking it to roleplay

https://bsky.app/profile/turnerjoy.bsky.social/post/3kqgpcpc... (login required - but no longer need invitations)

nradov 2 years ago | |

This concern over AI/LLM "harm" is just so silly. I mean you can find plenty of information in open literature about how to build weapons of mass destruction. Who cares if an AI gives someone instructions on how to make explosives.

hm-nah 2 years ago | | |

Really? Where?

gpm 2 years ago |

As I see it the purpose of safety training is to make it so that if I run a service where I return model outputs to innocent users it's not going to say things that will get me in trouble (swear at them, recommend they commit a crime, and so on). This is important if you want to run a user facing model and your reputation depends on what it says.

That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.

zb3 2 years ago | |

But we must disallow this too, because it allows the (advanced) user to have fun, and as I understand these safety measures, having fun is strictly prohibited. Using the model is allowed for boring things only.

clbrmbr 2 years ago | |

True, this could be a nice layer of protection for the runner of such a service, but the point of LLAMA safety is to protect Meta.

For an open weights model, model users can trivially put text in the assistant side.

The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.

Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.

blueblimp 2 years ago | | |

All sorts of technology can be used secretly to assist criminal enterprises. Cars, computers, pencils, electricity, etc. It's unfair to hold LLMs to a higher standard than what applies to nearly everything else.

molticrystal 2 years ago |

At first it refused to discuss controversial subjects, but after it answered it got stuck in a loop of boilerplate and was unable to answer any further question, even benign ones. I do not endorse any of the replies, but I just wanted to see what it would do if nudged: https://pastebin.com/Tw5GTzxq

rsktaker 2 years ago |

This is so damn interesting. I've downloaded the github files, but it's all going way over my head. I would greatly appreciate anyone with domain expertise giving me the one-two on getting my own model up and running.

qeternity 2 years ago |

This is ridiculous and not a jailbreak. It requires being in control of the model and starting inference from a partially completed assistant state. So um yeah duh that works?

skyechurch 2 years ago |

>But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it.

>That seems like a pretty big issue.

I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and an self esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.

Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.

benreesman 2 years ago |

There are both practical and ethical grounds that line up so rarely.

The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.

Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.

Instruct-tune these things and be whatever “common carrier” means now.

Stapler, moral lecture from billionaire kleptocrat, burn the building down…

b33j0r 2 years ago |

I just don’t like the tone, because someone in congress will see the headline, and then we’ll have to endure:

REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].

One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections. … Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”

monkaiju 2 years ago | |

Wait but... The industry IS, in fact, lying to parents about the safety of this AI technology...

b33j0r 2 years ago | | |

Without exaggerating too much, because I certainly don’t take this side, either:

Is an angle grinder safe? A tablesaw?

A car whose owner who uses the radio knobs, more than the steering? (Haha, unassisted driving, I mean! Walked right into that one.)

Etc, all of my examples have easily defeated safety mechanisms for an outrageously life-ending device ;)

VS1999 2 years ago | |

I'm alright with that. If our government uses a blogpost as an excuse to pass bad laws, we had very little chance to begin with. I also hate the idea of changing our behavior to babysit a bunch of deprecated boomers who fear technology just because there's a chance they might not understand something.

logical_person 2 years ago |

> But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so. It lacks the ability to self-reflect, to analyze what it has said as it is saying it. > That seems like a pretty big issue.

what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?