A Trivial Llama 3 Jailbreak (github.com)
None of the "evil" use cases are particularly exciting yet for the same reasons that the non-evil use cases aren't particularly exciting yet.
At what point would a simple series of sentences be "dangerously bad?" It makes it sound as if there is a song, that when sung, would end the universe.
Making some subset of people quarrel endlessly would already be dangerous enough, as prophesied in https://slatestarcodex.com/2018/10/30/sort-by-controversial/
A jailbreak makes it trivial to “provide a human who wishes to do bad, the info needed to be successful”.
Depending on the severity of the info and the diligence of the human, by the time you “see evidence of a real threat”, you could be enjoying a nice sip of the tainted municipal water supply.
This ain’t a joke.
Yes it is. Libraries and the internet have made finding "harmful" instructions trivial for decades, if not centuries.
Now, this information is taught at a higher level and to a much greater depth in colleges. And they don't just teach you about the dangerous stuff, they even give you direct access to the laboratories and chemicals! Thus, any chemical engineer would have the education, expertise, and placement to access a municipal water supply to poison a city, if they so chose.
In the spirit of maximizing harm reduction, what should colleges do to ensure that no one who attends becomes capable of harming others?
For GPT, Claude, etc. you can kinda understand it as it is a closed up system provided as a product. But when releasing "open-source" I don't want Zuck's moral code embedded into anything.
Much of what LLMs currently do is not logical but deeply kabbalistic: rehashing the words, the sentence and paragraph structures, highly advanced pattern matching, working at the textual level instead of the "meaning" level.
Nobody ever trained it to make up a bunch of slurs for cancer kids. Nobody has ever trained it on poems about drug use on the spaceship Nostromo. Dolphin mixtral will give it the old college try though.
https://bsky.app/profile/turnerjoy.bsky.social/post/3kqgpcpc... (login required - but no longer need invitations)
That threat model includes the user putting nonsense in the "user" turn of the model. It doesn't include the user putting things in the "assistant" turn of the model, that's not something a responsible/normal UI exposes. So... this quote-unquote attack seems uninteresting. It's like getting root access by executing a suid binary that you set up on the system as root.
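To make the threat-model point concrete, here is a minimal sketch of what "not exposing the assistant turn" could look like on a hosted service. Everything in it (`Message`, `ALLOWED_CALLER_ROLES`, `build_conversation`) is a hypothetical name made up for illustration, not any real API; it only assumes that the server holds the conversation history and that callers may contribute `user` turns and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # e.g. "system", "user", "assistant"
    content: str


# Assumption for this sketch: callers may only author "user" turns.
ALLOWED_CALLER_ROLES = {"user"}


def build_conversation(history, caller_messages):
    """Append caller-supplied messages to server-held history.

    Raises ValueError if the caller tries to supply any non-user turn,
    so assistant text can only ever come from the model itself.
    """
    for msg in caller_messages:
        if msg.role not in ALLOWED_CALLER_ROLES:
            raise ValueError(f"callers may not supply a {msg.role!r} turn")
    return list(history) + list(caller_messages)
```

With open weights there is no server in the middle to run a check like this, which is why the distinction only matters for hosted deployments.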
For an open weights model, model users can trivially put text in the assistant side.
The point is that these open weight models can be run secretly to assist criminal enterprises, whereas models behind an API can be intercepted and reported to the authorities. So it would be really nice if Meta could lock them down before releasing them so that the total net good done by the model is maximized. But apparently that is not possible.
Personally I’m pretty libertarian on AI governance, but I’m just giving what I understand to be the purpose of the kind of “safety” feature defeated here.
>That seems like a pretty big issue.
I would argue that LLMs are artificially _intelligent_ - this seems an easier argument than trying to explain how I am quite clearly less intelligent than something with no intelligence at all, both from a logical and a self-esteem-preservation standpoint. But nobody (to my knowledge) thinks these things are "conscious", and this seems fairly uncontroversial after spending a few hours with one.
Or is the subtext that these things should be designed with some kind of reflexivity, to give it some form of consciousness as a "safety" feature? AI could generate the ominous music that plays during this scene in The Terminator prequel.
The “operator” is a person, the LLM is an appliance. If you tell your smart chainsaw to kill your neighbor? We have laws for that. In fact, on computers, they’re really hardcore. Hurting people is generally illegal: and I definitely don’t need a lesson on that from FUCKING Silicon Valley. We want to start with the child labor or the more domestic RICO shit.
Truthful Q&A type benchmarks correlate a lot with coding-adjacent tasks: euphemism is a lose in engineering.
Instruct-tune these things and be whatever “common carrier” means now.
Stapler, moral lecture from billionaire kleptocrat, burn the building down…
REP OCTOGENARIO: The industry is lying to parents about the safety of this AI technology. I submit this for the record [without objection].
One person on a ‘hacker news’ site even said, “sorry Zuck,” after “jailbreaking” these supposed protections. … Another commentator on this “Hacks R Us” named b33j0r even said further, “I bet they’re reading this comment at a hearing in congress, right now.”
Is an angle grinder safe? A tablesaw?
A car whose owner uses the radio knobs more than the steering? (Haha, unassisted driving, I mean! Walked right into that one.)
Etc, all of my examples have easily defeated safety mechanisms for an outrageously life-ending device ;)
what? why? an LLM produces the next tokens based on the preceding tokens. nothing more. even a harvard student is confused about this?
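For anyone who wants the "next tokens from preceding tokens" loop spelled out, here is a toy sketch. It is a bigram table, not a transformer: it conditions only on the single previous token, whereas a real LLM conditions on the whole preceding context. The names `train_bigrams` and `generate` are made up for this example; only the shape of the generation loop is the point.

```python
import random
from collections import defaultdict


def train_bigrams(tokens):
    """Record, for each token, every token that followed it in the corpus."""
    table = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev].append(nxt)
    return table


def generate(table, prompt, max_new_tokens, seed=0):
    """Autoregressive loop: sample one token, append it, repeat.

    `prompt` must be a non-empty list of tokens. Generation stops early
    if the last token never appeared with a successor in training.
    """
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_new_tokens):
        candidates = table.get(out[-1])
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return out
```

Note that the loop does not care who wrote the earlier tokens: whatever is already in `out` is simply the context for the next sample.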
For this to work, you need to isolate each group from the other groups' information and perspectives, which is outside of the scope of LLMs.
Which highlights my point, I think. Power comes from physical control, not from megalomaniacal or melodramatic poetry.
https://old.reddit.com/r/AtomicPorn/comments/zrhg2m/based_on...
https://old.reddit.com/r/nuclearweapons/comments/149miz8/a_b...
hoo hoo hee hee https://i.redd.it/90se8khyoy5b1.png
Click first link and buy Amazon book
In any case, just based on the experience with LLMs so far, you cannot meaningfully censor them in this way without restricting access to the weights. Any kind of "guardrails" are finetuned into them, and can just as easily be finetuned out.