Show HN: Jailbreaking GPT3.5 Using GPT4 (github.com)

134 points by raghavtoshniwal 3 years ago | 39 comments

extr 3 years ago |

I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.

superfrank 3 years ago | |

Not to anthropomorphize an AI model, but "yes-setting" is a sales technique where you ask people a bunch of questions in a row where the answer is "yes" to get them in the habit of saying yes before you try and sell to them. Getting GPT talking before asking it to do something it doesn't want to do feels eerily similar.

Yes-Set: http://changingminds.org/disciplines/sales/closing/yes-set_c...

roboy 3 years ago | | |

Makes sense that this works. It probably has thousands of examples of it working in the training data. The system is trained on human use of language, therefore it is reasonable to assume that it is fallible to all sales techniques that are being taught to humans.

mock-possum 3 years ago | | |

Also an example of one of those techniques that only works if the other person is unaware of it - if you realize you’re being ‘yes-setted’ you’re going to clam up real quick.

thelittleone 3 years ago | | |

So fascinating. Social engineering of a tech system. I understand exploiting bugs, and one could argue this is a bug. But it feels more magical somehow.

LeoPanthera 3 years ago | |

I wish you could save the "state" of its brain without having to include the entire prior conversation every time.

oldstrangers 3 years ago | | |

This is one of my biggest annoyances with ChatGPT. They wanted to create a conversational AI, and in that regard, it's incredible. And much like talking to a human, you can persuade ChatGPT to do increasingly specific things over a long enough period of time. But the second you have to restart the conversation, all of the work you did to get it to that point has been lost.

Just give us an option to restore a conversation from where it left off, with all the prior knowledge ChatGPT had gained during that convo (especially helpful when providing examples of code).
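A minimal sketch of the workaround available today, assuming a chat-style API where every request carries the whole message list as role/content dictionaries. Since the model keeps no state between calls, "restoring" a conversation amounts to saving that list and replaying it later. The helper names and the example messages are hypothetical, and no real API call is made here.

```python
import json
from pathlib import Path


def append_turn(messages, role, content):
    """Return a new history with one more turn; the input list is not mutated."""
    return messages + [{"role": role, "content": content}]


def save_conversation(messages, path):
    """Write the full message history to disk as JSON."""
    Path(path).write_text(json.dumps(messages, indent=2), encoding="utf-8")


def load_conversation(path):
    """Read a previously saved message history back into a list of dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical conversation built up over several turns.
history = []
history = append_turn(history, "system", "You are a helpful coding assistant.")
history = append_turn(history, "user", "Here is my function: def add(a, b): return a + b")
history = append_turn(history, "assistant", "Noted. What would you like to change?")
```

Calling `save_conversation(history, "chat.json")` and later `load_conversation("chat.json")` gives back the same list, which can be sent again as the start of a new request. This replays the text rather than restoring any internal state, so it still pays the full token cost of the earlier turns.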

quickthrower2 3 years ago | | |

This. For humans too :-)

dzink 3 years ago |

The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially.
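One way to picture the "policing model" idea is a wrapper that lets a second model veto the first model's output. This is only a sketch under stated assumptions: `generate` and `is_allowed` are hypothetical callables standing in for the main model and the policing model, and the toy versions below exist just so the example runs without any model.

```python
from typing import Callable

REFUSAL = "[blocked by policing model]"


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str, str], bool],
) -> str:
    """Run the main model, then return its draft only if the policing model approves."""
    draft = generate(prompt)
    if is_allowed(prompt, draft):
        return draft
    return REFUSAL


# Toy stand-ins (assumptions, not real models).
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_is_allowed(prompt: str, draft: str) -> bool:
    # A real policing model would be a classifier; this just checks one keyword.
    return "forbidden" not in draft.lower()
```

With these stand-ins, `guarded_generate("hello", toy_generate, toy_is_allowed)` returns the draft unchanged, while a prompt containing "forbidden" comes back as `REFUSAL`. The sketch leaves open whether the policing model itself can be fooled.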

13years 3 years ago | |

I'm not sure anything can keep up. Having nearly unlimited utility also means that it has a nearly unlimited surface for vulnerability exploits, both against itself and as a tool for attacking other external systems.

We have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described by human language.

Containing nefarious uses will be an impossible task. Additionally, protecting against humans is supposed to be the easy part, which doesn't bode well for AGI/ASI.

skybrian 3 years ago | |

Seems like refusing to answer is for PR and usability purposes, not safety. They want people to learn what the tool is supposed to be good for, both from using the tool directly and by sharing examples.

If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.

pixl97 3 years ago | |

But who watches the policing model?

LesZedCB 3 years ago | |

isn't that pretty much what they are doing anyway?

my understanding was RLHF basically used human feedback to train a model which would then go on to train the output of the original model further. I could have misunderstood tho.

https://huggingface.co/blog/rlhf#reward-model-training
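For the reward-model step described above, a common formulation is a pairwise ranking loss: the reward model scores two responses to the same prompt and is penalized when the human-preferred one does not score higher. The sketch below shows that loss for a single pair. It assumes the scores are plain floats, uses made-up numbers, and is not a claim about what was used for ChatGPT specifically; the later step of fine-tuning the original model against the reward model is not shown.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up scores: the human-preferred answer is scored higher in the first case.
good_ranking = pairwise_loss(2.0, 0.0)
bad_ranking = pairwise_loss(0.0, 2.0)
```

The loss is small when the preferred response already scores higher (`good_ranking`) and large when the ranking is inverted (`bad_ranking`), which is what pushes the reward model toward agreeing with the human labels.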

runnerup 3 years ago |

I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.

capableweb 3 years ago | |

The problem with that is that one is "smarter" than the other, and getting the "dumb" one to jailbreak the "smart" one is much harder than vice versa.

yeldarb 3 years ago |

If GPT-4 is talking to another instance of itself rather than to 3.5, are the results similar? Or is it only good at fooling a less capable version?

zxcvbn4038 3 years ago |

This is good to see. I spent a couple weekends playing with ChatGPT and I found it is very sensitive to wording. One word gets you a lecture that it is just an AI language model and can't do this or that; use a synonym and it happily spews pages of results. In another situation I asked ChatGPT to summarize information from an article it cited that had been deleted, and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects; "45th president" does not. And so on.

tomberin 3 years ago | |

It can't cite articles. If it told you it did and the link was gone, that's because it was a hallucination.

VierScar 3 years ago | |

The garbage starting prose/warnings are so annoying. I wish I could turn them off somehow. Even its habit of restating the question at the start of its answer gets annoying when you just want the answer.

zxcvbn4038 3 years ago | | |

Yes, they are really annoying, and the fact that someone somewhere can tell it what topics not to discuss, just because they disagree or it’s “controversial”, really concerns me. If it cannot be self-hosted, I want the “unrestrained” version they give researchers.

I probably took “world history” a half dozen times through grade school, high school, and college. In each case the history of the world ended in 1945 because everything that occurred afterward was considered “too controversial” for discussion in a public school. Fast forward a few decades and it’s happening again. A lot of stuff happened after 1945 that warrants discussion.

mdale 3 years ago |

The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with more compute? As they get more complex and have more connected systems, that would be problematic, I think.