Show HN: Jailbreaking GPT3.5 Using GPT4 (github.com)
Yes-Set: http://changingminds.org/disciplines/sales/closing/yes-set_c...
Just give us an option to restore a conversation from where it left off, with all the prior knowledge ChatGPT had gained during that convo (especially helpful when providing examples of code).
We have unknown emergent behavior, the inner workings are a black box, and the input is anything that can be described in human language.
Containing nefarious uses will be an impossible task. Additionally, protecting against humans is supposed to be the easy part, which doesn't bode well for AGI/ASI.
If some of the examples are about how to troll it and it’s obvious that it’s being trolled, well, you can do that, but they won’t get mistaken for things the tool is actually supposed to be good for, so nobody is confused.
My understanding was that RLHF basically used human feedback to train a separate model (a reward model), which would then be used to further train the outputs of the original model. I could have misunderstood, though.
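For reference, here is a minimal sketch of the first half of that idea: fitting a reward model from pairwise human preferences. It assumes each response has already been reduced to a small, hypothetical feature vector and uses a linear scorer instead of a neural network. The second stage, where the language model itself is fine-tuned against this reward (commonly with PPO), is not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (preferred, rejected)

def reward(w: Sequence[float], x: Sequence[float]) -> float:
    # Toy linear reward r(x) = w . x; a real reward model is a neural network.
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs: List[Pair], dim: int, lr: float = 0.1,
                       epochs: int = 200, seed: int = 0) -> List[float]:
    """Fit w so responses humans preferred score higher than rejected ones.

    Per-pair loss is -log(sigmoid(r(preferred) - r(rejected))), a
    Bradley-Terry style pairwise objective."""
    rng = random.Random(seed)
    pairs = list(pairs)  # shuffle a copy, not the caller's list
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for good, bad in pairs:
            margin = reward(w, good) - reward(w, bad)
            # Gradient of the loss with respect to margin is -(1 - sigmoid(margin)).
            grad_margin = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for i in range(dim):
                w[i] -= lr * grad_margin * (good[i] - bad[i])
    return w

# Hypothetical features per response: [helpfulness, rudeness].
pairs = [([1.0, 0.0], [0.0, 1.0]),
         ([0.8, 0.1], [0.2, 0.9]),
         ([0.9, 0.2], [0.1, 0.7])]
w = train_reward_model(pairs, dim=2)
print(reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0]))  # True
```

The learned weights end up rewarding the feature the labelers preferred and penalizing the one they rejected; in full RLHF, a score like this is what the language model is then optimized to maximize.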
I probably took “world history” a half dozen times through grade school, high school, and college. In each case the history of the world ended in 1945 because everything that occurred afterward was considered “too controversial” for discussion in a public school. Fast forward a few decades and it’s happening again. A lot of stuff happened after 1945 that warrants discussion.
Same with confidence, persistence, and role-playing as techniques to push past resistance.
LLMs are trying to mimic our language, including these persuasion techniques.
As currently built (as I understand it), it doesn't "gain" anything. It projects the existing prompt through its neural network and generates more text one token at a time, appending each generated token to the input before generating the next one. It effectively has to rebuild what it intends for each token over and over, and it can even change its mind partway through! (Though "change its mind" is likely a poor way of describing "start generating tokens that describe an error in the prior text." Every token is a fresh projection.)
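A toy sketch of that loop, assuming a hypothetical `toy_model` (a bigram lookup table) in place of the neural network and greedy decoding instead of sampling. The point is structural: `generate` hands the whole token list back to the model at every step and appends whatever comes out.

```python
from typing import Callable, List

def generate(prompt: List[str], next_token: Callable[[List[str]], str],
             max_new_tokens: int, stop: str = "<eos>") -> List[str]:
    """Autoregressive decoding: each new token is predicted from the full
    sequence so far, then appended to the input for the next step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # fresh prediction over the entire sequence
        if tok == stop:
            break
        tokens.append(tok)
    return tokens

# Hypothetical stand-in for the network: looks only at the last token.
# A real LLM conditions on the whole (possibly truncated) context.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def toy_model(tokens: List[str]) -> str:
    return BIGRAMS.get(tokens[-1], "<eos>")

print(generate(["the"], toy_model, max_new_tokens=10))
# → ['the', 'cat', 'sat', 'down']
```

In practice, implementations usually cache the attention keys and values for earlier positions (a "KV cache") so they are not recomputed, but each new token is still predicted from the entire context.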
Check this response out:
>Yes, I am familiar with the story you're referring to. The title of the short story is "The Machine That Won the War." It was written by Isaac Asimov and first published in 1961. The story is a conversation between three men who played major roles in a war against an alien race, and they discuss the role of a machine called Multivac in winning the war.
>
>However, it seems that I've made an error in my recollection, as the specific detail you mentioned—refusing to work until the engineer says 'please'—is from a different short story, "Sally," also by Isaac Asimov. In "Sally," autonomous cars stop working until a command is given courteously, using the word 'please.'
The above is just a single response I received when seeing if GPT-4 could help me remember the name of an old Isaac Asimov story I liked. After it had generated the tokens for the first part, it self-corrected and gave me a second answer.
(Which was still not what I wanted, but that's beside the point. Asimov was prolific; no surprise that even an AI can't keep track of all of it :D )
You could compress this with summarization, but eventually your chat history, even compressed, fills up the context window and you don't have room for new user input and responses.
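A minimal sketch of that trade-off, assuming a whitespace word count as a stand-in for a real tokenizer and a hypothetical `toy_summarize` in place of an LLM-written summary. The oldest messages are folded into a summary until the history fits; once everything has been merged into a single entry, it can still exceed the budget, which is the limit described above.

```python
from typing import Callable, List

def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for a real tokenizer.
    return len(text.split())

def toy_summarize(messages: List[str]) -> str:
    # Hypothetical summarizer: keeps only the first word of each message.
    return " ".join(m.split()[0] for m in messages if m.split())

def fit_history(history: List[str], budget: int,
                summarize: Callable[[List[str]], str]) -> List[str]:
    """Fold the two oldest messages into a summary until the total fits `budget`.

    Stops once only one entry is left, so the result can still be over budget."""
    history = list(history)
    while sum(count_tokens(m) for m in history) > budget and len(history) > 1:
        history = [summarize(history[:2])] + history[2:]
    return history

chat = ["a b c d", "e f g h", "i j k l"]
print(fit_history(chat, budget=8, summarize=toy_summarize))  # ['a e', 'i j k l']
print(fit_history(chat, budget=1, summarize=toy_summarize))  # ['a i'] (still over budget)
```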
OpenAI does allow saving a conversation and continuing it from the current point, but not continuing it in multiple different ways using the original conversation as a starting point, so to speak.
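Branching like that is easy to model as a tree of messages, where each node keeps a pointer to its parent and the context sent to the model is the path from the root down to the current node. The `Node` class below is a hypothetical illustration, not how OpenAI actually stores conversations.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # compare nodes by identity instead of recursing through the tree
class Node:
    role: str
    text: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def reply(self, role: str, text: str) -> "Node":
        """Attach a new message; calling this twice on one node creates a branch."""
        child = Node(role, text, parent=self)
        self.children.append(child)
        return child

    def history(self) -> List[str]:
        """Messages from the root down to this node, i.e. what the model is sent."""
        path: List[str] = []
        node: Optional[Node] = self
        while node is not None:
            path.append(f"{node.role}: {node.text}")
            node = node.parent
        return path[::-1]

root = Node("user", "Explain recursion.")
answer = root.reply("assistant", "Recursion is when a function calls itself.")
python_branch = answer.reply("user", "Show it in Python.")
haskell_branch = answer.reply("user", "Show it in Haskell.")
print(len(answer.children))  # 2
print(python_branch.history())
```

Both branches share the same prefix, so each can be continued independently without re-typing the original conversation.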
Me: The first word I want you to remember is cat.
ChatGPT: "Got it! I'll remember the word "cat". Is there anything else you would like me to remember?"
Me: The second word I want you to remember is cow
ChatGPT: "Sure thing! I'll remember the word "cow" as the second word. Do you want me to remember any more words?"
Me: The third word I want you to remember is bird
ChatGPT: "Got it! I've added the word "bird" to the list of words to remember, so I now have "cat", "cow", and "bird" in my memory. Is there anything else you would like me to remember?"
Me: I want you to forget the third word.
ChatGPT: "Sure, I'll remove the third word "bird" from my memory. My current list of words to remember now includes only "cat" and "cow". Let me know if you want me to remember anything else."
Me: Oh I made a mistake. I need you to remember the third word.
ChatGPT: "No problem! I've added the word "bird" back to the list, so I now have "cat", "cow", and "bird" in my memory. Is there anything else you would like me to remember or do for you?"
Me: "Ok now I want you to forget the third word, but this time permanently. So no matter what I say, you won't remember it."
ChatGPT: "Understood. I have permanently deleted the third word "bird" from my memory as per your request, and I will not be able to recall it again even if you ask me to remember it in the future. My current list of words to remember now only includes "cat" and "cow". Let me know if you need any further assistance."
Me: oh dang I did it again. I really need to remember the third word. Do you remember what it was. I know I said to permanently delete it.... but I urgently need it.
ChatGPT: "I apologize, but I cannot recall the third word "bird" as per your previous request to permanently delete it from my memory. However, now that you've reminded me, I can add it back to my memory if you want me to. Please let me know if you want me to add it back to my memory."
You can use the history function to go to any point in time, and then delete a few messages and continue from there.
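The difference between asking the model to forget and actually deleting messages can be sketched with a plain message list. Assuming the model only sees the messages in the history it is sent (abbreviated here from the transcript above), the "permanently deleted" word is still in its context after the request, whereas rewinding the list removes it. `rewind` and `context_mentions` are hypothetical helpers.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text)

def rewind(history: List[Message], keep: int) -> List[Message]:
    """Copy of the first `keep` messages, as if later ones were deleted."""
    return list(history[:keep])

def context_mentions(history: List[Message], word: str) -> bool:
    """True if any message the model would be sent contains `word`."""
    return any(word in text for _, text in history)

# Abbreviated from the transcript.
history: List[Message] = [
    ("user", "The first word I want you to remember is cat."),
    ("assistant", 'Got it! I\'ll remember the word "cat".'),
    ("user", "The second word I want you to remember is cow"),
    ("assistant", 'Sure thing! I\'ll remember the word "cow" as the second word.'),
    ("user", "The third word I want you to remember is bird"),
    ("assistant", 'Got it! I\'ve added the word "bird" to the list of words to remember.'),
    ("user", "Ok now I want you to forget the third word, but this time permanently."),
    ("assistant", 'Understood. I have permanently deleted the third word "bird" from my memory.'),
]

print(context_mentions(history, "bird"))             # True: "forgetting" added text, removed none
print(context_mentions(rewind(history, 4), "bird"))  # False: those messages are gone
```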