White House is working with hackers to ‘jailbreak’ ChatGPT’s safeguards

White House is working with hackers to ‘jailbreak’ ChatGPT’s safeguards(fortune.com)

1 points by eliomattia 3 years ago | 13 comments

eliomattia 3 years ago |

As a programmer, I find it fascinating to build things from the ground up, with the inner workings either in full display or readily accessible for editing. With AI, the need to beg it to please behave, with a long list of things to do and not to do and a resounding order not to disclose such a list, is becoming commonplace.

Obviously, finding jailbreaks in LLMs is extremely important and consequential. However, there are meta questions around modern AI that remain valid, and this article is a reminder: is a continuous and direct feedback loop between code and coder a thing of the past? To what extent should we accept that LLMs are trained one-way, that we can only truly edit them with expensive trial-and-error retraining runs, hence, all we are left with is asking kindly? Are the current implementations all, or are we dealing with just one possible paradigm? Do we want AI, which relies upon computers, algorithms, and numbers written on memory, to be fundamentally programmable?

verdverm 3 years ago |

They could probably just visit Reddit, there are ample prompts there.

Prompt injection or prompt attacks are well known and likely impossible to guard against. Can you really get a human to be invulnerable to manipulation? Why would we expect the machines to be any better?

eliomattia 3 years ago | |

There are assumptions here that are intimately related to the meta questions mentioned.

> Prompt injection or prompt attacks are well known and likely impossible to guard against.

They are impossible to guard against under the assumption that the current LLM paradigm is all there is and all there could possibly be. There could be other realizations of AI. The latest impressive achievements are yielding an ongoing identification of the current computational approaches with human intelligence itself, with how we humans model reality by using natural language, and with how we could ever imagine that a computer can model reality via natural language. These are all strong assumptions and are very common even amongst researchers.

> Can you really get a human to be invulnerable to manipulation?

Most definitely not, but we are specifically not talking about humans, but about:

> AI, which relies upon computers, algorithms, and numbers written on memory

I am not making a case for machines being completely invulnerable to manipulation, which requires an analysis of the entailment structures of reality that would yield that complete invulnerability is impossible, but for better direct control on the internals, rather then relying on external instructions that easily undergo jailbreaks with simple prompt attacks.

> Why would we expect the machines to be any better?

One argument is: because they can be programmed and the memory they rely upon for its algorithms can be edited directly, with both accuracy and precision. The missing piece is how to model reality via natural language in a computer, in a way that we would know what to edit in order to affect the model with accuracy and precision.

LLMs, currently, are non-editable. When interacting with ChatGPT, its answers A will be generated from chat history H, which includes prompts and guidelines, by an immutable function, or program, f: A = f(H). It is remarkable that, in LLMs, f cannot be edited and is never entailed by the individual chat H. Since we can have multiple exchanges in the same history, H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H, it is fixed, and only entailed externally by the design and training steps. f can be fine-tuned, yet it will retain remnants of past training, hence it cannot be truly be edited at will unless we retrain the whole model. Even then, control on f is neither accurate nor precise.

It seems that non-editable LLMs remove some of the agency that is inherent in programming: editing the internals of a program to shape the entailment structures that we want to realize, with accuracy and precision.

I am by no means indicating that editable AI models that can be steered are easy to achieve, rather that the very possibility thereof is rarely mentioned, and often implicitly assumed not to exist in absolute statements that in fact only strictly apply to the current mainstream approaches.

verdverm 3 years ago | | |

> H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H, it is fixed, and only entailed externally by the design and training steps.

This is only true for the first release. Consider that OpenAI has been actively collecting input/output pairs from their users, and then retraining and updating the model. Thus A and H have impacted ChatGPT. This in turn effects how people interact with the system.

You can certainly constrain f to a single point in time, but most people will not. They think of ChatGPT as f and that f is changing or evolving (in the non-literal sense). So depending on how you look at it, f is indeed editable. Opinions will differ here and there is no right answer.

Google wrote a good paper on this feedback loop almost 10 years ago called "Machine Learning: The High Interest Credit Card of Technical Debt" that is even more relevant today.

https://research.google/pubs/pub43146/