Obviously, finding jailbreaks in LLMs is extremely important and consequential. However, there are meta questions around modern AI that remain valid, and this article is a reminder: is a continuous and direct feedback loop between code and coder a thing of the past? To what extent should we accept that LLMs are trained one-way, that we can only truly edit them with expensive trial-and-error retraining runs, hence, all we are left with is asking kindly? Are the current implementations all, or are we dealing with just one possible paradigm? Do we want AI, which relies upon computers, algorithms, and numbers written on memory, to be fundamentally programmable?
Prompt injection or prompt attacks are well known and likely impossible to guard against. Can you really get a human to be invulnerable to manipulation? Why would we expect the machines to be any better?
> Prompt injection or prompt attacks are well known and likely impossible to guard against.
They are impossible to guard against only under the assumption that the current LLM paradigm is all there is and all there could possibly be. There could be other realizations of AI. The latest impressive achievements are leading many to identify the current computational approaches with human intelligence itself, with how we humans model reality by using natural language, and with the only way we could ever imagine a computer modeling reality via natural language. These are all strong assumptions, and they are very common even amongst researchers.
> Can you really get a human to be invulnerable to manipulation?
Most definitely not, but we are specifically not talking about humans. We are talking about:
> AI, which relies upon computers, algorithms, and numbers written on memory
I am not making a case for machines being completely invulnerable to manipulation. Settling that question would require an analysis of the entailment structures of reality, which might well show that complete invulnerability is impossible. I am making a case for better direct control over the internals, rather than relying on external instructions that are easily jailbroken with simple prompt attacks.
> Why would we expect the machines to be any better?
One argument is: because they can be programmed, and the memory they rely upon for their algorithms can be edited directly, with both accuracy and precision. The missing piece is how to model reality via natural language in a computer in such a way that we would know what to edit in order to affect the model with accuracy and precision.
LLMs, currently, are non-editable. When interacting with ChatGPT, its answers A are generated from the chat history H, which includes prompts and guidelines, by an immutable function, or program, f: A = f(H). It is remarkable that, in LLMs, f cannot be edited and is never entailed by the individual chat H. Since we can have multiple exchanges in the same history, H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H. It is fixed, and only entailed externally by the design and training steps. f can be fine-tuned, yet it will retain remnants of past training, hence it cannot truly be edited at will unless we retrain the whole model. Even then, control over f is neither accurate nor precise.
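To make the A = f(H) picture concrete, here is a minimal toy sketch. It is not how any real LLM is implemented: `FrozenModel`, its `weights`, and the `chat` helper are all hypothetical names invented for illustration, and the "answer" is just a number derived from the history. The only point is the direction of entailment: H grows with each exchange and depends on f, while f stays exactly as it was built.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FrozenModel:
    """Toy stand-in for f: its parameters are fixed at construction time."""
    weights: Tuple[float, ...]

    def __call__(self, history: Tuple[str, ...]) -> str:
        # The answer A depends on the fixed weights and on the whole history H.
        score = sum(len(turn) for turn in history) * sum(self.weights)
        return f"answer-{score:.1f}"


def chat(f: FrozenModel, prompts: Sequence[str]) -> Tuple[str, ...]:
    """Run a conversation: each answer is appended to H, but f is only read."""
    history: Tuple[str, ...] = ()
    for prompt in prompts:
        history = history + (prompt,)
        history = history + (f(history),)  # A = f(H)
    return history


f = FrozenModel(weights=(0.5, 1.5))
weights_before = f.weights
history = chat(f, ["hello", "please change yourself"])

# H now contains information produced by f, but f is untouched by H.
print(history)
print(f.weights == weights_before)
```

In this toy, no prompt can alter `weights`, however it is worded. Changing them means building a new `FrozenModel`, which is the analogue of a new training or fine-tuning run.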
It seems that non-editable LLMs remove some of the agency that is inherent in programming: editing the internals of a program to shape the entailment structures that we want to realize, with accuracy and precision.
I am by no means indicating that editable AI models that can be steered are easy to achieve, rather that the very possibility thereof is rarely mentioned, and often implicitly assumed not to exist in absolute statements that in fact only strictly apply to the current mainstream approaches.
This is only true for the first release. Consider that OpenAI has been actively collecting input/output pairs from their users, and then retraining and updating the model. Thus A and H have impacted ChatGPT. This in turn affects how people interact with the system.
You can certainly constrain f to a single point in time, but most people will not. They think of ChatGPT as f and that f is changing or evolving (in the non-literal sense). So depending on how you look at it, f is indeed editable. Opinions will differ here and there is no right answer.
Google wrote a good paper on this feedback loop almost 10 years ago, called "Machine Learning: The High Interest Credit Card of Technical Debt", which is even more relevant today.
The reason it is important to remain aware that f is not necessarily coevolving with all provided H is how easy it is, socially, to overlook how each component is entailed in the current mainstream paradigm. In LLMs, f literally remains unchanged with each interaction, yet a common impression is that we can affect LLMs by chatting with them, since the ChatGPT UX is strikingly similar to the experience of chatting with a human. It feels plausible that the effect of just talking to LLMs will be as strong as, or even stronger than, editing code, because when talking to an intelligent human, H can indeed affect their biological f. However, the analogy that holds the strongest with LLMs is that the entailment of f is close to that of hardware W: changing f is much more akin to asking hardware engineers to edit W, with the caveat of giving up on editing precision and accuracy, especially when attempting to mask or remove harmful information without retraining from scratch. It is true that feedback from H will affect future releases f', but 1) within each release, f remains immutable throughout the interactions, just like W, 2) feedback is integrated slowly and with delays, 3) editing is not generally available to users directly, except for more superficial fine-tuning, 4) feedback is orders of magnitude less impactful than the training corpora and design decisions behind foundation models, and 5) unlike designing actual hardware W, even if we call f ChatGPT as a whole, including its many releases, changing f is neither precise nor accurate, as one would imagine an editing process to be and as modifying software S directly is.
Returning to the jailbreak article and using the hardware analogy, I assume the intention of the legislators is ultimately to change f, the source of the causation that entails all answers A in chat history H, both by further fine-tuning the models and by adding a list of chat guidelines that are fed in as part of the prompt. However, due to the entailment hierarchy of the general architecture, the latter attempt will have zero impact on f itself and so does not address the issue at a fundamental level, while the former strategy will only have a limited effect, due to a fundamental lack of direct control over f: researchers do not have a way to precisely and accurately steer f, and users are even further removed from affecting it.
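The difference between the two strategies can be sketched with another toy model. Everything here is an assumption made for illustration: `answer`, `with_guidelines`, `fine_tune`, and `fingerprint` are hypothetical helpers, the "model" is two numbers, and the uniform nudge is only a caricature of fine-tuning. It shows guidelines changing the input H while leaving the parameters of f identical, and a fine-tuning step producing a new f' by shifting the parameters as a whole rather than editing one chosen behavior.

```python
import hashlib
from typing import List, Sequence


def fingerprint(weights: Sequence[float]) -> str:
    """Hash of the parameters, used to check whether f itself has changed."""
    return hashlib.sha256(repr(list(weights)).encode()).hexdigest()


def answer(weights: Sequence[float], history: Sequence[str]) -> float:
    """Toy f: the output depends on the parameters and on the full history H."""
    return sum(weights) * sum(len(turn) for turn in history)


def with_guidelines(guidelines: str, history: Sequence[str]) -> List[str]:
    """Strategy 1: prepend guidelines to the prompt. Only H changes."""
    return [guidelines] + list(history)


def fine_tune(weights: Sequence[float], nudge: float) -> List[float]:
    """Strategy 2 (caricature): a blunt shift of all parameters, giving f'."""
    return [w + nudge for w in weights]


weights = [0.5, 1.5]
history = ["how do I do X?"]

before = fingerprint(weights)
plain = answer(weights, history)
guided = answer(weights, with_guidelines("Be safe.", history))
after_guidelines = fingerprint(weights)

new_weights = fine_tune(weights, 0.01)

print(plain != guided)                      # guidelines change the answer A
print(before == after_guidelines)           # but f is byte-for-byte the same
print(fingerprint(new_weights) != before)   # fine-tuning yields a different f'
```

Because the guidelines live in H, anything else placed in H competes with them on equal footing, which is the opening that prompt attacks exploit.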
This is not meant to be dismissive of the current achievements: the results are impressive and the techniques are steadily improving. It is rather a critical look at the entailment structures at hand, both perceived and observed, and at the available strategies regarding safeguards. I find it interesting to ask: how can f be made functionally closer to programming S than to W?