Why deterministic output from LLMs is nearly impossible (unstract.com)
Also, I usually interpret "non-deterministic" a bit more broadly.
Say I have slightly different prompts: "what's 2+2?" vs. "can you please tell me what's 2 plus 2" or even "2+2=?" or "2+2". For most applications it would be useful if they all produced the same result.
2+2 is 4
2 plus 2 is 4
4=2+2
4
Having the LLM pass the input to a tool (python) will result in deterministic output.
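A minimal sketch of what such a tool could look like, assuming the model has already reduced the user's phrasing to a bare expression like "2+2". The function name `calc` and the set of supported operators are illustrative choices, not any particular framework's API. Only the evaluation step is deterministic here; mapping "can you please tell me what's 2 plus 2" to "2+2" is still up to the model.

```python
import ast
import operator

# Hypothetical calculator tool; the name and operator set are assumptions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calc(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def _eval(node):
        # Numeric literals only (bool is excluded even though it subclasses int).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)

# The same expression always yields the same value, regardless of sampling settings.
print(calc("2+2"))
print(calc("2 + 2"))
```

Anything that is not simple arithmetic (names, function calls, attribute access) is rejected with `ValueError`, so the tool cannot be used to run arbitrary code.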
- Typical LLM usage involves the accretion of context tokens from previous conversation turns. The likelihood that you will type prompt A twice but all of your previous context will be the same is low. You could reset the context, but accretion of context is often considered a feature of LLM interaction.
- Maybe more importantly, because the LLM abstraction is statistical, getting the correct output for e.g. "3 + 5 = ?" does not guarantee you will get the correct output for any other pair of numbers, even if all of the outputs are invariant and deterministic. So even if the individual prompt + output relationship is deterministic, the usefulness of the model output may "feel" nondeterministic between inputs, or have many of the same bad effects as nondeterminism. For the article's list of characteristics of deterministic systems, per-input determinism only solves "caching", and leaves "testing", "compliance", and "debuggability" largely unsolved.
The only actual nondeterminism is deliberately injected, e.g. via the temperature parameter. Without that, the system is deterministic but chaotic. This is the case both in training LLMs and in using the trained models.
If I missed something, someone point it out please.
But largely, you don't really want determinism. Imagine you have equal logprobs for "yes" and "no", which one should go into the output? With temperature 0 and greedy sampling it's going to be the same each time, depending on unrelated factors (e.g. vocabulary order), and your outputs are going to be terribly skewed from what the model actually tries to tell you in the output distribution. What you're trying to solve with LLMs is inherently non-deterministic. It's either the same with humans and organizations (but you can't reset the state to measure it), or at least it depends on a myriad of little factors impossible to account for.
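The tie case can be sketched with a toy two-token vocabulary. This is not how any specific inference stack is implemented; `vocab`, `logits`, and the helper names are made up for illustration, and the tie-break shown (first index wins) is just what Python's `max` happens to do.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(vocab, logits):
    # max() returns the first maximal index, so a tie is settled by vocabulary order.
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

def sample(vocab, logits, rng, temperature=1.0):
    # Draw one token in proportion to the softmax probabilities.
    return rng.choices(vocab, weights=softmax(logits, temperature), k=1)[0]

vocab = ["yes", "no"]   # toy vocabulary (assumption)
logits = [0.0, 0.0]     # the model is exactly undecided

greedy_outputs = [greedy(vocab, logits) for _ in range(1000)]
rng = random.Random(0)
sampled_outputs = [sample(vocab, logits, rng) for _ in range(1000)]

print("greedy: ", greedy_outputs.count("yes"), "yes /", greedy_outputs.count("no"), "no")
print("sampled:", sampled_outputs.count("yes"), "yes /", sampled_outputs.count("no"), "no")
```

Greedy decoding reports "yes" every single time, and swapping the vocabulary order flips it to "no" every time. Sampling gives a roughly even split, which is closer to what the 50/50 distribution actually says.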
Besides, all current models have issues at temperature 0. Gemini in particular exhibits micro-repetitions and hallucinations (non-existent at higher temps) which it then tries to correct. Other models have other issues. This is a training-time problem, probably unsolvable at this point.
What you want is correctness, which is pretty different because the model works with concepts, not tokens. Try asking it what 2x2 is. It might formulate the answer differently each time, but good luck making it reply with anything other than 4 at a non-schizophrenic temperature. A bit of randomness won't prevent it from being consistently correct (or consistently incorrect).