Why deterministic output from LLMs is nearly impossible

(unstract.com)

26 points by naren87 326 days ago | 16 comments

kazinator 326 days ago |

This is a SaaS problem, not an LLM problem. If you have a local LLM that nobody is upgrading behind your back, it will calculate the same thing on the same inputs. Unless there is a bug somewhere, like using uninitialized memory, the floating-point calculations, the token embedding, and all the rest do the same thing each time.

Cilvic 326 days ago | |

So couldn't SaaS or cloud/API LLMs offer this as an option: a guarantee that the "same prompt" will always produce the same result?

Also, I usually interpret "non-deterministic" a bit more broadly.

Say I have slightly different prompts: "what's 2+2?" vs. "can you please tell me what's 2 plus 2" or even "2+2=?" or "2+2". For most applications it would be useful if they all produced the same result.

alphan0n 326 days ago | | |

The form of the question determines the form of the outcome, even if the answer is the same. Asking the same question in a different way should produce an answer that adheres to the form of the question.

2+2 is 4

2 plus 2 is 4

4=2+2

Having the LLM pass the input to a tool (python) will result in deterministic output.
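A minimal sketch of the tool idea, assuming the model has already extracted an arithmetic expression from the prompt and hands it to a calculator. The `calc` function is hypothetical and not part of any LLM API. It only illustrates that the tool's result depends on the expression, not on how the question was phrased.

```python
import ast
import operator

# Hypothetical tool: a tiny arithmetic evaluator the model could call.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calc(expression: str):
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

print(calc("2+2"))    # 4
print(calc("2 + 2"))  # 4
```

The wording the model wraps around the result can still vary between runs; only the computed value is pinned down.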

nativeit 326 days ago | |

Doesn’t that imply that LLMs are just “if this, then that” but bigger?

ezst 326 days ago | | |

Sure, why would you expect it to be different?

lsy 326 days ago |

There are two additional aspects that are even more critical than the implementation details here:

- Typical LLM usage involves the accretion of context tokens from previous conversation turns. The likelihood that you will type prompt A twice but all of your previous context will be the same is low. You could reset the context, but accretion of context is often considered a feature of LLM interaction.

- Maybe more importantly, because the LLM abstraction is statistical, getting the correct output for e.g. "3 + 5 = ?" does not guarantee you will get the correct output for any other pair of numbers, even if all of the outputs are invariant and deterministic. So even if the individual prompt + output relationship is deterministic, the usefulness of the model output may "feel" nondeterministic between inputs, or have many of the same bad effects as nondeterminism. For the article's list of characteristics of deterministic systems, per-input determinism only solves "caching", and leaves "testing", "compliance", and "debuggability" largely unsolved.
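To make the second point concrete, here is a toy sketch. `toy_model` is a made-up, fully deterministic stand-in for an LLM, and its canned answers are invented for illustration. Caching works perfectly, yet a correct answer on one prompt says nothing about the next one.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM (hypothetical canned answers)."""
    canned = {
        "3 + 5 = ?": "8",   # correct
        "4 + 9 = ?": "12",  # deliberately wrong, but just as repeatable
    }
    return canned.get(prompt, "unknown")

print(toy_model("3 + 5 = ?"))  # 8
print(toy_model("3 + 5 = ?"))  # 8, served from the cache
print(toy_model("4 + 9 = ?"))  # 12, stable and still incorrect
```

A test that passes for "3 + 5 = ?" would pass forever here, while the neighbouring input stays wrong on every run.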

redsymbol 326 days ago |

There may be something I do not understand about LLMs. But it seems it is more correct to say LLMs are chaotic - in the mathematical sense of sensitive dependence on initial conditions.

The only actual nondeterminism is deliberately injected. E.g. the temperature parameter. Without that, it is deterministic but chaotic. This is the case both in training LLMs, and in using the trained models.

If I missed something, someone point it out please.
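A rough sketch of where the injected randomness lives, assuming a plain softmax sampler. The logits are made up, and `sample_token` is an illustrative helper, not any library's API. Temperature 0 is treated here as greedy decoding.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits.

    temperature == 0 is treated as greedy decoding (argmax), which is
    deterministic; any temperature > 0 draws from the softmax distribution.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.1]  # made-up scores for three tokens
rng = random.Random()     # unseeded, so sampled output differs between runs

print([sample_token(logits, 0, rng) for _ in range(5)])    # [0, 0, 0, 0, 0]
print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies between runs
```

Seeding the generator (for example `random.Random(42)`) makes the sampled path repeatable too, which fits the point that the randomness is injected rather than inherent.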

jqpabc123 326 days ago |

Probabilistic processes are not the most appropriate way to produce deterministic results. And definitely not if the system is designed to update, grow or "learn" from inputs.

orbital-decay 325 days ago |

The author read the docs but never experimented, so they don't seem to have an intuition for the theory. For example, Gemini Flash actually seems to have deterministic outputs at temp 0, despite the disclaimer in the docs. Clearly Google has no trouble making it possible. Why don't they guarantee it, then? For starters, it's inconvenient due to batching. You can see that in Gemini Pro, which is "almost" deterministic: the outputs vary, but identical results come in groups. It's a SaaS problem. If you run a model locally, it's much easier to make it deterministic than the article presents, and definitely not nearly impossible. It's going to cost you more, though.
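The comment doesn't spell out why batching gets in the way. One commonly cited reason, offered here as an assumption rather than a description of Gemini's internals, is that floating-point addition is not associative, so a different batch layout can change the order of a reduction and nudge the result:

```python
# Same three numbers, two summation orders.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one reduction order
right = a + (b + c)  # another reduction order

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

A difference in the last bit is usually harmless, but when two candidate tokens have nearly equal scores it can be enough to flip which one greedy decoding picks.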

But largely, you don't really want determinism. Imagine you have equal logprobs for "yes" and "no", which one should go into the output? With temperature 0 and greedy sampling it's going to be the same each time, depending on unrelated factors (e.g. vocabulary order), and your outputs are going to be terribly skewed from what the model actually tries to tell you in the output distribution. What you're trying to solve with LLMs is inherently non-deterministic. It's either the same with humans and organizations (but you can't reset the state to measure it), or at least it depends on a myriad of little factors impossible to account for.
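A small sketch of the tie case, assuming greedy decoding is a plain argmax that keeps the first maximum it sees. The two-token vocabulary and the `greedy` helper are made up for illustration.

```python
def greedy(vocab, logprobs):
    """Return the highest-scoring token; ties go to the lowest index."""
    best = max(range(len(vocab)), key=lambda i: logprobs[i])
    return vocab[best]

tie = [-0.6931, -0.6931]  # two equally likely tokens (about 50% each)

print(greedy(["yes", "no"], tie))  # yes
print(greedy(["no", "yes"], tie))  # no
```

The model's distribution says "50/50" in both cases, yet the output is always whichever token happens to come first in the vocabulary.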

Besides, all current models have issues at temperature 0. Gemini in particular exhibits micro-repetitions and hallucinations (non-existent at higher temps) which it then tries to correct. Other models have other issues. This is a training-time problem, probably unsolvable at this point.

What you want is correctness, which is pretty different because the model works with concepts, not tokens. Try asking it what 2x2 is. It might formulate the answer differently each time, but good luck making it reply with anything other than 4 at a sane temperature. A bit of randomness won't prevent it from being consistently correct (or consistently incorrect).