Evals will break (wanglun1996.github.io)
An eval does not silently break just because an LLM gains some new capability. It wouldn't be a good eval if it did. What an eval does is steer the LLM towards specific goals. If anything, an argument can be made that evals restrict creativity and experimentation by narrowing those goals.
If the argument is that evals need to be written before some new behavior can be devised, that's incorrect. There are an infinite number of evals that test for things which cannot be done. Only once something has been demonstrated to work in a specific context can an eval for it be written.
One thing that was not addressed, but would be interesting to discuss, is benchmarks/evals that conflict.
Are there desirable emergent behaviors that might not be optimized for because the evals penalize them?
For the uninitiated: if you think testing is nothing more than simple operations and assertions, then you don’t know anything important about testing.