Evaluating 55 LLMs with GPT-4 (benchmarks.llmonitor.com)
If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
Is this our Concorde moment?