Train-Before-Test: One Simple Fix That Makes LLM Benchmark Rankings Agree

Train-Before-Test: One Simple Fix That Makes LLM Benchmark Rankings Agree (ghzhang233.github.io)

2 points by taegee 70 days ago | 1 comment

taegee 70 days ago

"Model A wins on MMLU. Model B wins on ARC-Challenge. Model C wins on HellaSwag.

At some point you stop trusting any of them—not because benchmarks are meaningless, but because no two of them seem to tell the same story about which model is actually better.

[…]

We found a fix. It’s called Train-before-Test."
