To solve the benchmark crisis, evals must think | Dark Hacker News