I built a small benchmark to test CLI coding agents on blind bug detection. A challenger agent injects bugs and writes ground truth (`bugs.json`). A different reviewer agent audits the repo without seeing ground truth, and an LLM matcher scores bug-to-finding assignments. Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs. Weighted detection: Claude 58.05%, Codex 37.84%, Gemini 27.81%. LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing. Full dataset is linked in the docs. |
No comments yet