We might be overestimating coding agent performance on SWE-Bench

We might be overestimating coding agent performance on SWE-Bench (cgft.io)

1 points by kumama 1 year ago | 1 comment

kumama 1 year ago

Hey everyone! We recently came across an ICLR submission highlighting dataset contamination issues with SWE-Bench. After filtering out the affected instances, the authors saw the performance of SWE-Agent + GPT-4 drop significantly, from 12.47% to 3.97%.
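To put that drop in perspective, here is a quick back-of-the-envelope sketch. The only inputs are the two percentages quoted above; the variable names are ours and purely illustrative, not anything from the paper.

```python
# Resolve rates quoted above (percent of SWE-Bench tasks resolved).
before = 12.47  # SWE-Agent + GPT-4, original evaluation
after = 3.97    # same setup after the filtering described in the paper

absolute_drop = before - after          # difference in percentage points
relative_drop = absolute_drop / before  # fraction of the original score lost

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
```

That works out to roughly 8.5 percentage points, or about two thirds of the originally reported score.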

This led us to think more deeply about SWE-Bench as an evaluation tool. We've put together a blog post that reviews this paper and other relevant research, and shares our thoughts on additional gaps in SWE-Bench.

Blog: https://www.cgft.io/blog/swe-bench-evals

Paper: https://openreview.net/forum?id=pwIGnH2LHJ

Would love your thoughts as well! This post isn’t meant to criticize SWE-Bench; it’s still the best dataset out there for evaluating coding agents. Instead, we hope this discussion can spark ideas on how to make it even better!
