71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which is at #2 with 74.4%. They built a 2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchmark:
Building multiple attempts into your agent is stretching the rules, even if technically it’s acceptable
I.e. the agent cannot even know which tests are failing.
It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.
For this reason I find the benchmark a little disconnected from the reality of software engineering.
Another approach might be the LiveBench approach where new tests are released on a regular basis.
I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.
I think that the next step is getting an official "checked" mark by the SWE bench team
I do not want to pay API charges or be limited to a fixed number of "credits" per month.
I updated to the latest version last night. Enjoyed seeing the process permission toggle (rwx). It was a refreshing change that should keep the security-minded folks from panicking over all the agentic coding adoption :-)
The best submission on swe-bench-multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.
https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939
But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.
This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.
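To make the overfitting concern concrete, here is a rough simulation sketch. It assumes every candidate (checkpoint, parameter setting, etc.) has the *same* true solve rate and that tasks are independent coin flips; the function names and the numbers (70% true rate, 500 tasks, 20 candidates) are made up for illustration and are not taken from any real submission.

```python
import random


def observed_score(true_rate: float, n_tasks: int, rng: random.Random) -> float:
    """Simulate one benchmark run: each task is solved with probability true_rate."""
    solved = sum(rng.random() < true_rate for _ in range(n_tasks))
    return solved / n_tasks


def best_of_candidates(true_rate: float, n_tasks: int, n_candidates: int, seed: int = 0):
    """Score n_candidates equally good candidates; return (best, mean) observed score.

    Assumption: all candidates share the same true_rate, so any gap between
    the best observed score and the mean is pure selection noise.
    """
    rng = random.Random(seed)
    scores = [observed_score(true_rate, n_tasks, rng) for _ in range(n_candidates)]
    return max(scores), sum(scores) / len(scores)


if __name__ == "__main__":
    best, mean = best_of_candidates(true_rate=0.70, n_tasks=500, n_candidates=20)
    print(f"mean observed: {mean:.3f}, best observed: {best:.3f}")
```

If you report only the best of the 20 runs, you typically report something above the true rate even though nothing actually improved, which is the habitual-selection effect described above.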
The approach is to use workloads defined by developers and end users (not providers) that reflect their real-world tasks. E.g. in finance, delivering market snapshots to trading engines. We test full stacks, holding some layers constant so you can isolate the effect of hardware, software, or models. Every run goes through an independent third-party audit to ensure consistent conditions, no cherry-picking of results, and full disclosure of config and tuning, so that the results are reproducible and the comparisons are fair.
In finance, the benchmarks are trusted enough to drive major infrastructure decisions by the leading banks and hedge funds, and in some cases to inform regulatory discussions, e.g. around how the industry handles time synchronization.
We are now starting to apply the same principles to the AI benchmarking space. Would love to talk to anyone who wants to be involved.
So the business model would be AI foundries contracting you for evaluating their models?
Do you envision some kind of freely accessible platform for consulting the results?
It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.
For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.
But for pay-per-use pricing, if you have a bigger model that gets the answer right 80% of the time, and a smaller model that can handle smaller changes and gets things right 60% of the time but can correct its mistakes, then the system should try to run the smaller model on as many tasks as possible to save you money. But if it ends up having to make a lot of corrections, then maybe you end up needing more total requests than with the larger model. In that case maybe it's actually cheaper to run the larger model, if it takes fewer requests.
So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.
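A back-of-the-envelope version of that calculation, as a sketch: assume each attempt succeeds independently with a fixed probability and the model retries until it gets it right, so the expected number of requests is 1/p. The 80% and 60% figures are the ones from above; the per-request prices and function names are hypothetical.

```python
def expected_requests(p_success: float) -> float:
    """Expected attempts until the first success (geometric distribution).

    Assumption: attempts are independent with the same success probability.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


def expected_cost(price_per_request: float, p_success: float) -> float:
    """Expected spend per solved task under the retry-until-success assumption."""
    return price_per_request * expected_requests(p_success)


def cheaper_model(price_small: float, p_small: float,
                  price_big: float, p_big: float) -> str:
    """Return which model has the lower expected cost per solved task."""
    small = expected_cost(price_small, p_small)
    big = expected_cost(price_big, p_big)
    return "small" if small < big else "big"


if __name__ == "__main__":
    # Hypothetical prices: the big model costs 3x the small one per request.
    print(cheaper_model(price_small=1.0, p_small=0.6, price_big=3.0, p_big=0.8))
    # Hypothetical prices: the big model costs only 1.2x the small one.
    print(cheaper_model(price_small=1.0, p_small=0.6, price_big=1.2, p_big=0.8))
```

Under these assumptions the small model only wins when its price is less than p_small / p_big (here 0.75) of the big model's price. Real retries are not independent, so counting actual retries per model, as suggested above, would give better estimates of p.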
I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.
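Roughly what I mean, as a simplified sketch: it tries the models one after another rather than generating both hypotheses up front, and it assumes each model is a callable that returns a patch, or None when its own check fails. The `Attempt` type and both function names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a model takes a problem description and returns
# a patch string, or None if its attempt failed its own checks.
Attempt = Callable[[str], Optional[str]]


def cascade(problem: str, models: Sequence[Attempt]) -> Tuple[Optional[str], int]:
    """Try each model in order (cheapest first); return (patch, attempts_used)."""
    for used, model in enumerate(models, start=1):
        patch = model(problem)
        if patch is not None:
            return patch, used
    return None, len(models)


def cascade_expected_cost(price_small: float, p_small: float, price_big: float) -> float:
    """Expected cost of one small attempt, plus one big attempt if the small one fails."""
    return price_small + (1.0 - p_small) * price_big


if __name__ == "__main__":
    small = lambda problem: None            # stand-in: small model fails
    big = lambda problem: "patch-from-big"  # stand-in: big model succeeds
    print(cascade("fix the bug", [small, big]))
    print(cascade_expected_cost(price_small=1.0, p_small=0.6, price_big=3.0))
```

With the hypothetical prices above, the cascade costs less on average than always calling the big model, but only if failures of the small model can be detected cheaply.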
def make_pass_at_1_agent(agent, n):
    # Wrap an agent so that up to n internal attempts count as one "pass@1" run.
    def retry_agent(problem):
        result = None  # stays None if n == 0
        for attempt in range(n):
            result = agent(problem)
            if result.success:
                return result
        return result  # every attempt failed; return the last one
    return retry_agent

Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.
If they are running their production product as is, then of course whatever is built into the product is fine.