Ask HN: What are some good benchmarks for different agent harnesses? Other than Terminal-Bench, which doesn't quite map to my experience, what other benchmarks show how different models perform in different harnesses?
https://www.vals.ai/benchmarks/vibe-code
https://www.vals.ai/benchmarks/swebench
https://www.vals.ai/benchmarks/terminal-bench-2-1 (Vals' customized version of Terminal-Bench 2.0)