Dark Hacker News
bisonbear | Dark Hacker News
user: bisonbear
created: September 17, 2025
karma: 38
about: Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh
submissions
1. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh)
   1 point by bisonbear 4 days ago | 0 comments
2. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh)
   2 points by bisonbear 9 days ago | 0 comments
3. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh)
   4 points by bisonbear 16 days ago | 0 comments
4. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh)
   2 points by bisonbear 30 days ago | 0 comments
5. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh)
   1 point by bisonbear 32 days ago | 0 comments
6. Agents.md is the highest-leverage code you're not testing (stet.sh)
   1 point by bisonbear 37 days ago | 0 comments
7. Your AI coding benchmark is hiding a 2x quality gap (stet.sh)
   3 points by bisonbear 65 days ago | 0 comments