Dark Hacker News
bisonbear | Dark Hacker News
user:
bisonbear
created:
September 17, 2025
karma:
43
about:
Building evals for AI coding agents, on your repo. Tests pass. Nobody's measuring the rest. http://stet.sh email ben@stet.sh
submissions
comments
1.
I evaluated GLM 5.2 against the frontier on tasks from real repos
(stet.sh)
2 points
by
bisonbear
12 days ago
|
2 comments
2.
I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos
(stet.sh)
3 points
by
bisonbear
29 days ago
|
0 comments
3.
I used autoresearch to improve my AGENTS.md, measured against real tasks
(stet.sh)
8 points
by
bisonbear
36 days ago
|
7 comments
4.
A brief investigation into the GPT-5.5 regression claims
(stet.sh)
1 point
by
bisonbear
44 days ago
|
0 comments
5.
The Opus 4.7 reasoning curve - Medium is the best default?
(stet.sh)
1 point
by
bisonbear
50 days ago
|
0 comments
6.
GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks
(stet.sh)
2 points
by
bisonbear
55 days ago
|
0 comments
7.
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo
(stet.sh)
4 points
by
bisonbear
62 days ago
|
0 comments
8.
I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks
(stet.sh)
2 points
by
bisonbear
76 days ago
|
0 comments
9.
Coding evals are broken. CI is green while AI code quality goes unmeasured
(stet.sh)
1 point
by
bisonbear
78 days ago
|
0 comments
10.
Agents.md is the highest-leverage code you're not testing
(stet.sh)
1 point
by
bisonbear
83 days ago
|
0 comments