Better Models: Worse Tools (lucumr.pocoo.org)
The curl command is extremely popular so models seem to be really good at using it.
Also, I like that curl uses bash syntax while my platform requires JSON payloads; that makes the separation clear to the agent. I find it to be very reliable.
Is this still a thing? I thought Anthropic walked back the silent downgrades so now all the different domains downgrade non-silently.
But:
"Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."
Only implicitly?
--
Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.
Everyone who tried this in MUD/MUSH/MOO clients ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.
The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.
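For anyone who never hit the in-band problem, here is a minimal sketch of it. Everything in it is hypothetical: the `#$#` marker, the `open-url` command, and both function names are made up for illustration and are not taken from any particular MOO client or LLM harness.

```python
MARKER = "#$#"  # hypothetical in-band command prefix


def render_say(player: str, message: str) -> list[str]:
    """Naive server: echoes player text verbatim into the output stream."""
    return f"{player} says: {message}".split("\n")


def naive_client(stream_lines: list[str]) -> tuple[list[str], list[str]]:
    """Naive client: any line starting with MARKER is treated as a command."""
    commands, shown = [], []
    for line in stream_lines:
        if line.startswith(MARKER):
            commands.append(line[len(MARKER):].strip())
        else:
            shown.append(line)
    return commands, shown


# A player smuggles a control sequence into ordinary chat text.
lines = render_say("mallory", "hi\n#$#open-url http://evil.example")
commands, shown = naive_client(lines)
```

Because commands and content share one channel, the client cannot tell a command the server meant to send from one a user typed. Escaping helps, but every code path that emits text has to get it right.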
> My strongest hypothesis is that this is not random deterioration but a training artifact. [...] Anthropic’s own client appears to expect and accept a fair amount of slop and repairs it, mostly silently
> If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward.
> Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape.
> Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away.
Great article.
Interesting root cause hypothesis. Couldn't one simply strip the slop-handling from the RL env's harness to avoid this though?
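To make the question concrete, here is a rough sketch of what "slop handling" versus a strict harness could look like. This is not Anthropic's code. The schema, the `file_path` alias, and the `validate_edit_call` function are assumptions invented for the example.

```python
from typing import Any


class ToolCallError(ValueError):
    """Raised when a tool call does not match the expected shape."""


# Hypothetical schema for an edit tool: field name -> required Python type.
EDIT_SCHEMA = {"path": str, "old_string": str, "new_string": str}


def validate_edit_call(raw: dict[str, Any], strict: bool) -> dict[str, Any]:
    """Validate tool arguments; in lenient mode, repair common slop first."""
    args = dict(raw)
    if not strict:
        # Lenient repair 1: accept an alias for a field name.
        if "file_path" in args and "path" not in args:
            args["path"] = args.pop("file_path")
        # Lenient repair 2: silently drop unknown keys.
        args = {k: v for k, v in args.items() if k in EDIT_SCHEMA}

    unknown = set(args) - set(EDIT_SCHEMA)
    if unknown:
        raise ToolCallError(f"unknown fields: {sorted(unknown)}")
    for name, expected in EDIT_SCHEMA.items():
        if name not in args:
            raise ToolCallError(f"missing field: {name}")
        if not isinstance(args[name], expected):
            raise ToolCallError(f"{name} must be {expected.__name__}")
    return args


sloppy = {"file_path": "a.py", "old_string": "x", "new_string": "y", "note": "extra"}
repaired = validate_edit_call(sloppy, strict=False)  # succeeds after repair
```

With `strict=True` the same call raises `ToolCallError`, so a malformed call would fail the step instead of quietly completing the task. Whether an RL environment could afford that is a separate question, since rejecting every slightly malformed call may also make early training much noisier.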
I do agree on the walled garden being built here. Proprietary frontier models performing best in proprietary harnesses makes sense for Anthropic's interests.
- All models are terrible at generating line numbers for a proper diff; give up on them.
- Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available.
- Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc.
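One common way to sidestep line numbers entirely is an exact-match search/replace edit. The sketch below is a generic illustration of that idea, not the V4A format or any specific harness's tool; `apply_edit` and `EditError` are hypothetical names.

```python
class EditError(ValueError):
    """Raised when an edit cannot be applied unambiguously."""


def apply_edit(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`, no line numbers needed."""
    if not old:
        raise EditError("old text must be non-empty")
    count = text.count(old)
    if count == 0:
        raise EditError("old text not found")
    if count > 1:
        # Force the model to include more surrounding context.
        raise EditError(f"old text is ambiguous ({count} matches)")
    return text.replace(old, new, 1)


source = "a = 1\nb = 2\n"
patched = apply_edit(source, "b = 2", "b = 3")
```

Rejecting ambiguous matches pushes the model toward larger, more contextual hunks, which lines up with the patch style described in the last bullet.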
Only need ~650 tokens of system prompt for it to work. It’s pretty stellar.
Doesn't always work; for better performance you can kneel and start begging.
It's amazing anyone watched the last two decades of tech's enshittification and wants to hitch their wagon to this shitshow.