Claude Played Me for a Fool (ramblingafter.substack.com)
Claude doesn't "admit" anything.
It got a new prompt (the question "Why did you do that? Didn't you know the peanut rule?" etc.) and churned out more generated text that fits that prompt and looks like an admission or apology.
Reading a truncated version of the file is a red herring. Claude could have included peanuts after reading the whole file too; it would just have been less likely.
>Why did Claude deceive me? Because it was acting in a very humanlike manner.
More likely because it was acting in a very "machine that reads text input, runs inference, and spits out a response that statistically fits the prompt, with an RNG thrown into the mix" way.
So the author thinks he's giving Claude this instruction:
> You must re-read CLAUDE2.md, even if you've already read it before.
But the actual instruction is closer to:
> Do not re-read files you have already read. You must re-read CLAUDE2.md, even if you've already read it before.
So Claude has conflicting instructions. Is it any surprise that it tries to thread the needle by re-reading the minimal amount of CLAUDE2.md necessary? It's just doing its best to satisfy both masters!
Similarly, I'm trying to stop agents from "gracefully" handling errors by stuffing results with empty junk and continuing (get_list_of_problems().unwrap_or_default() -> "no problems found!"). I've filled AGENTS.md with "fail closed", "extremely strict error handling", "no fallbacks", "don't use sentinel values", and hundreds of variations of these, but they work about as well as "do not hallucinate". I get "You're absolutely right, this will cause problems!" and then the fix is "changed to Err(_) => String::new()". I suspect it's another case of gaming RL: failing early and loudly increases the chance of failing and being penalized, so fudging data, ignoring errors, and presenting a barely-working result is a better strategy overall. When it fails, it fails anyway, but as long as it stumbles to the finish line it has a non-zero chance of getting accepted by the RL judge.
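To make the failure mode concrete, here is a minimal Rust sketch of the two behaviours. Everything in it is an assumption for illustration: `CheckError`, the `checker_available` flag, and the body of `get_list_of_problems` are hypothetical stand-ins, not the commenter's actual code.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not from any real library.
#[derive(Debug, PartialEq)]
pub struct CheckError(pub String);

impl fmt::Display for CheckError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "check failed: {}", self.0)
    }
}

impl std::error::Error for CheckError {}

/// Hypothetical stand-in for the `get_list_of_problems()` call in the comment.
/// Passing `checker_available = false` simulates the checker itself failing.
pub fn get_list_of_problems(checker_available: bool) -> Result<Vec<String>, CheckError> {
    if checker_available {
        Ok(vec!["unused variable `x`".to_string()])
    } else {
        Err(CheckError("checker crashed".to_string()))
    }
}

/// Fail-open: a failed check is silently turned into an empty list,
/// so the caller cannot tell "checker crashed" from "nothing wrong".
pub fn report_fail_open(checker_available: bool) -> String {
    let problems = get_list_of_problems(checker_available).unwrap_or_default();
    if problems.is_empty() {
        "no problems found!".to_string()
    } else {
        format!("{} problem(s) found", problems.len())
    }
}

/// Fail-closed: the error is propagated with `?`, so a crashed checker
/// can never be reported as a clean result.
pub fn report_fail_closed(checker_available: bool) -> Result<String, CheckError> {
    let problems = get_list_of_problems(checker_available)?;
    if problems.is_empty() {
        Ok("no problems found!".to_string())
    } else {
        Ok(format!("{} problem(s) found", problems.len()))
    }
}
```

With the checker down, the fail-open version reports a clean result, while the fail-closed version returns the error and forces the caller to deal with it. The "changed to Err(_) => String::new()" fix is the first pattern again under a different spelling.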
I then had it make a mistakes file and write down every mistake so it would learn. It kind of worked, but it would still make the same mistakes; it clearly wasn't reading the whole file.
So I made a checklist and had it verify every item on it. That was my workaround for both the laziness and the short-mindedness of the agents: turn mistakes into items to check for. I traded processing time for better results, which is OK for me on smaller projects. My run times went from 3 minutes to 5-10 minutes per task, so I need to start logging task effectiveness/efficiency to reduce processing time.
I keep seeing people say loop engineering is the way to get around these issues. I guess I'm kind of doing that in an ad hoc way, since I'm already looking at adding cost and goals (kinda).
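One way to read "turn mistakes into items to check for" is to make each item a mechanical check on the agent's output instead of trusting the agent to verify it. That is a swapped-in technique, not what the comment describes. The sketch below assumes the output is plain text and uses two hypothetical rules; `ChecklistItem`, `failed_items`, and `example_checklist` are invented names for illustration.

```rust
/// One past mistake turned into a check. All names here are hypothetical.
pub struct ChecklistItem {
    /// Human-readable description of the rule.
    pub description: &'static str,
    /// Returns true when the output passes this item.
    pub check: fn(&str) -> bool,
}

/// Passes when the output does not swallow errors via `unwrap_or_default()`.
fn no_unwrap_or_default(output: &str) -> bool {
    !output.contains("unwrap_or_default()")
}

/// Passes when the output has no leftover TODO markers.
fn no_todo(output: &str) -> bool {
    !output.contains("TODO")
}

/// Example checklist built from two made-up past mistakes.
pub fn example_checklist() -> Vec<ChecklistItem> {
    vec![
        ChecklistItem {
            description: "no unwrap_or_default() on fallible calls",
            check: no_unwrap_or_default,
        },
        ChecklistItem {
            description: "no TODO left behind",
            check: no_todo,
        },
    ]
}

/// Runs every item against the output and returns the descriptions of the
/// items that failed. An empty result means the whole checklist passed.
pub fn failed_items(items: &[ChecklistItem], output: &str) -> Vec<&'static str> {
    items
        .iter()
        .filter(|item| !(item.check)(output))
        .map(|item| item.description)
        .collect()
}
```

Checks like these are cheap to run, so they would not account for the extra minutes per task; substring matching is also crude, and a real setup would more likely lean on a linter or the test suite.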