Constraint Decay: The Fragility of LLM Agents in Back End Code Generation (arxiv.org)

59 points by wek 4 hours ago | 34 comments

jdlshore 2 hours ago |

“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.

qsort 58 minutes ago | |

I think it's downstream of "you can't optimize for two different objectives".

If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.

If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.
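
A minimal sketch of the "put style examples in the prompt" idea, not the specific tip being credited here. `build_prompt` and its layout are hypothetical names chosen for illustration; the point is only that the non-functional requirements reach the model as concrete code rather than as prose rules.

```python
# Hypothetical helper: pairs the functional task with concrete examples
# of the code style the user wants the model to imitate.
def build_prompt(task: str, style_examples: list[str]) -> str:
    parts = ["Task:", task.strip(), ""]
    if style_examples:
        parts.append("Write the solution in the same style as these examples:")
        for i, example in enumerate(style_examples, start=1):
            parts.append(f"--- Example {i} ---")
            parts.append(example.strip())
        parts.append("")
    parts.append("Solution:")
    return "\n".join(parts)


prompt = build_prompt(
    "Add a /health endpoint.",
    ["def get_user(repo, user_id):\n    return repo.find(user_id)"],
)
print(prompt)
```

With an empty example list the prompt falls back to the bare task, which is the "incomplete specification" case described above.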

apsurd 55 minutes ago | | |

Would you mind sharing antirez' suggestion?

jeremyjh 55 minutes ago | |

Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.

I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.

Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.

sigbottle 10 minutes ago | | |

Wait isn't gpt 5.2 good? Or is it not thinking / not codex? 5.2 was what sparked the late 2025 openai agentic programming revolution.

nijave 38 minutes ago | |

Hmm, I have some anecdotal evidence this is true. While interactively working out a plan with Opus, on multiple occasions it came up with an incompatible solution. When I added context or requirements, it tended to "anchor" on its original architecture and struggled to adapt. Sometimes it tries to sneak in changes for the original plan anyway.

xienze 20 minutes ago | |

> their performance drops when forced to navigate explicit architectural rules

Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.

p0w3n3d 1 hour ago |

> tasks spanning eight web frameworks

Does anyone else find that LLMs create better pure HTML+CSS+JS than code that works with existing frameworks?

bob1029 18 minutes ago | |

I think web frameworks have been "in trouble" as of gpt-5.4. I can't imagine using something like React anymore.

The most incredible combo I've seen lately is progressive enhancement of Razor Pages with JavaScript. With this arrangement the newest models tend to make a really good call on whether something should happen server-side (cshtml) or on the client (js).

maxbond 2 hours ago |

Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline most LLMs can perform long horizon tasks on without accumulating errors & corrupting the document.

I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.

[1] https://arxiv.org/abs/2604.15597

Discussion: https://news.ycombinator.com/item?id=48073246

emp17344 1 hour ago | |

If it’s not easily verifiable, LLMs aren’t good at it.

jeremyjh 1 hour ago | | |

I think that’s mostly because they get so much more reinforcement learning on verifiable tasks, since it is so economical. I don’t know of any evidence of a fundamental reason they can’t be just as good at other tasks, but it might be economically infeasible for a while yet.

dwa3592 1 hour ago |

This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window, because by the end the output no longer respects the constraints (or guardrails). But to reliably produce production-grade code, you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories - and make this <insert ticket> change": keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).

whatever1 1 hour ago | |

This is not a new problem though. This is why we started writing modular code, strict interfaces etc

lanstin 1 hour ago | | |

And doing incremental dev, so once a feature is done you can mostly ignore it.

bob1029 1 hour ago |

> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.

I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.

The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.

richardlblair 1 hour ago | |

I find the same. We have abstractions with multiple concrete implementations, examples of patterns and examples of anti patterns.

I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.

xcjsam 51 minutes ago | |

The harness mattering more than the model lines up with my experience too. What this paper measures is within-turn constraint decay. The version that bites in multi-agent setups is across-session — the architectural rules an agent wrote down on Monday don't reach the agent making the next change on Tuesday.

yomismoaqui 1 hour ago |

Also they used languages with dynamic typing like Python & JS. In my experience a statically typed codebase is easier to maintain for humans so maybe it is also for agents.

When using Codex/Claude Code with Go code, I cannot count the times the agent makes a change, runs a build to check for errors, finds some, and fixes them.
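
The change, build, fix cycle described above can be sketched roughly as follows. This is an illustration under assumptions, not how Codex or Claude Code is implemented: `propose_fix` is a hypothetical callback standing in for the agent step that edits code, and `max_rounds` is an arbitrary cap.

```python
import subprocess


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run a build or type-check command; return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def fix_until_green(command: list[str], propose_fix, max_rounds: int = 3) -> bool:
    """Check, hand any compiler output to propose_fix (hypothetical), then re-check."""
    for _ in range(max_rounds):
        passed, output = run_check(command)
        if passed:
            return True
        propose_fix(output)  # the agent would edit files here, guided by the errors
    # One final check after the last fix attempt.
    return run_check(command)[0]
```

For a Go project the command might be `["go", "build", "./..."]`. The compiler output is the verifiable signal: a statically typed build gives the loop something concrete to react to on every round.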

acbart 1 hour ago | |

It's crazy to me that people think of Python as dynamically typed by default. Strong static typing has been an option in Python for years now, and it should just be the default.

epgui 1 hour ago | | |

Python's type hints are useful for static analysis (and yes, should be the default), but they're a joke compared to the utility of types in a language like Haskell.
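
A small sketch of the distinction being argued here, using a made-up function `double`. Python's annotations are not enforced when the code runs; they only help if a separate static checker (mypy is one such tool) is run over the code.

```python
def double(x: int) -> int:
    # The annotations say "int in, int out", but the interpreter
    # does not check them when the function is called.
    return x * 2


print(double(21))    # 42
print(double("ab"))  # abab: runs without error despite the int annotation
```

A static checker would flag the second call as an incompatible argument type, which is the kind of feedback a compiled language gives on every build.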

gkfasdfasdf 2 hours ago |

Odd that they used GPT-5.2 and not GPT-5.2-codex, i.e. the one optimized for coding agent tasks.

rbbydotdev 1 hour ago |

This is interesting. Anecdotally, I felt like I was having better luck with raw SQLite queries than with an ORM (Drizzle) in a recent TypeScript project.

leecommamichael 1 hour ago |

These things don’t think. We’re going to have to reiterate this for a long time, I fear.

sheeshkebab 1 hour ago | |

…but they reason well enough given enough context (using their matmuls).

noosphr 1 hour ago | | |

To this day frontier models think that A and not B means A and B when the sentence gets pushed far enough back in their context window. The context length that a model can reason over without obvious errors is much smaller than the advertised context: between 1/4 and 1/20 of what is advertised on the tin.
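
To put that rough range in numbers, here is a back-of-the-envelope sketch. The 200,000-token window is an assumed figure for illustration, and the 1/20 to 1/4 bounds are the commenter's estimate, not a measured result.

```python
def usable_context_range(advertised_tokens: int) -> tuple[int, int]:
    """Apply the rough 1/20 to 1/4 estimate to an advertised window size."""
    return advertised_tokens // 20, advertised_tokens // 4


low, high = usable_context_range(200_000)  # assumed advertised window
print(low, high)  # 10000 50000
```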

emp17344 1 hour ago | |

There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.

suprfnk 48 seconds ago | | |

I don't think they think. I still use them a lot despite that, because they are very powerful parameterised code generators.

oulipo2 38 minutes ago |

Exactly why you can't remove humans from the loop: someone has to assess that the solution is not only correct (which LLMs are quite bad at, once concurrency, logic, etc. are involved), but also elegant, maintainable, etc.

phrotoma 41 minutes ago |

"Constraint decay": isn't this just another name for the (already well understood) idea of "context rot"?