Hi HN, Over the past two years I’ve built and debugged a fair number of production pipelines—mainly retrieval‑augmented generation stacks, agent frameworks, and multi‑step reasoning services. A pattern emerged: most incidents weren’t outright crashes, but silent structural faults that slowly compromised relevance, accuracy, or stability. I began logging every recurring fault in a shared notebook. Colleagues started using the list for post‑mortems, so I turned it into a small public reference: 16 distinct failure modes (semantic drift after chunking, embedding/meaning mismatches, cross‑session memory gaps, recursion traps, etc.). The taxonomy isn’t academic; each item references a real outage or mis‑prediction we had to fix. Why share it? Common vocabulary – naming a failure mode makes root‑cause discussions faster and less hand‑wavy. Earlier detection – several teams now check new features against the list before shipping. Community feedback – if something is missing or misclassified, I’d rather learn it here than during another 3 a.m. incident. The reference has already helped a few startups (and my own projects) avoid hours of trial‑and‑error. If you work on LLM infrastructure, you might find a familiar bug—or a new one to watch for. The link to the full table and brief write‑ups is in the “url” field of this Show HN post. I’m not selling anything; it’s MIT‑licensed text. Comments, critiques, or additional failure patterns are very welcome. Thanks for taking a look. |