1) Do you not feel self-conscious or weird about calling this "EvanFlow"? Seems like a lot of people these days are naming their AI tools/skills/whatever after themselves which seems self-absorbed. Either that or they hope that if their thing takes off like OpenClaw did then they'll grab the fame that comes along with it.
2) Why does your TDD flow miss the refactor step of TDD?
We don't question when scientists name stuff after themselves so why question this? At least he gets some recognition for his work.
2): you're right and dmitry called this out below too. shipped a fix that puts REFACTOR per-cycle, instead of being a deferred "after all tests pass" step. the old step 4 was iterate-shaped not TDD-shaped.
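To make the difference concrete, here is a minimal sketch of the two step orderings. The function names and step labels are illustrative assumptions, not taken from EvanFlow's actual skill files.

```python
def deferred_refactor(behaviors):
    """Old, iterate-shaped order: RED/GREEN per behavior, one REFACTOR at the end."""
    steps = []
    for b in behaviors:
        steps += [f"RED {b}", f"GREEN {b}"]
    steps.append("REFACTOR all")  # cleanup deferred until every test passes
    return steps


def per_cycle_refactor(behaviors):
    """TDD-shaped order: every behavior gets its own RED -> GREEN -> REFACTOR cycle."""
    steps = []
    for b in behaviors:
        steps += [f"RED {b}", f"GREEN {b}", f"REFACTOR {b}"]
    return steps


if __name__ == "__main__":
    # Hypothetical behaviors, used only to show the ordering.
    for step in per_cycle_refactor(["parse", "validate"]):
        print(step)
```

In the per-cycle version the cleanup happens while the change is still small, which is the point of the REFACTOR step in classic TDD.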
What does this mean?
I can think of one example that did go somewhere: Linux.
And djb (the djb) also wrote djbdns.
There are plenty of examples, usually when it coincides with someone’s first project.
Debian is a portmanteau of Debra (Ian's girlfriend) and Ian.
I don't mind it. It's just a name.
Everybody who grew up listening to Pearl Jam had seen or used an Evenflo pacifier, baby bottle, or car seat. That's one reason the song already sounded so familiar.
Sometimes it’s helpful to ask oneself what’s the benefit of an answer. I cannot think of any for your question and the way you worded it is a bit cringe. People name things after themselves all the time. It does not matter in the slightest.
“Who are you? How dare you create anything”
TDD Guard was built when Claude Code was the only one to offer hooks. Plugins didn't exist and the models were weaker, so the validation context and instructions took more work to get right. This is why it ended up requiring test reporters for different languages.
I have started a new project that does the same TDD enforcement, also through hooks, but without reporters. It works with any test runner, and it is vendor-agnostic: it works with Claude Code, Codex, and GitHub Copilot. The validator also sees recent session history, which helps it handle cases like refactoring better.
The TDD instructions are still pretty basic compared to TDD Guard's, which have been dogfooded for a year. One thing I noticed while testing across agents is that some follow TDD a lot better than others; Codex struggled the most with the basic instructions.
Feedback welcome:
On jtfrench's unanswered question about dumb zone evasion: context length is what drives the drift. Agents go off-track when a loop runs long enough that early design context falls out. Resetting at each RED-GREEN-REFACTOR boundary keeps cycles short enough to avoid it. The hard cap of 5 iterate rounds is the same instinct applied at the macro level.
We ran into the parallel integration seam problem building tonone, a 23-agent Claude Code plugin where each domain agent works in its own worktree and integration tests are the merge contract.
https://github.com/tonone-ai/tonone if curious.
Built tool-call-grader to instrument exactly this. Session-level statistics across the tool-call trace plus six pathology detectors (silent failure, tool fixation, response bloat, schema drift, irrelevant response, cascading failure). On a hand-designed multi-agent benchmark, 7/7 scenarios passed — including specifically the case you're describing:
per-agent results look fine, schema-drift fires at the seam.
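As a rough illustration of what a trace-level schema-drift check could look like, here is a hypothetical sketch. It is not tool-call-grader's implementation. The trace format (`agent`, `tool`, `payload` keys) and the function names are assumptions made up for this example.

```python
def schema_of(payload):
    """Reduce a payload dict to its shape: field name -> type name."""
    return {key: type(value).__name__ for key, value in payload.items()}


def detect_schema_drift(trace):
    """Flag any tool call whose payload shape differs from the first one seen for that tool.

    `trace` is assumed to be a list of dicts with "agent", "tool", and "payload" keys.
    """
    baseline = {}   # tool name -> shape of the first payload observed
    findings = []
    for turn, call in enumerate(trace):
        shape = schema_of(call["payload"])
        tool = call["tool"]
        if tool not in baseline:
            baseline[tool] = shape
        elif shape != baseline[tool]:
            findings.append({
                "turn": turn,
                "agent": call["agent"],
                "tool": tool,
                "expected": baseline[tool],
                "got": shape,
            })
    return findings


if __name__ == "__main__":
    # Each agent's output looks fine alone; the mismatch only shows when compared.
    trace = [
        {"agent": "api", "tool": "emit_user", "payload": {"id": 1, "name": "a"}},
        {"agent": "ui", "tool": "emit_user", "payload": {"id": "1", "name": "b"}},
    ]
    for finding in detect_schema_drift(trace):
        print(finding)
```

Here the second call is flagged because `id` changed from `int` to `str`, even though each agent's result is valid when viewed in isolation.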
The detector runs over the trace, not the output. Catches the failure several turns before it shows up as "weird merge bug" the human has to debug. MIT licensed, npx-installable. Methodology in profile.
It sucked so hard I thought the idea of agentic coding was just a joke. I've tried it periodically and it literally never stopped sucking.
I figure if it can't do that part it isn't worth using it for any part.
Ever since then whenever people tell me it's gotten better I've tried it out and nope, still sucks.
I still get gaslit about how well it works by people who just discovered TDD, though, and who watch it power through CRUD boilerplate and get impressed, blissfully unaware that boilerplate spew is an antipattern.
> Several rules come from 2025-2026 industry research on agentic coding failure modes
What are some of the papers you read?
How are these separate steps?
TDD is how you execute, not something you tack on afterwards.
EvanFlow is a single TDD-driven loop. Say "let's evanflow this" and it walks brainstorm → plan → execute → tdd → iterate → STOP. Real checkpoints at design and plan approval. Never auto-commits, never auto-stages, never proposes integration - every git op is your call.
The three things that actually changed how I work:
1. Vertical-slice TDD. One failing test → minimal impl → next test. Watch each test fail before writing the impl that passes it. (Sounds obvious. Almost no agent does it by default. ~62% of LLM-generated test assertions are wrong per HumanEval research, so test discipline matters more than impl discipline.)
2. Embedded grilling at decision points. Before locking a plan: what breaks if a user does X? What's the rollback? What's explicitly out of scope? Catches design flaws while they're still cheap.
3. Iterate-until-clean (hard cap of 5 rounds). Re-read the diff against dead code, naming, the deletion test, assertion correctness, and a Five Failure Modes pass (hallucinated actions, scope creep, cascading errors, context loss, tool misuse). For UI: screenshot via headless Chromium.
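The iterate-until-clean step in item 3 can be sketched as a capped review/fix loop. This is a minimal sketch under assumptions: `review` and `fix` are hypothetical callbacks standing in for the diff re-read and the follow-up edits, and only the cap of 5 comes from the description above.

```python
MAX_ROUNDS = 5  # the hard cap from item 3


def iterate_until_clean(review, fix, max_rounds=MAX_ROUNDS):
    """Alternate review and fix until review finds nothing or the cap is reached.

    review(round_no) -> list of findings (an empty list means clean)
    fix(findings)    -> applies changes for those findings
    Returns (clean, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        findings = review(round_no)
        if not findings:
            return True, round_no
        fix(findings)
    # Cap reached: stop and hand back to the human instead of looping further.
    return False, max_rounds


if __name__ == "__main__":
    # Hypothetical review that reports issues in the first two rounds only.
    canned = {1: ["dead code in helper"], 2: ["assertion checks wrong field"]}
    fixed = []
    result = iterate_until_clean(lambda n: canned.get(n, []), fixed.extend)
    print(result, fixed)
```

Returning a `clean` flag rather than raising keeps the STOP explicit: the caller decides what to do when five rounds were not enough.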
For bigger plans with 3+ independent units sharing types, it forks into a parallel coder/overseer orchestration. Integration tests at touchpoints ARE the cohesion contract.
Three install paths: Claude Code plugin marketplace, npx skills add, manual copy. MIT.
No x. No y. No z. Just abc.
It's like nails on a chalkboard...