Dynamic Workflows in Claude Code (claude.com)
https://blog.cloudflare.com/dynamic-workflows/
Also, isn’t all of this already easy to do on any of the platforms (including Claude before this, and OpenAI too)?
Don't make me think.
All these knobs are also exposed in ChatGPT, which I am more familiar with when chatting. Which one of the models? Do I go Instant, Thinking, Pro? Extended Pro? Oh no, maybe I need Deep Research.
Sometimes I think it's on purpose. I fear if I try a lowest knob, it will miss something. So turn everything up. And token usage goes up.
so the ai companies give us knobs and buttons and sliders to make us more comfortable.
I need more mechanisms for controlling long-running sessions and dynamically injecting my thoughts, corrections, and nudges, rather than faster ways to burn through my tokens without knowing if the results are going to be correct.
"Agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge."
So you will be supplying the "ground truth" (test suite, detailed spec, whatever) and empower an agent to use it to guide the other agents. Currently a lot of people do this sequentially in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.
Adversarial models are a longstanding technique in ML so it makes sense they would try to go this way.
Like I had an LLM implement a spec and it said it was done... Except it had a ton of `casts` everywhere. Okay, my bad, I should have been clear: "NO CASTS". So I used the LLM to remove the casts, except it just kept making things more and more complicated and ugly.
It took me taking a break and having a shower thought to realize all the ugliness is because one type should have been broken up into 2, which would remove a ton of generics and code. But Claude never suggested that, it was always "we need at least one cast here, or we need 1000 LOC of generic factories". I tried multiple new sessions with various prompts too.
Maybe one day soon LLMs could pay off their own slop debt but at least right now I don't trust them to write code unseen.
Edit: Maybe the correct action should have been to delete everything and make it re-write everything from scratch with the clear "NO CASTS EVER" rule. But still, the point is that having an LLM clean up after an LLM doesn't feel like it works well enough to just keep it in a loop and never look at what it does.
Up until now I've used a review loop approach, where within a Claude Code session I just tell it to spawn three review sub-agents, each with context of what's going on and instructions to look over all of the changed code in search of serious/critical issues, but otherwise with a fresher look at things. It works really well for the most part (token usage aside): https://news.ycombinator.com/item?id=48277011
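For what it's worth, the fan-out part of a review round like that can be sketched in a few lines of Python. This is only an illustration under assumptions: `Finding`, `run_review_round`, and the three stub reviewers are hypothetical names, and each reviewer is a plain function standing in for a sub-agent call rather than anything Claude Code actually exposes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str  # "critical", "serious", or "minor"
    message: str


# Hypothetical: a reviewer is any callable that takes the diff text and
# returns findings. In practice it would wrap a sub-agent call.
Reviewer = Callable[[str], list[Finding]]


def run_review_round(diff: str, reviewers: list[Reviewer]) -> list[Finding]:
    """Run every reviewer in parallel, keep serious/critical findings,
    and drop duplicates that point at the same file and line."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        results = list(pool.map(lambda reviewer: reviewer(diff), reviewers))
    merged: dict[tuple[str, int], Finding] = {}
    for findings in results:
        for finding in findings:
            if finding.severity in ("critical", "serious"):
                # First reviewer to report a location wins.
                merged.setdefault((finding.file, finding.line), finding)
    return sorted(merged.values(), key=lambda f: (f.file, f.line))


# Stub reviewers standing in for three fresh sub-agent sessions.
def security_reviewer(diff: str) -> list[Finding]:
    if "log(token)" in diff:
        return [Finding("auth.py", 10, "critical", "token written to logs")]
    return []


def typing_reviewer(diff: str) -> list[Finding]:
    if "cast(" in diff:
        return [Finding("model.py", 42, "serious", "unchecked cast")]
    return []


def style_reviewer(diff: str) -> list[Finding]:
    # Reports a nitpick plus a duplicate of the security finding.
    return [
        Finding("model.py", 7, "minor", "long line"),
        Finding("auth.py", 10, "critical", "secret leaked via logging"),
    ]


sample_diff = "log(token)\nvalue = cast(Foo, raw)"
round_findings = run_review_round(
    sample_diff, [security_reviewer, typing_reviewer, style_reviewer]
)
```

The merge step is where the token cost gets paid back: three reviewers flag four things, but only two distinct serious-or-worse locations go to the fixing agent.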
I find those to be the limiting factors to speed.
I have extensive rules, I do extensive planning. Yet at implementation, the rules are not respected, errors are introduced, etc...
I spend more time fixing than writing code.
Then speed... Because of the fixes and bad code quality, even with frontier models, speed makes a very big difference. I (agents) spend hours daily doing reviews and fixes. A 5x speed boost would make me much more productive.
And when working super fast with agents, having only one computer is limiting. Even worktrees don't solve the problem because I use things like convex, chrome use, etc... and they conflict with each other all the time.
Still many problems to solve. It's already evolved so much in the last two years.
Sure, ‘human in the loop’ and all that jazz, but I feel like my knowledge suffers even with this approach. I have to use llms w pinpoint focus to get decent results.
The original copilot completions behavior might be peak llm performance for coding, sans having an agent write boilerplate and such.
But each prompt will cost your company 10 to 15 million dollars. An extra 20 million if you ask them to review the code and improve the comments.
It feels more like a bespoke build system for the specific task/project than prompting a freeform chat.
Rewriting Bun with dynamic workflows
An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust with 99.8% of the existing test suite passing, roughly 750,000 lines of Rust, and eleven days from first commit to merge. One workflow mapped the right Rust lifetime for every struct field in the Zig codebase. The next wrote every .rs file as a behavior-identical port of its .zig counterpart, hundreds of agents working in parallel with two reviewers on each file. A fix loop then drove the build and test suite until both ran clean. After the port landed, an overnight workflow addressed unnecessary data copies and opened a PR for each for final review. While not yet in production, all of this was handled by dynamic workflows. Jarred will be writing about this more in the future.
Mechanical refactors are relatively straightforward for agents.
A rewrite of bun in Rust is unlikely to be a trivial mechanical refactor. And if you are not sharing what the complicated parts were, or how big it is, how do we assess that the task was similar?
Unless you are intimately familiar with the bun codebase and you've already made that assessment.
It's telling that they used "rewrite Bun in Rust" as the proof point here. It's cool! But the vast majority of software engineering doesn't start with tens of thousands of tests, where making them pass is the whole job.
In my experience, AI still drifts from what I meant it to do on anything bigger than building a widget. My time is spent suspiciously reviewing output for changes the agent snuck in, or invariants it broke. I talked with a friend recently where the agent broke the test harness badly enough that none of the tests mattered for 3 weeks. They did pass, though, so CI never complained.
There's something at the intersection of context engineering, managing that sloppy pile of markdown plans, and good old-fashioned system understanding that's the real bottleneck.
I feel like there are more efficient ways to tackle the issues given.
You can achieve a similar result manually prompting to use subagents, yes. But the TUI for in flight dynamic workflows is really nice - great visibility into exactly what's happening.
Honestly, for anything larger than a 1 shot PR, it's worth firing off a workflow for better automatic context management alone (more work done in the first 20% sweet spot).
Like 90 agents ran to do a code review of a fairly small package I have.
They're really looking for us to increase token usage aren't they?
Is this a way to increase token burn?
I thought we covered this with Claude's C compiler. What changed?
Here is the solution to it. Built on a SQLite DB and MCP, blocking until the question is answered, supporting all possible question types, with a CLI or web interface for answers, `ask_human_question` fills the gap in efficient subagent management.
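For readers wondering what "blocking until the question is answered" over SQLite could look like, here is a rough sketch. It is not the implementation of the tool mentioned above: the table layout and the `post_question`, `answer_question`, and `wait_for_answer` functions are assumptions made for illustration, and the MCP layer is left out entirely.

```python
import sqlite3
import time

# Hypothetical schema: one row per question, answer stays NULL until a human replies.
SCHEMA = """
CREATE TABLE IF NOT EXISTS questions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    question TEXT NOT NULL,
    answer TEXT
)
"""


def _connect(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def post_question(db_path: str, question: str) -> int:
    """Agent side: store the question and return its id."""
    conn = _connect(db_path)
    try:
        with conn:  # commits on success
            cur = conn.execute(
                "INSERT INTO questions (question) VALUES (?)", (question,)
            )
        return cur.lastrowid
    finally:
        conn.close()


def answer_question(db_path: str, question_id: int, answer: str) -> None:
    """Human side (CLI or web UI would call this)."""
    conn = _connect(db_path)
    try:
        with conn:
            conn.execute(
                "UPDATE questions SET answer = ? WHERE id = ?", (answer, question_id)
            )
    finally:
        conn.close()


def wait_for_answer(
    db_path: str, question_id: int, poll_seconds: float = 0.05, timeout: float = 5.0
) -> str:
    """Agent side: block (by polling) until the answer column is filled in."""
    deadline = time.monotonic() + timeout
    conn = _connect(db_path)
    try:
        while time.monotonic() < deadline:
            row = conn.execute(
                "SELECT answer FROM questions WHERE id = ?", (question_id,)
            ).fetchone()
            if row is not None and row[0] is not None:
                return row[0]
            time.sleep(poll_seconds)
    finally:
        conn.close()
    raise TimeoutError(f"no answer for question {question_id} within {timeout}s")
```

Polling a file-backed database keeps the agent process and the answering UI fully decoupled; the timeout is there so a forgotten question fails loudly instead of hanging a subagent forever.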
I’m at the point where deciding what we should and should not do takes a lot more time than actually doing it. More agents just means running faster in potentially the wrong direction
IMO, this style of workflow/agentics is how all SWE will look long term. Automate everything into a big pipe-y thing. How it's gonna be modelled is up in the air though. Lots of different approaches:
mine: https://github.com/portpowered/you-agent-factory
https://github.com/ComposioHQ/agent-orchestrator
I did find it uses tokens like crazy. I migrated Pixel Dungeon (Java) to C# as an experiment, and it used almost 2 billion tokens. It was just 20 bucks due to deepseek flash, but I shudder thinking of how much money this uses when run on the real Claude API pricing.
I did port stb_image from C to Jai, which I was able to fully verify and harden, and that one I'll give more use. I'm also using the same workflow system to perform agentic translation of a game I work with from English to various other languages; the results are far better than the commercial "human" translation services we tested. And I also use it to fix OCR issues on PDF books I'm OCR-ing for a data pipeline. This kind of workflow/wide agent swarm system is rather useful for many things where you want to "apply" the same prompts across a whole codebase or just in parallel.
1. Support for 1-2 OOMs more agents, to do more work in parallel
2. A phased, semi-structured approach where work happens in steps
There ya go, the rewrite was for marketing.
Is this equivalent of DAGs for sub agents inside claude code? Can i pause and resume/retry workflows? How stateful are they?
I'd really appreciate it if someone from Claude Code could shed more light on the above. I’m trying to see if I can get LangGraph-equivalent DAGs here.
So far Codex /goal has been amazing but Claude Code /goal or even /loop does not work hard enough and gives up. I have observed it just claiming it’s “iterating” in a broken loop or simply giving up.
I am diffing Claude Code with them, I tend to agree with the analysis.
So far, versus my system, there are tradeoffs, but the dynamic workflows are over-tuned to use way more agents than I have ever found to add value.
It used 8 to diff our systems. I would have used 4, for example.
"the model sucks a bit so we just have best-of-4 & adversarial reviewing agents; surely one more agent will do the trick"
So, is this like a skill the LLM should follow, or an actual "workflow" in the deterministic sense?
If it's the former, is it even reliable for long running tasks? If it's the latter, can users interact with it?
Are we sure this is a good "success story" example?
I've had code bases with tens of thousands of lines of code built from scratch that I hand-reviewed every line of and worked with the AI to improve, and haven't had this issue. I feel like a significant part of this is due to an involved /plan stage -- going back and forth on building out a plan for what you want the AI to do involves surfacing the assumptions that you would have called drift if you asked them to implement it directly from your prompt.
Once the plan has been refined and is what I want it to be, getting it to implement everything in TDD style has for the most part given me 100% working code, as I wanted it to be, without issues. It definitely helps that I'm a principal-level engineer with extensive architectural experience -- but if you're able to tell the AI in detail what you want, have it ask questions for clarifications, and read through a plan before getting it implemented, and have a solid testing plus manual qa process (automated by chrome devtools mcp) in place, I've found that you can one-shot complex features, rewrites, and even not-insignificant applications that would have taken days to write by hand in a few hours.
Still, Claude will sneak things in. In my recent plan, for example, I had defined per acceptance criteria what colours the statuses should be: green for live, blue for sold, grey for anything else. It changed this to: green for live, orange for in progress, blue for sold, red for in demolition, etc. When pressed on why it did this, it was unable to explain. This is with a plan where the AC were explicitly provided from the task in Given/When/Then format and were to be adhered to strictly. I caught this during planning, but I shouldn't need to be doing this.
Even in standard prompts where I tell it "Change this label from X to Y", it ended up reordering the tabs, which was unrelated to the ask. Again I was not able to get it to explain why; it was so abrupt. And it was in a fresh context, without any pollution of what I expect it to do.
I also noticed different behaviour regarding skills: today and yesterday it would not follow skill guidance at all. For example, with the skill-writing skill, I'd have to explicitly tell it to test skills after writing them, when this is behaviour expected by default. Similarly with other skills: it knows it should have done something per the skill guidelines and does not do it at all. This is new behaviour that I had not seen a week ago.
Using the keyword “Workflow” like “Ultrathink” is problematic?
Ultrathink is uncommon enough that it is unlikely to be used in code or prompt outside its intended purpose.
Workflow is a generic keyword, used in so many contexts both inside the codebase and in orchestration tooling like, say, temporal.io or others that name their constructs “workflows”.
Why'd you guys not want to allow the traceparent in hooks, but allowed the session.id? Any plans on changing that?
do you think something like a /speed config can be introduced to adjust agent working speed and let people adjust?
Maybe blasphemy, but will workflows be able to use non-Anthropic LLMs (e.g., delegating some steps to local models, but design and review by Claude)?
In my experiments I've had some success modeling the work to be done as a DAG of typed artifacts with a combination of code + LLM doing decomposition, transforms, synthesis, and fitness checking to generate the output. It took me a lot of tries to arrive at that formula and it would be cool to have something more general. I also run part of it against local compute because it would be far beyond my budget to do it all on Opus, so something for that would be nice too.
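A "DAG of typed artifacts" in this sense can be sketched fairly compactly. The following is an assumption-laden illustration, not the commenter's system: `Step`, `execute`, and the toy pipeline are hypothetical, and each `run` callable could be ordinary code or a wrapper around an LLM call. Ordering uses the standard library's `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


@dataclass
class Step:
    name: str                       # name of the artifact this step produces
    inputs: tuple[str, ...]         # names of the artifacts it consumes
    output_type: type               # expected type of the produced artifact
    run: Callable[..., Any]         # plain code or an LLM call (hypothetical)
    check: Callable[[Any], bool] = lambda _value: True  # fitness check


def execute(steps: list[Step]) -> dict[str, Any]:
    """Run steps in dependency order, type-checking and fitness-checking
    every artifact before anything downstream can consume it."""
    by_name = {step.name: step for step in steps}
    graph = {step.name: set(step.inputs) for step in steps}
    artifacts: dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        value = step.run(*(artifacts[dep] for dep in step.inputs))
        if not isinstance(value, step.output_type):
            raise TypeError(f"{name}: expected {step.output_type.__name__}")
        if not step.check(value):
            raise ValueError(f"{name}: fitness check failed")
        artifacts[name] = value
    return artifacts


# Toy pipeline: decompose a spec, transform each task, synthesize a report.
pipeline = [
    Step("spec", (), str, lambda: "parse input; validate; emit report"),
    Step("tasks", ("spec",), list,
         lambda spec: [part.strip() for part in spec.split(";")],
         check=lambda tasks: len(tasks) > 0),
    Step("drafts", ("tasks",), dict,
         lambda tasks: {task: f"draft for {task}" for task in tasks}),
    Step("report", ("tasks", "drafts"), str,
         lambda tasks, drafts: "\n".join(drafts[task] for task in tasks),
         check=lambda text: bool(text.strip())),
]

artifacts = execute(pipeline)
```

Because each artifact is checked at the edge where it is produced, a bad LLM output stops the run at that node instead of contaminating everything downstream, and cheap steps can be pointed at local compute without changing the graph.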
Is there an example of how y'all use Dynamic Workflows internally that you could share with the rest of us here so that we can mimic something similar?
1. Autonomously landed 20+ optimizations to reduce Claude Code's token usage by ~15%
2. Ported tree-sitter, color-diff, yoga-layout, and a number of other WASM and Rust native modules to TypeScript, improving CPU and memory use by 2-10x in the process
3. Made our CI faster, and repeatedly found and fixed flaky tests (with /loop)
4. Migrated from regex-based bash static analysis to tree-sitter, reducing false positive permission prompts by 45%
5. Reduced Claude Agent SDK startup time by 61%, by repeatedly profiling and optimizing the startup path, putting up a number of PRs in the process
6. Shipped 69 code simplification PRs, deleting >10k lines of code
thx for all that amazing tec and save ai
and got
API Error: 400 messages.3.content.11: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
Tried again in
Claude 1.9659.1 (193bcb) 2026-05-28T16:22:15.000Z also but may need a new chat
It's like you guys aren't even aware of the primary problem you are all facing: your token burns aren't paying off anymore against standard coding -- and are looking net negative. I have to ask, are you this unaware of your core problem set here?
There aren't any examples, proofs, or scenarios that show why there is an improvement in the complexity or reliability of the solution, or in the efficiency of the path to the solution. I'm baffled.
I've had to put a fair chunk of effort in to skills that will run deterministic mechanisms to unslop a codebase (cyclomatic complexity grading has been really helpful here) as invariably some amount of guidance around principles will be missed over time. I've found it does help, though. Certainly I'm getting overall better results from Flash and Sonnet over multiple runs for fairly modest token increases. GPT 5.5 less so, but that's because it scores better in a first pass. I won't really know until I gauge it at the end of my sub month which has been more cost efficient for me all things considered.
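As an illustration of what a deterministic "complexity grading" gate might look like, here is a small sketch built on Python's standard `ast` module. The counting rule is a common approximation of cyclomatic complexity (one plus the number of decision points), and the grade thresholds are made up for the example; neither is taken from the commenter's setup.

```python
import ast

# Node types that add one decision point each.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points.
    Nested functions are counted into their enclosing function."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            score += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # The loop itself plus each `if` filter.
            score += 1 + len(node.ifs)
    return score


def grade(score: int) -> str:
    # Hypothetical thresholds; tune to taste.
    if score <= 5:
        return "A"
    if score <= 10:
        return "B"
    if score <= 20:
        return "C"
    return "F"


def grade_source(source: str) -> dict[str, tuple[int, str]]:
    """Return {function name: (complexity, grade)} for every function."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = cyclomatic_complexity(node)
            report[node.name] = (score, grade(score))
    return report


SAMPLE = '''
def simple(x):
    return x + 1

def branchy(items, flag):
    total = 0
    for item in items:
        if item > 0 and flag:
            total += item
        elif item < 0:
            total -= item
    return total if total else None
'''

report = grade_source(SAMPLE)
```

A skill can then tell the agent to refactor anything graded below some bar and re-run the check, which turns a fuzzy "keep it simple" principle into something that can actually fail.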
Webdev here, but currently I have:
- a skill where I outlined what the architecture of the system should look like, with guards (static analysis, architecture tests, linting) confirming that the code it generates adheres to standards
- a skill that tells it what tests should look like (use generators, write both feature / unit tests)
- a skill that tells it to generate docs from the code in a form of acceptance criteria (Given / When / Then)
- a skill that tells it to generate frontend uat tests + accompanying backend seeders given the AC
- a skill that tells it to verify that ticket objectives match what was delivered
At this point I still need to guide it to move task from one stage to the other (coding, testing, verification that indeed what was coded adheres to what was required), but I believe that these dynamic workflows can automate this work as well.
To me, it seems the models are inherently designed to do this. Creating more verbose output than input, generating plans that introduce things I didn’t ask for, extras, more “defensive” code that makes sense at first but is completely unnecessary in practice… I find it exhausting, but it’s important to pare down the output / plans at each stage and trim the generated stuff that isn’t needed.
I don’t see them fixing this any time soon, and thus human in the loop is a requirement to use these tools effectively. That is, unless you love your slot machine dopamine rush enough to ignore quality gates and respect for your peers' time.
The pure LLM, no human intervention vibe-coded PRs on Bun since the vibe-rewrite to Rust contain the worst coding horrors I've seen in 20 years of programming.
Setting aside the quality of the change itself (I would have done it differently, for sure: it is pretty straightforward to build a safe abstraction out of this type), the utterly pointless "source-text consistency test" added here is easily the worst example of "test repeats implementation" I have seen in my career:
https://github.com/oven-sh/bun/pull/30728/files#diff-863477b...
The current baseline workflow is something like agent output -> human review -> agent refinement -> human review -> agent refinement -> ...
But agents are capable of making meaningful improvements to their own output. I'm hoping dynamic workflows move towards something like:
agent output -> agent review -> agent refinement -> (cycle to fixed point) -> final human review
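The "cycle to fixed point" step is the part worth being careful about, since it needs a stop condition for the case where the agents never converge. A minimal sketch, assuming hypothetical `review` and `refine` callables that stand in for agent calls:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    draft: str
    refinements: int   # how many times refine() changed the draft
    converged: bool    # True only if the last review found nothing


def refine_to_fixed_point(
    draft: str,
    review: Callable[[str], list[str]],        # returns a list of issues
    refine: Callable[[str, list[str]], str],   # returns a revised draft
    max_rounds: int = 5,
) -> Outcome:
    """Alternate review and refinement until the reviewer is silent,
    the refiner stops changing anything, or the round budget runs out."""
    for applied in range(max_rounds):
        issues = review(draft)
        if not issues:
            return Outcome(draft, applied, True)
        revised = refine(draft, issues)
        if revised == draft:
            # Stalled: the refiner cannot act on the remaining issues.
            return Outcome(draft, applied, False)
        draft = revised
    # Budget exhausted: report honestly whether the final draft is clean.
    return Outcome(draft, max_rounds, not review(draft))


# Toy stand-ins for the agents: flag leftover markers, fix one per round.
def toy_review(text: str) -> list[str]:
    return [marker for marker in ("TODO", "FIXME") if marker in text]


def toy_refine(text: str, issues: list[str]) -> str:
    return text.replace(issues[0], "", 1)


outcome = refine_to_fixed_point("x = 1 TODO FIXME", toy_review, toy_refine)
```

Anything that comes back with `converged=False` is exactly what should be escalated to the final human review rather than looped on again.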
An agent can help you create the specification, but it's up to you to know whether it's correctly testing that you got the result you wanted.
Curious to learn more on this (unless there’s a write-up in the works). I’m naive on this matter but:
1. Is this because it’s higher cost when passing objects back and forth across the JS/native boundary?
2. Does this have anything more specific to do with use of Bun?
3. Is the stance for Claude Code then to keep all the deps in raw TypeScript?
4. How do you folks keep these ported deps up-to-date?
This reads like a CV, not trying to help or educate.
Eg is Zed capable of using a Claude Code Subscription?
Yes. Zed connects to Claude Code via ACP.
As usual though it's not super clear exactly what is allowed or not.