Orchestrate teams of Claude Code sessions (code.claude.com)
> I went to senior folks at companies like Temporal and Anthropic, telling them they should build an agent orchestrator, that Claude Code is just a building block, and it’s going to be all about AI workflows and “Kubernetes for agents”. I went up onstage at multiple events and described my vision for the orchestrator. I went everywhere, to everyone. (from "Welcome to Gas Town" https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...)
That Anthropic releases Agent Teams now (as rumored a couple of weeks back), after they've already adopted a tiny bit of beads in the form of Tasks, means that either they've been building them already back when Steve pitched orchestrators, or they've decided that he's been right and it's time to scale the agents. Or they've arrived at the same conclusions independently -- it won't matter in the larger scale of things. I think Steve greatly appreciates it existing; if anything, this is a validation of his vision. We'll probably be herding polecats in a couple of months officially.
The main claude instance is instructed to launch as many ralph loops as it wants, in screen sessions. It is told to sleep for a certain amount of time to periodically keep track of their progress.
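A rough sketch of that supervisor pattern, assuming GNU screen is installed. The worker command (for example `./ralph_loop.sh`) and the session names are hypothetical placeholders, not anything from the setup described above; the only real pieces are `screen -dmS` (start a detached, named session) and `screen -ls` (list sessions).

```python
import re
import subprocess
import time


def launch_command(session: str, worker_cmd: str) -> list[str]:
    """Build the argv that starts worker_cmd in a detached, named screen session."""
    return ["screen", "-dmS", session, "bash", "-c", worker_cmd]


def parse_sessions(ls_output: str) -> set[str]:
    """Extract session names from `screen -ls` lines of the form '<pid>.<name> (Detached)'."""
    return set(re.findall(r"^[ \t]*\d+\.(\S+)", ls_output, flags=re.MULTILINE))


def supervise(sessions: list[str], worker_cmd: str, poll_seconds: float = 60.0) -> None:
    """Start one loop per session name, then sleep and poll until all have exited."""
    for name in sessions:
        subprocess.run(launch_command(name, worker_cmd), check=True)
    while True:
        # `screen -ls` can exit non-zero even when sessions exist, so no check=True here.
        out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
        alive = parse_sessions(out) & set(sessions)
        if not alive:
            break
        print(f"still running: {sorted(alive)}")
        time.sleep(poll_seconds)
```

Calling `supervise(["loop-1", "loop-2"], "./ralph_loop.sh")` would start two loops and check on them once a minute; how progress is judged beyond "the session still exists" is left open here.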
It worked reasonably well, but I don't prefer this way of working... yet. Right now I can't write spec (or meta-spec) files quick enough to saturate the agent loops, and I can't QA their output well enough... mostly a me thing, i guess?
Same for me, however, the velocity of the whole field is astonishing and things change as we get used to them. We are not talking that much about hallucinating anymore, just 4-5 months ago you couldn't trust coding agents with extracting functionality to a separate file without typos, now splitting Git commits works almost without a hitch. The more we get used to agents getting certain things right 100% of the time, the more we'll trust them. There are many many things that I know I won't get right, but I'm absolutely sure my agent will. As soon as we start trusting e.g. a QA agent to do his job, our "project management" velocity will increase too.
Interestingly enough, the infamous "bowling score card" text on how XP works has demonstrated inherently agentic behaviour in more ways than one (they just didn't know what "extreme" was back then). You were supposed to implement a failing test and then implement just enough functionality for this test to not fail anymore, even if the intended functionality was broader -- which is exactly what agents reliably do in a loop. Also, you were supposed to be pair-driving a single machine, which has been incomprehensible to me for almost decades -- after all, every person has their own shortcuts, hardware, IDEs, window managers and what not. Turns out, all you need is a centralized server running a "team manager agent" and multiple developers talking to him to craft software fast (see tmux requirement in Gas Town).
The fact that Anthropic and OpenAI have been going on this long without such orchestration, considering the unavoidable issues of context windows and unreliable self-validation, without matching the basic system maturity you get from a default Akka installation shows us that these leading LLM providers (with more money, tokens, deals, access, and better employees than any of us), are learning in real time. Big chunks of the next gen hype machine wunder-agents are fully realizable with cron and basic actor based scripting. Deterministically, write once run forever, no subscription needed.
Kubernetes for agents is, speaking as a krappy kubernetes admin, not some leap, it’s how I’ve been wiring my local doom-coding agents together. I have a hypothesis that people at Google (who are pretty ok with kubernetes and maybe some LLM stuff), have been there for a minute too.
Good to see them building this out, excited to see whether LLM cluster failures multiply (like repeating bad photocopies), or nullify (“sorry Dave, but we’re not going to help build another Facebook, we’re not supposed to harm humanity and also PHP, so… no.”).
I remember having conversations about this when the first ChatGPT launched and I don’t work at an AI company.
Like, who cares? Judging from his blog recount of this it doesn't seem like anybody actually does. He's an unnecessarily loud and enthused engineer inserting himself into AI conversations instead of just playing office politics to join the AI automation effort inside of a big corporation?
"wow he was yelling about agent orchestration in March 2025", I was about 5 months behind him, the company I was working for had its now seemingly obligatory "oh fuck, hackathon" back in August 2025
and we all came to the same conclusions. conferences had everyone having the same conclusion, I went to the local AWS Invent, all the panels from AWS employees and Developer Relations guys were about that
it stands to reason that any company working on foundational models and an agentic coding framework would also have talent thinking about that sooner than the rest of us
so why does Yegge want all of this attention and think it's important at all, it seems like it would have been a waste of energy to bother with, like in advance everyone should have been able to know that. "Anthropic! what are you doing! listen to meeeehhhh let me innnn!"
doesn't make sense, and gastown's branding is further unhinged goofiness
yeah I can't really play the attribution games on this one, can't really get behind who cares. I'm glad it's available in a more benign format now
... the "limit" was that agents were not as smart then, the context window was much smaller, and RLVR wasn't a thing, so agents were trained for just function calling, but not agent calling/coordination.
we have been doing it since then, the difference really is that the models have gotten really smart and good to handle it.
But this shows how much stuff is still to do in the ai space
Haven't tried Kimi, hear good things.
At least, my M1 Pro seems to struggle and take forever using them via Ollama.
I'm burning through so many tokens on Cursor that I've had to upgrade to Ultra recently - and i'm convinced they're tweaking the burn rate behind the scenes - usage allowance doesn't seem proportional.
Thank god the open source/local LLM world isn't far behind.
Are you spending more than $150k per year on AI?
(Also, you're talking about the cost of your Cursor subscription, when the article is about Claude Code. Maybe try Claude Max instead?)
Wonder how they compare?
No polecats smh
I love that we are in this world where the crazy mad scientists are out there showing the way that the rest of us will end up at, but ahead of time and a bit rough around the edges, because all of this is so new and unprecedented. Watching these wholly new abstractions be discovered and converged upon in real time is the most exciting thing I've seen in my career.
Though I do hope the generated code will end up being better than what we have right now. It mustn't get much worse. Can't afford all that RAM.
It's just HN that's full of "I hate AI" or wrong contrarian types who refuse to acknowledge this. They will fail to reap what they didn't sow and will starve in this brave new world.
This new orchestration feature makes it much more useful since they share a common task list and the main agent coordinates across them.
[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
This seems handled by this new agent which is cool.
I gave up on worktrees and hacked together a solution with fine-grained lockfiles for editing, running builds, etc that worked surprisingly well for what it was
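One way such a lockfile scheme could look, as a minimal sketch rather than the commenter's actual implementation: each agent takes a named lock (say, one for a file being edited, one for the build) by atomically creating a file, and waits if another agent already holds it. The lock path and timeout values are illustrative assumptions.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(path: str, timeout: float = 30.0, poll: float = 0.2):
    """Hold an exclusive lock by atomically creating `path`; remove it on exit."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so creation is atomic.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        os.close(fd)
        yield
    finally:
        os.remove(path)
```

Usage would be along the lines of `with file_lock("build.lock"): run_build()`, where `run_build` is whatever the agent does while holding the lock. A holder that crashes leaves a stale lockfile behind, so a real setup would want some cleanup policy on top of this.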
We cannot allow model providers to own the browsers, CLIs, memory, IDEs, extensions and other tooling. It's not just a matter of power; they also just suck at it, as I experience every time I have to use claude code instead of amp.
I truly hope we get the pattern of innovation that looks like:
- some dude vibecodes a really cool idea
- model providers build into their reference implementations
- model providers optimize models to work optimally
- startup and/or open source projects step in and build something that is actually usable and opens a new market segment
We saw this play out beautifully with amp, kilo, roo, cline, continue
Another aspect is that we do not want interfaces just made for agents to work in teams, we want software made for humans and agents, that are true platforms for these agent teams to collaborate in.
Why do agents need to speak to each other if they’re just doing the work correctly the first time?
Is it an admission that a single agent is not useful and reliable enough?
I've switched this over to a team of 4 now that talk to each other to discuss issues they find and it's amazing. They confirm between themselves and if they wrongly identified something the others correct them.
I understand that it works better, but I am rightfully pointing out that it's less efficient.
An analogy would be putting a V8 engine into a pickup truck to make it go as fast as a Mazda Miata.
Assign roles to different models and have them coordinate: Claude as the lead, Codex on backend, Gemini on frontend, etc.
I wrote about my experiences with multi-agent orchestration here: https://x.com/khaliqgant/status/2019124627860050109?s=46
Meanwhile, the same issues that have plagued these tools since their inception are largely ignored: hallucination, inaccuracy, context collapse, etc. These won't be solved by engineering, but by new research and foundational improvements.
On one hand, solid engineering was sorely needed, and can extract a lot of value from the current tech. But on the other, all these announcements and improvements feel like companies grasping at straws to keep the hype cycle going by any means necessary. Charts must go up and to the right, or investors get antsy.
It's all adding to the mountain of signs that suggest that this isn't the path to artificial intelligence. It's interesting tech, with possibly many valuable applications, but the "AI" narrative is frankly tiring. I wish I could fast forward on this speculative phase, go past the inevitable crash, and arrive at a timeframe where we've figured out what this tech is actually good for, and where we hopefully use it more for good than evil.
(i thought gas town was satire? people in comments here seem to be saying that gas town also had multi-agent file sharing for work tracking)
I guarantee you that price will double by 2027. Then it’ll be a new car payment!
I’m really not saying this to be snarky, I’m saying this to point out that we’re really already in the enshittification phase before the rapid growth phase has even ended. You’re paying $200 and acting like that’s a cheap SaaS product for an individual.
I pay less for Autocad products!
This whole product release is about maximizing your bill, not maximizing your productivity.
I don’t need agents to talk to each other. I need one agent to do the job right.
Then, in your prompt you tell it the task you want, then you say, supervise the implementation with a sub agent that follows the architecture skill. Evaluate any proposed changes.
There are people who maximize this, and this is how you get things like teams. You make agents for planning, design, qa, product, engineering, review, release management, etc. and you get them to operate and coordinate to produce an outcome.
That's what this is supposed to be, encoded as a feature instead of a best practice.
This sounds more like an automation of that idea than just N-times the work.
Just ask claude to write a plan and review/edit it yourself. Add success criteria/tests for better results.
You run out of context so quickly and if you don’t have some kind of persistent guidance things go south
```
Rules:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
For all moves, follow the standard Tower of Hanoi procedure: If the previous move did not move disk 1, move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
If the previous move did move disk 1, make the only legal move that does not involve moving disk1.
Use these clear steps to find the next move given the previous move and current state.
Previous move: {previous_move} Current State: {current_state} Based on the previous move and current state, find the single next move that follows the procedure and the resulting next state.
```
This is buried down in the appendix while the main paper is full of agentic swarms this and millions of agents that and plenty of fancy math symbols and graphs. Maybe there is more to it, but the fact that they decided to publish with such a trivial task which could be much more easily accomplished by having an llm write a simple python script is concerning.
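For reference, the procedure in the quoted prompt fits in a short script. This is a sketch of that iterative rule (alternate between moving disk 1 one peg clockwise and making the only legal move that does not involve disk 1), not code from the paper; the function name and the `(disk, from_peg, to_peg)` move format are my own choices.

```python
def solve_hanoi(n: int):
    """Solve n-disk Tower of Hanoi with the alternating rule from the quoted prompt."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1; top of a peg is the last item
    moves = []
    move_smallest = True  # the first move always moves disk 1
    for _ in range(2**n - 1):
        if move_smallest:
            # Move disk 1 clockwise one peg: 0 -> 1 -> 2 -> 0.
            src = next(i for i, p in enumerate(pegs) if p and p[-1] == 1)
            dst = (src + 1) % 3
        else:
            # The two pegs without disk 1 on top admit exactly one legal move.
            a, b = [i for i in range(3) if not (pegs[i] and pegs[i][-1] == 1)]
            if not pegs[a]:
                src, dst = b, a
            elif not pegs[b]:
                src, dst = a, b
            elif pegs[a][-1] < pegs[b][-1]:
                src, dst = a, b
            else:
                src, dst = b, a
        disk = pegs[src].pop()
        pegs[dst].append(disk)
        moves.append((disk, src, dst))
        move_smallest = not move_smallest
    return moves, pegs


moves, pegs = solve_hanoi(3)
print(len(moves), pegs)  # 7 [[], [3, 2, 1], []]
```

With the clockwise rule the tower ends on peg 1 for an odd number of disks and on peg 2 for an even number, always in `2**n - 1` moves.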
this does eat up tokens _very_ quickly though :(
I don't need anything more complicated than that and it works fine - also run greptile[1] on PR's
https://github.com/FredericMN/Coder-Codex-Gemini https://github.com/fengshao1227/ccg-workflow
This one also seems promising, but I haven't tried it yet.
https://github.com/bfly123/claude_code_bridge
All of them are made by Chinese dev. I know some people are hesitant when they see Chinese products, so I'll address that first. But I have tried all of them, and they have all been great.
https://www.augmentcode.com/product/intent
can use the code AUGGIE to skip the queue. Bring your own agent (powered by codex, CC, etc) coming to it next week.
1. GPT-5.2 Codex Max for planning
2. Opus 4.5 for implementation
3. Gemini for reviews
It’s easy to swap models or change responsibilities. Doc and steps here: https://github.com/sathish316/pied-piper/blob/main/docs/play...
The key is streaming NDJSON output to track cost per iteration and detect completion markers. The human stays in control by editing CLAUDE.md between runs to steer the project.
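A minimal sketch of that kind of stream scanning, under loudly stated assumptions: the event field names `cost_usd` and `text` and the `<DONE>` marker are hypothetical stand-ins, not the actual schema of any particular CLI. Each line of the stream is treated as one JSON object.

```python
import json
from typing import Iterable

DONE_MARKER = "<DONE>"  # hypothetical completion marker the agent is told to print


def scan_stream(lines: Iterable[str]) -> tuple[float, bool]:
    """Sum per-event cost and report whether the completion marker appeared."""
    total_cost = 0.0
    done = False
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate partial or non-JSON lines in the stream
        if not isinstance(event, dict):
            continue
        total_cost += float(event.get("cost_usd", 0.0))  # assumed field name
        if DONE_MARKER in str(event.get("text", "")):  # assumed field name
            done = True
    return total_cost, done
```

In practice `lines` would be the stdout of the agent process read line by line, and the outer loop would stop iterating once `done` is true or the accumulated cost passes a budget.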
This would also be true of Junior Engineers. Do you find them impossible to work with as well?
As usual, the hard part is the actual doing and producing a usable product.
The truth is that people are doing experiments on most of this stuff, and a lot of them are even writing about it, but most of the time you don't see that writing (or the projects that get made) unless someone with an audience already (like Steve Yegge) makes it.
Also, because they are stuck in a language and an ecosystem that cannot reliably build supervisors, hierarchies of processes etc. You need Erlang/Elixir for that. Or similar implementations like Akka that they mention.
[1] Yes, they claim their AI-written slop in Claude Code is "a tiny game engine" that takes 16ms to output a couple of hundred of characters on screen: https://x.com/trq212/status/2014051501786931427
Often times if I'm only working on a single project or focus, then I'm not using most of those roles at all and it's as you describe, one agent divvying out tasks to other agents and compiling reports about them. But due to the fact that my velocity with this type of coding is now based on how fast I can tell that agent what I want, I'm often working on 3 or 4 projects simultaneously, and Gas Town provides the perfect orchestration framework for doing this.
Ideally you could eventually remove the agentic supervisor. But for some cases you would want to keep it around, or at least a smaller model which suffices.
So the LLM will do something and not catch at all that it did it badly. But the same LLM asked to review against the same starting requirement will catch the problem almost always
The missing thing in these tools is that automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
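The loop being asked for could be sketched like this. The `implement` and `review` callables are hypothetical wrappers around whatever model calls you use (one prompted to implement, one prompted to review against the original requirement); nothing here is an existing tool's API.

```python
from typing import Callable, Optional


def implement_with_review(
    requirement: str,
    implement: Callable[[str, Optional[str]], str],
    review: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Alternate implementation and review until the reviewer has no objections.

    implement(requirement, feedback) returns a draft; review(requirement, draft)
    returns None when satisfied, otherwise a description of the problems found.
    """
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = implement(requirement, feedback)
        feedback = review(requirement, draft)
        if feedback is None:
            return draft, True  # reviewer accepted this draft
    return draft, False  # gave up after max_rounds; a human should look
```

The round cap matters: without it, two models that disagree can bounce feedback back and forth indefinitely while burning tokens.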
and incantation you put on your resume to double your salary for a few months before the company you jumped ship to gets obsoleted by the foundational model
At work tho we use Claude Code thru a proxy that uses the model hosted on AWS bedrock. It’s slower than consumer direct-to-Anthropic and you have to wait a bit for the latest models (Opus 4.5 took a while to get), but if our stats are to be believed it’s much much cheaper.
But it continually, wildly performs slower and falls short every time I’ve tried.
If it falls short every time you've tried, it's likely that one or more of these is true:
A. You're working on some really deep thing that only world-class experts can do, like optimizing graphics engines for AAA games.
B. You're using a language that isn't in the top ~10 most popular in AI models' training sets.
C. You have an opportunity to improve your ability to use the tools effectively.
How many hours have you spent using Claude Code?
Not exactly world-class software.
Using these tools takes quite a bit of effort but even after doing all those steps to use the tool well, I still got this project done in a few days when it otherwise would have taken me 1-2 months and likely simply would never have happened at all.
And whether you have a decent PRD or spec. Are you trying to prompt the harness with one bit at a time, or did you give it a complete spec and ask it to analyze it and break it down into individual issues with dependencies (e.g. using beads and beads_viewer)?
I'm not looking for reasons to criticize your approach or question your experience, but your answers may point to opportunities for you to get more out of these tools.
If you're using Claude Code and you have a friend who has had more success with these tools, consider exporting your transcripts and letting them have a look: https://simonwillison.net/2025/Dec/25/claude-code-transcript...
This is a relatively common skill. One thing I always notice about the video game industry is it's much more globally distributed than the rest of the software industry.
Being bad at writing software is Japan's whole thing but they still make optimized video games.
The issues I ran into are primarily “tail-chasing” ones - it gets into some attractor that doesn’t suit the test case and fails to find its way out. I re-benchmark every few months, but so far none of the frontier models have been able to make changes that have solved the issue without bloating the codebase and failing the perf tests.
It’s fine for some boilerplate dedup or spinning up some web api or whatever, but it’s still not suitable for serious work.
It's insulting that criticism is often met with superficial excuses and insinuation that the user lacks the required skills.
https://mitchellh.com/writing/my-ai-adoption-journey
My experience mirrors that of Mitchell. It absolutely is at the level now where AI can free up time to do the really interesting stuff.
GP said 'falls short every time I’ve tried'. Note the word 'every'.
Claude would be worse than an expert at this, but this is a benchmarkable task. Claude can do experiments a lot quicker than a human can. The hard part would be ensuring that the results aren't just gaming your benchmark.
I feel like comparison just to a junior developer is also becoming a fairly outdated comparison. Yes, it is worse in some ways, but also VASTLY superior in others.
I know this was last year but...
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
If I pay $3k/month to a developer and a $200/month tool makes them 10% more productive I will pay it without thinking.
If you’re not able to get US$thousands out of these models right now either your expectations are too high or your usage is too low, but as a small business owner and part/most-time SWE, the pricing is a rounding error on value delivered.
But as an individual with no profit motive, no way.
I use these products at work, but not as much personally because of the bill. And even if I decided I wanted to pursue a for-profit side project I’d have to validate its viability before even considering a $200 monthly subscription
This did require some amount of effort on my part, to test and iterate and so on, but much less than if I needed to write all the code myself. And, because these programs are for personal use, I don't need to review all the code, I don't have security concerns and so on.
$100 every month for a service that writes me custom applications... I don't know, maybe I'm being stupid with my money, but at the moment it feels well worth the price.
- $20 for Claude Pro (Claude Code)
- $20 for ChatGPT Plus (Codex)
- Amp Free Plan (with ads and you get about $10 of daily value)
So you get to use 3 of the top coding agents for $40 a month.
with the US salaries for SWEs $1000/month is not a rounding error for all but definitely for some. say you make $100/hr and CC saves you say 30hrs / month? not rounding error but no brainer. if you make $200+/hr it starts to become a rounding error. I have multiple max accounts at my disposal and at this point would for sure pay $1000/month for max plan. it comes down to simple math
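The "simple math" spelled out, using only the figures from the comment above ($100/hr, 30 hours saved per month, a $1000/month plan). The function names are just for illustration.

```python
def net_monthly_value(hourly_rate: float, hours_saved: float, tool_cost: float) -> float:
    """Value of the time saved in a month, minus what the tool costs."""
    return hourly_rate * hours_saved - tool_cost


def break_even_hours(hourly_rate: float, tool_cost: float) -> float:
    """Hours per month the tool must save before it pays for itself."""
    return tool_cost / hourly_rate


print(net_monthly_value(100, 30, 1000))  # 2000
print(break_even_hours(100, 1000))       # 10.0
```

At $200/hr the same plan breaks even at 5 hours saved per month, which is where it starts to look like a rounding error.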
1. 1-3 LLM vendors are substantially higher quality than other vendors and none of those are open source. This is an oligarchy and the scenario you described will play out.
2. >3 LLM vendors are all high quality and suitable for the tasks. At least one of these is open source. This is the "commodity" scenario, and we'll end up paying roughly the cost of inference. This still might be hundreds per month, though.
3. Somewhere in between. We've got >3 vendors, but 1-3 of them are somewhat better than the others, so the leaders can charge more. But not as much more than they can in scenario #1.
The only place frontier labs will be able to profit take is niche models for specific purposes where they can control who has access to traces tightly. Any general purpose LLM with highly available traces is gonna get distilled down instantly.
Traditional SaaS products don't write code for me. They also cost much less to run.
I'm having a lot of trouble seeing this as enshittification. I'm not saying it won't happen some day, but I don't think we're there. $200 per month is a lot, but it depends on what you're getting. In this case, I'm getting a service that writes code for me on demand.
The enshittification is that the costs are going up faster than inflation and companies like OpenAI are talking about adding advertisements.
https://www.fintechweekly.com/magazine/articles/cursor-prici...
https://hostbor.com/claude-ai-max-plan-explained/
We can see especially in the case of Claude AI Max that while it sounds like you’re getting better value than the cheaper plans, the company is now encouraging less efficient use of the tool (having multiple agents talking to each other, rather than improving models so that one agent is doing work correctly).
Eh, I'd call those a sort of programming language. The user is still writing code, albeit in a "friendlier" manner. You can't just ask for what you want in English.
> The enshittification is that the costs are going up faster than inflation and companies like OpenAI are talking about adding advertisements.
In 1980, IT would have cost $0 at most companies. It's okay for costs to go up if you're getting a service you were not getting before.
Autodesk Fusion for manufacturing costs less than Claude Max and you literally can’t do your job without it.
So Autodesk takes you from 0 to 100% productivity for under $200 a month and companies are expected to pay $200+ to gain an extra 10-20%?
That math isn’t how it works with any other business logic tools.
Yesterday I fed claude very surgical instructions on how the bug happens, and what I want to happen instead, and it oneshot the fix. I had a solution in about 5 minutes, whereas it would have taken me at least an hour, but most likely more time to get to that point.
Literally an hour or two of my day was saved yesterday. I am salaried at around $250/hour, so in that one interaction AI saved my employer $250-500 in wages.
AI allows me to be a T shaped developer, I have over a decade of deep experience in infrastructure, but know fuck all about front end stuff. But having access to AI allows me as an individual who generally knows how computers work to fix a simple problem which is not in my domain.
My process, which probably wouldn't work with concurrent agents because I'm keeping an eye on it, is basically:
- "Read these files and write some documentation on how they work - put the documentation in the docs folder" (putting relevant files into the context and giving it something to refer to later on)
- "We need to make change X, give me some options on how to do it" (making it plan based on that context)
- "I like option 2 - but we also need to take account of Y - look at these other files and give me some more options" (make sure it hasn't missed anything important)
- "Revised option 4 is great - write a detailed to-do list in the docs/tasks folder" (I choose the actual design, instead of blindly accepting what it proposes)
- I read the to-do list and get it rewritten if there's anything I'm not happy with
- I clear the context window
- "Read the document in the docs folder and then this to-do list in the docs/tasks folder - then start on phase 1"
- I watch what it's doing and stop if it goes off on one (rare, because the context window should be almost empty)
- Once done, I give the git diffs a quick review - mainly the tests to make sure it's checking the right things
- Then I give it feedback and ask it to fix the bits I'm not happy with
- Finally commit, clear context and repeat until all phases are done
Most of the time this works really well.
Yesterday I gave it a deep task that touched many aspects of the app. This was a Rails app with a comprehensive test suite - so it had lots of example code to read, plus it could give itself definite end points (they often don't know when to stop). I estimated it would take me 3-4 days to complete the feature by hand. It made a right mess of the UI but it completed the task in about 6 hours, and I spent another 2 hours tidying it up and making it consistent with the visuals elsewhere (the logic and back-end code was fine).
So either my original estimate is way off, or it has saved me a good amount of time there.
“The research is wrong.”
It's outdated, doesn't differentiate between people trying to incorporate it in their current workflow and the people who apply themselves to entirely new ones. It doesn't represent me in any way and I am releasing features to my platform daily now, instead of weekly. So I can wholeheartedly disagree with its conclusion.
The earth is either flat or it isn't. It's easy to prove it's not flat. It's not easy to conclude that the results of a study in a field that changes daily represent all people working in it, including the ones who did not participate.
The reason we don’t see any other research is because it’s nigh impossible to study a moving field. Especially at this pace.
If you have any ideas on how to measure objectively while this landscape changes daily, please share them with us. Maybe a researcher will jump on this bandwagon and prove you right.