Orchestrating AI code review at scale

Orchestrating AI code review at scale(blog.cloudflare.com)

84 points by pramodbiligiri 3 days ago | 33 comments

>Code review is a fantastic mechanism for catching bugs and sharing knowledge

"Sharing knowledge" is one of the first phrases in the article, and highlighted as a key benefit of code review. But the loss to human-capital from this process is never examined in the post.

> Trivial reviews (typo fixes, small doc changes) cost 20 cents on average

They did around 25,000 of these runs (about 20% of total). So CF spent $5k in the period making language models run through PRs which were <10 lines long. I get that CF engineers are paid well, but the labour cost of having an intern/entry level engineer spend ~30-60s looking through these is likely close to $0.20, and that engineer builds some human-capital while they're at it.

alain94040 32 minutes ago | |

the labour cost of having an intern/entry level engineer spend ~30-60s looking through these is likely close to $0.20

Did you do the math? Your estimate feels way off. First, I doubt an intern would process one PR in 30s. Maybe 2-3 minutes, to read 10 lines carefully looking for typos and indentation mistakes. We pay interns close to $100K these days (in a company like CloudFlare), so that's ~80c/minute. My estimate is therefore closer to $1.6 per PR. About 10X.

You are correct that there is a residual value with the intern, over time they would start learning (a little bit) about the code base.

joshuamoyers 1 hour ago |

we’ve been struggling with review throughput. this actually seems worthwhile to build at this point though i remain fairly skeptical of workflows that are agent-only, at a point it seems like the only practical solution.

we are finding lots of value in self review. its the “imagine you are doing a synchronous paired review with someone - anything that is difficult to explain, has a code smell, doesnt fit the architecture of the system around you, write a comment.” then at the end, agents do a good job of looping over PR comments.

the second thing would be a guided, educational code review tool - there are a few attempts at this, but nothing that has a good enough interface to actually stick. organize hunks by semantic importance, spend some tokens exploring the surrounding systems, showing how new code, public apis and data model flow with the existing design, and allow a human to traverse larger PRs more quickly.

thank you to cloudflare for publishing this.

ramoz 49 minutes ago | |

I do think Cloudflare probably institutes a similar manual review process as well. I have a handful of fairly vocal and supportive engineers I stay in contact with around https://plannotator.ai (there is an integrated code review surface that creates a feedback loop with your local agent).

> agents do a good job of looping over PR comments

This is the easy part. Most harnesses enable some sort of integration now, so you can actually create a smooth local experience around this as well - better code before it ships to more costly review or bloats PR threads.

> guided, educational code review tool

This is a bit tougher, and I find the main harness chat tends to work best. I learn better when I'm more engaged and aware of what I'm asking. It's easy to stick a code tour type of thing on a screen. It's hard to really nail the right attention and learning mechanism around it.

thih9 4 hours ago |

> Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents.

I’d prefer to have that happen as some sort of pre commit hook, before a merge request is sent. The feedback loop might be a bit faster and the process might produce less noise this way.

suika 1 hour ago |

As a solo dev or rather nowadays more so only a decision maker / agent overseer, I came to enjoy letting my agents develop against a Gerrit repository / workflow. Dev agent pushes a CL, review agent picks it up (not just the diff, but the full repo), runs tests/reviews/review-subagents and concludes by posting a review as well as a vote. This goes back and forth with new patch sets / replies to the threads. Eventually the CL gets a +2 or whatever and I have the final call to manually submit it. It is way slower compared to just pushing through development with one agent doing everything yolo against a normal repository, but it seems to me that the additional time is well spent (no, I don't have fancy graphs or similar analysis to prove this other than my gut feeling after looking at recent development results).

rzmmm 6 hours ago |

> The entire system also runs locally.

I think approaches like this don't need to run other than locally. Maybe integrated as pre-push hook. The system is nondeterministic, so it's at odds with the purpose of CI.

krzyk 30 minutes ago | |

It is starting the review during CI (CI just triggers the review), not blocking merges like failed build or lint failures.

proofofcontempt 5 hours ago | |

I'm not sure the people integrating it into CI process understand what CI is.

Scea91 4 hours ago | | |

Same can be said about human review if the argument is non-determinism.

fhd2 5 hours ago | |

I'd argue it's pretty much like monitoring, which certainly benefits from multiple people seeing the same stats and alerts. I agree it's at odds with CI/CD and should probably not block anything, like deterministic checks commonly do.

plmpsu 5 hours ago |

I built a more naive version for our team using Copilot and GitHub actions and it works quite well (wish I had metrics too). The team loves it.

The ROI here is so high that I don't mind using the strongest model available for the actual code review. I don't trust Sonnet and such. Just let Opus or GPT 5.5 do the whole thing and pay a bit more for less complexity.

krzyk 24 minutes ago | |

I did similarly with copilot.

I have about 15 or so subagents doing reviews from different perspectives (or providing some additional value, like finding agents.md files, doing confidence ranking, describing images attached to the PR, that get validated later on with Jira issue description).

I used it since about November, with large scale popularity in my company reaching in April - all that on a 300 premium requests (because they allowed starting subagents, and there was no limit how long a single request can last) - so it would cost something like $5000 and $8000 for April and May if it was API pricing. I had similar cost per review (about $0.90) with Opus 4.6 and help from Sonnet and Haiku for simpler tasks. It did about 4000 reviews during the last 2 months.

And starting in June, it will be dead because it will be API pricing and for $30 (or $19 since September) it will do just few reviews.

A fun project.

neebz 4 hours ago | |

do you also have separate prompts for each domain (security, architecture etc?).

would love to look into it if any part of it is open source

bob1029 5 hours ago |

> One of the operational headaches we didn’t predict was that large, advanced models like Claude Opus 4.7 or GPT-5.4 can sometimes spend quite a while thinking through a problem, and to our users this can make it look exactly like a hung job.

I had the same problem in my recursive agent harness. It would always come back, but it could sometimes take up to 10 minutes depending. I fixed this by adding a required "purpose" argument to every tool and call/return event. As the recursive evaluation proceeds, every single thing that happens streams incremental purpose text to the user's browser (also using the magic of JSONL for this). The incremental progress events contain the purpose and a detail section (tool arg JSON) that the user can expand/collapse.

derwiki 4 hours ago | |

Nice trick! I am doing something similar but passing those incremental updates to Haiku for a short user-friendly message.

jmakov 4 hours ago |

Every iteration something can be found. How many times do you iterate e.g. on performance - use optimized struct, oh, you can change the architecture etc.? At that point one can just have a while loop for the agents to make changes until no comments left.

etothet 4 hours ago |

What’s the over/under on when Cloudflare will acquire OpenCode (and keep it open source)?

faangguyindia 4 hours ago |

what's best workflow for solo devs?

criley2 1 hour ago | |

You can do basically the same thing as cloudflare except as a skill you run in your local harness. If you're going through the motions with PRs and are familiar with actions, you can have it run in a github action instead. But this is basically just a skill. The Claude code review skill is a simple version of exactly this.