AI Code Review Gets Better When I Ask Models to Debate: Claude, Gemini, Codex

AI Code Review Gets Better When I Ask Models to Debate: Claude, Gemini, Codex (milvus.io)

23 points by Fendy 126 days ago | 3 comments

7777777phil 126 days ago

Makes sense. This also tracks with the research on human-AI collaboration. A single model converges to the mean of its training distribution, but adversarial multi-model setups break that pattern because each model's blind spots are different.

I wrote about why single-model AI has a structural quality ceiling and why ensemble/hybrid approaches consistently outperform: https://philippdubach.com/posts/the-impossible-backhand/

itmitica 126 days ago

I did the exact same thing! Uncanny.

I agree that models are better at different tasks: gemini-cli is superficial, codex is stubborn as a mule and dependable, claude-cli just wants to get something working and done. qwen-cli, and Qwen in general, tends to swing back and forth too much.

I also cut my own team down to two: codex and claude.

rbliss 126 days ago

Agree with this. I have Codex do analysis and feedback for Claude Code. For whatever reason, Claude Code seems to produce working code more often, but it tends to have blind spots in analysis that Codex does a good job of picking up. The two together feel like a step up in the state of the art.

I need a tool to put them in a loop together to get this done more efficiently… I guess I’ll plug this in as a prompt and go from there!
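For what it's worth, a loop like that can be sketched in a few lines. This is only a minimal sketch, not anyone's actual tool: `review_loop`, `author`, and `reviewer` are hypothetical names, and each callable is assumed to wrap whatever model or CLI you use (for example, one wrapping Claude Code and one wrapping Codex). It also assumes the reviewer returns an empty string when it has no objections. The stubs at the bottom stand in for real model calls so the control flow can be checked offline.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: wrap your own model or CLI calls to match these.
Author = Callable[[str, str], str]    # (task, feedback) -> code
Reviewer = Callable[[str, str], str]  # (task, code) -> feedback; "" means approved


def review_loop(
    task: str,
    author: Author,
    reviewer: Reviewer,
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Alternate author and reviewer until approval or max_rounds is reached."""
    feedback = ""
    code = ""
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        code = author(task, feedback)    # author sees the latest critique
        feedback = reviewer(task, code)  # reviewer critiques the new draft
        history.append((code, feedback))
        if not feedback.strip():         # empty feedback is treated as approval
            break
    return code, history


# Stubs standing in for real models, purely for illustration.
def stub_author(task: str, feedback: str) -> str:
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"  # deliberately buggy first draft


def stub_reviewer(task: str, code: str) -> str:
    return "" if "a + b" in code else "add() subtracts instead of adding"


final_code, history = review_loop("write add()", stub_author, stub_reviewer)
```

The `max_rounds` cap matters with real models: two stubborn models can disagree indefinitely, so the loop hands back the last draft and the full history instead of spinning.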