Learnings from 100K lines of Rust with AI (2025) (zfhuang99.github.io)

126 points by pramodbiligiri 14 hours ago | 131 comments

chadd 12 hours ago |

We're working on a large Rust codebase, with development heavily assisted by Claude and Codex, and one critical workflow is this: after you have written a spec, have the other LLM critique it thoroughly.

This back and forth will take quite a while, but the resulting implementation plan will be 10x better than the original.

You can automate this by giving Codex a goal, and a skill to call Claude to review the implementation spec until they both agree it's done.

Then, for critical code, have them both implement the spec in a worktree, then BOTH critique each other's implementation.

More often than not, Claude will say to take 2 or 3 pieces from its design over to Codex, but ship the Codex implementation.

Aurornis 11 hours ago | |

I take this idea even further: After the LLMs have critiqued each other, I introduce a third critique and review it myself as a human. This third party review is most effective at highlighting problems that the LLMs miss, in my experience.

Jokes aside, I agree about having LLMs iterate. Bouncing between GPT and Opus is good in my experience, but even having the same LLM review its own output in a new session started fresh without context will surface a lot of problems.

This process takes a lot of tokens and a lot of time, which is fine because I’m reviewing and editing everything myself during that time.

knivets 10 hours ago | |

This is astrology for devs.

embedding-shape 7 hours ago | | |

Unless you can somehow provide some arguments against it, I feel like you're the one who is trying to cargo-cult stuff here.

Say what you will with proper reasoning or arguments if you feel compelled; tired reddit-commentary like that helps no one.

giancarlostoro 10 hours ago | |

This is precisely how I used to use Beads before I made GuardRails (I wanted something slightly simpler, but similar with more 'guard rails'). I braindump everything I want to build, then ask Claude to do market-level research. I then ask Claude to ask clarifying questions, then ask it to be critical of its conclusions, provide the top options, and justify them. I also question Claude and say it's okay to disagree with me and be critical; I just want to understand.

By the end you have piecemeal "tickets" for your coding agent. If you have multiple developers you can sync them all up into GitHub, and someone could take some locally, or you can just have Claude work on all of them with subagents. The key feature there is that because it's all piecemeal, the context stays per task.

Then I run a /loop 15m If you're currently working ignore this. Start on the next task in gur if you have not. If you finished all work and cannot pass one gate, work on the next available task.

(Note: gur is my shorthand for GuardRails)

I also added a concept called "gates": a task cannot complete without an attached gate. Gates are arbitrary and can be reused, but when assigned to a task those specific assignments are unique per task. A task is basically anything you want it to be: unit test, try building it, or even seek human confirmation. At least when I was using Beads it did not have "gates", but I'm not sure if it has added anything like it since I stopped using Beads.

Claude will ignore the loop if it's currently working, and when it's "out of work" it will review all available tasks.

If anyone's curious, it's MIT licensed and on GitHub:

https://github.com/Giancarlos/guardrails
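
The rule described above (a task cannot complete without an attached gate, and each gate assignment belongs to one task) could be modeled roughly as follows. This is a minimal sketch, not GuardRails' actual data model or API: `Task`, `Gate`, `GateStatus`, and every method name here are hypothetical and chosen only for illustration.

```rust
/// Outcome of a gate check (hypothetical type, not GuardRails' own model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateStatus {
    Pending,
    Passed,
    Failed,
}

/// A gate assignment that belongs to exactly one task.
#[derive(Debug)]
pub struct Gate {
    pub name: String,
    pub status: GateStatus,
}

/// A unit of work that can only complete once all its gates have passed.
#[derive(Debug)]
pub struct Task {
    pub title: String,
    gates: Vec<Gate>,
    done: bool,
}

impl Task {
    pub fn new(title: &str) -> Self {
        Task { title: title.to_string(), gates: Vec::new(), done: false }
    }

    /// Attaches a fresh, pending gate assignment to this task.
    pub fn attach_gate(&mut self, name: &str) {
        self.gates.push(Gate { name: name.to_string(), status: GateStatus::Pending });
    }

    /// Records the result of running the named gate, if it is attached.
    pub fn record(&mut self, name: &str, status: GateStatus) {
        if let Some(gate) = self.gates.iter_mut().find(|g| g.name == name) {
            gate.status = status;
        }
    }

    /// Marks the task done only if it has at least one gate and all passed.
    pub fn complete(&mut self) -> Result<(), String> {
        if self.gates.is_empty() {
            return Err(format!("task '{}' has no gate attached", self.title));
        }
        if let Some(gate) = self.gates.iter().find(|g| g.status != GateStatus::Passed) {
            return Err(format!("gate '{}' has not passed", gate.name));
        }
        self.done = true;
        Ok(())
    }

    pub fn is_done(&self) -> bool {
        self.done
    }
}
```

In this sketch, calling `complete()` on a task with no gates, or with a gate still pending or failed, returns an error, so an agent cannot mark work finished just by declaring it finished.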

motoboi 11 hours ago | |

I strongly believe you don’t need to call another model for that. The same model can do it just fine. Just not as part of the same context.

I mean that if you ask codex on gpt 5.5 to submit to a plan reviewer subagent that uses gpt5.5, this is enough to get a very good review and reassessment of the plan.

My hypothesis is that it’s even better than opus.

The reason submitting the product of one LLM to another for review works is that you need a fresh trajectory. The previous context might have “guided” the planner into some bias. Removing the context is enough to break free from that trajectory and start fresh.

ai_fry_ur_brain 11 hours ago | |

I hate how seriously people take the output of LLMs or how reliable they think it is.

Have Claude produce that spec 10 times, using the same prompt and same context. Identical requests, but you'll get 10 unique answers that will contradict each other, with each response seeming extremely confident.

It's scary how confident you people are in these outputs.

CrazyStat 11 hours ago | | |

If you ask 10 different humans to produce the spec with the same information (prompt and context) they will also produce 10 unique answers that will contradict each other and (depending on who you asked) may be just as confident.

There are real decisions to be made when going from a vague prompt to a spec. It's not surprising that an LLM would produce different specs for the same work on different runs. If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.

Robdel12 11 hours ago | | |

Imagine making this your entire identity

AnimalMuppet 11 hours ago | |

The return of pair programming.

slopinthebag 6 hours ago | |

It's incredible how much developers will do to avoid having to look at or think about code.

torben-friis 13 hours ago |

>Testing is the first layer of defense. My system now includes 1,300+ tests — from unit tests to minimal integration tests (e.g., proposer + acceptor only), all the way to multi-replica full integration tests with injected failures. See the project status.

I know LOC is a silly metric, but ~1300 tests for 130k lines averages out to a test per 100 lines - isn't this awfully low for a highly complex piece of code, even discounting the fact that it's vibecoded? 100 LOC can carry a lot of logic for a single test, even for just happy paths.

embedding-shape 12 hours ago | |

Considering that the domain is distributed systems, and that the aim is to implement "a Rust-based multi-Paxos consensus engine that not only implements all the features of Azure’s Replicated State Library (RSL)", I don't think we even have to look that deep into it: it's severely lacking tests.

If you're building a distributed system and you don't have more tests and testing code than actual code, by an order of magnitude most likely, then you're missing test coverage.

kawogi 13 hours ago | |

IIUC only 50k LoC are non-test code, which improves the metric. Whether that's enough tests still depends on the code. If most are getters and setters, the coverage might be ok.
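
For reference, the arithmetic behind the two figures in this subthread can be written out as a tiny helper. This is only a sketch using the numbers quoted by the commenters (1,300 tests, 130k total lines, roughly 50k non-test lines); `lines_per_test` is a hypothetical name.

```rust
/// Average number of lines of code per test.
pub fn lines_per_test(lines: u32, tests: u32) -> f64 {
    f64::from(lines) / f64::from(tests)
}
```

With the thread's figures, `lines_per_test(130_000, 1_300)` gives 100 lines per test, while `lines_per_test(50_000, 1_300)` gives roughly 38 lines of non-test code per test.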

risyachka 13 hours ago | |

I may have missed it, but were those tests written by a person or generated? Otherwise, how do you know they even test anything (like actually test, not just appear to test)?

jdw64 13 hours ago |

I'm also shifting to a vibe coding workflow, but I have a genuine question: whenever I use AI for Rust, it makes an insane amount of lifetime errors. I have no idea how people are churning out so many lines of code so quickly.

Honestly, despite all the hype around Rust in the community, the fact that AI can't handle lifetimes reliably makes me reluctant to use it. The AI constantly defaults to spamming .clone() or wrapping things in Rc, completely butchering idiomatic Rust and making the output a pain to work with.

On the other hand, it writes higher-level languages better than I do. For those succeeding with it, how exactly are you configuring or prompting the AI to actually write good, idiomatic Rust?
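
To make the `.clone()` complaint above concrete, here is a minimal sketch contrasting the two styles. Both functions are hypothetical examples, not taken from the article or from any model's output: the first takes ownership (which pushes callers toward cloning), the second borrows and returns a reference tied to the input's lifetime.

```rust
/// Clone-prone style: takes ownership of the whole vector, so a caller that
/// still needs its data afterwards has to pass `words.clone()`.
pub fn longest_owned(words: Vec<String>) -> String {
    words.into_iter().max_by_key(|w| w.len()).unwrap_or_default()
}

/// Borrowing style: reads the slice in place and returns a `&str` whose
/// lifetime `'a` is tied to the input, so nothing is copied.
pub fn longest<'a>(words: &'a [String]) -> &'a str {
    words
        .iter()
        .max_by_key(|w| w.len())
        .map(|w| w.as_str())
        .unwrap_or("")
}
```

A caller can use `longest(&words)` and keep using `words` afterwards, whereas `longest_owned` either consumes the vector or forces a clone at the call site.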

icemanx 13 hours ago |

How many of those tests have you actually read yourself, if all of them are generated by AI (including while you're sleeping)?

This is from 2025. I would like to see an update on how that system turned out after the vibe hype.

ramon156 12 hours ago | |

I feel like there are very few blogs that actually follow up on their experiments. It's just dopamine city.

jsLavaGoat 6 hours ago |

I am having a different experience than a lot of other commenters here vibe coding with Rust. I am not a Rust programmer or evangelist. I have implemented a drop-in Bash replacement/clone in Rust that passes the upstream Bash test suite and a whole battery of its own. It is a tiny bit faster than Bash itself but consumes a bit more memory. But Codex and Claude both did a great job with it.

I also had it implement a wasm geodesic calculator in Rust and it's amazing and in my use case is better than geodesiclib using the same updated algorithm.

I'm one of the "C-niles" Rust folks love to hate, and I did my first hacking in C, with Deep Blue C, on Atari 8-bits. But I'm very impressed with these products and with the ability to leverage some features of Rust with them (e.g. audit every unsafe instance and define its invariants, etc.).
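
As an illustration of what "audit every unsafe instance and define its invariants" can look like in practice, here is a minimal sketch using the common Rust convention of a `// SAFETY:` comment directly above each `unsafe` block. The function `first_byte` is a hypothetical example, not code from the commenter's project.

```rust
/// Returns the first byte of `bytes`, or `None` if the slice is empty.
pub fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // SAFETY: the slice was checked to be non-empty above, so index 0 is in
    // bounds, which is the only requirement of `get_unchecked`.
    Some(unsafe { *bytes.get_unchecked(0) })
}
```

Clippy's `undocumented_unsafe_blocks` lint can be enabled to flag `unsafe` blocks that lack such a comment, which gives an agent (or a reviewer) a mechanical checklist to work through.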

I also agree with the commenter who said these LLMs are, at the present moment, good at Go. The only language I notice they seem to be really good at, above and beyond others, is JavaScript, I assume because there's so much of it.

misja111 11 hours ago |

To me, the real question after reading this, is: Is your new implementation of Azure’s RSL now being used?

If it is, and it works well, then to me this is far more meaningful than the fact that AI wrote 130K lines of code.

staszewski 13 hours ago |

It's almost guaranteed with agents you could do the same job with less than half of 100k lines. I don't know what's impressive about lines of code generated by an agent.

ndr 13 hours ago | |

It's just an anchor. If it were 50k, would you say the same, down to 25k? And if so, how many more times would it apply?

The interesting thing is that it was manageable solo (in many ways it's _more_ manageable solo+AIs than with coworkers+(their)AIs), and in such a short amount of time.

kikimora 12 hours ago | | |

The original RSL library is 36k LoC, and that is C++. Rust should be about 50% smaller, that is, 18k LoC. This library is so big that I bet the author has no idea if it works or not. 1,300 tests generated by AI say nothing about actual quality.

In the end it is just a lot of unmaintainable code quickly generated by AI.

rimliu 12 hours ago | | |

The interesting thing is how fast it becomes unmanageable.

ashirviskas 13 hours ago | |

> It's almost guaranteed with agents you could do the same job with less than half of 100k lines.

That's great, non-test code is only ~47k lines of code.

sreekanth850 13 hours ago | |

For a startup with limited funding, building a product is no longer a bottleneck. Not everyone has the same access to funding!

sltr 11 hours ago |

Contrarian view: Why English will never be a programming language. https://www.slater.dev/2026/05/why-english-will-never-be-a-p...

pjmlp 4 hours ago |

The moment a language is the output of a natural language compiler, the language itself is kind of irrelevant.

Change the skills, ask the agent to do exactly the same in something else.

I am slowly focusing on agent orchestration tools, which make the actual programming language as relevant as doing SOA with BPEL.

throw-the-towel 4 hours ago | |

The language may be irrelevant, but the hard guarantees it offers are not. Agents are still very stochastic, they need something deterministic constraining their output.

pjmlp 3 hours ago | | |

That is where formalisms come into play.

Also it is kind of interesting that there is so much enthusiasm to use Claude and Claw all over the place, yet lack of vision on how much the whole infrastructure will improve.

Even when it finally bursts and we get into another AI Winter, what was already achieved isn't going away.

dxxvi 6 hours ago |

The thing that impresses me most is that the author knows everything (from the high level architecture to the small details) of "multi-Paxos consensus engine" (I have no idea what it is, but it must be very complicated) and can write everything out for AI to read (or did he/she use an app to convert speech to text)?

danbruc 12 hours ago |

Paxos is certainly non-trivial in the sense that tiny changes can break it, but in terms of functionality it is not that big. 50 KLOC just seems like a lot of code to me.

bio-s 10 hours ago |

I have a Tarpaulin code coverage check, and every time coverage drops below the threshold Claude gives up quickly and just lowers the threshold. I don't know how to overcome it. Neither CLAUDE.md nor AGENTS.md helps; the LLM always finds its way.
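
One possible mitigation for the problem above is to pin a minimum threshold somewhere the agent is not allowed to edit (for example a CI-only check) and fail if the configured value drops below it. The following is a minimal sketch under loud assumptions: it assumes the threshold is stored as a line of the form `fail-under = <number>`, mirroring cargo-tarpaulin's `--fail-under` flag, and `COVERAGE_FLOOR`, `configured_threshold`, and `threshold_respects_floor` are hypothetical names. Check your own config format before relying on it.

```rust
/// Minimum coverage percentage agreed on by humans (hypothetical value).
pub const COVERAGE_FLOOR: f64 = 80.0;

/// Extracts the number from a line like `fail-under = 80`.
/// The key name and line format are assumptions about the config file.
pub fn configured_threshold(config: &str) -> Option<f64> {
    config.lines().find_map(|line| {
        let (key, value) = line.split_once('=')?;
        if key.trim() == "fail-under" {
            value.trim().parse::<f64>().ok()
        } else {
            None
        }
    })
}

/// True only if a threshold is present and has not been lowered below the floor.
pub fn threshold_respects_floor(config: &str) -> bool {
    matches!(configured_threshold(config), Some(t) if t >= COVERAGE_FLOOR)
}
```

This only moves the problem: the check helps if it runs in a place the agent cannot modify, since an agent that can edit the floor can lower that too.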

nilirl 13 hours ago |

Is the idea of the runtime contracts similar to the idea of runtime validation? Or are they different in some way?

pramodbiligiri 13 hours ago | |

It is described in the "Code Contracts" section of the article: "Code contracts specify preconditions, postconditions, and invariants for critical functions. These contracts are converted into runtime asserts during testing but can be disabled in production builds for performance". The .NET framework article that he links to: https://learn.microsoft.com/en-us/dotnet/framework/debug-tra...
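
In Rust, the idea quoted above maps naturally onto the standard `debug_assert!` macro, which is checked in debug and test builds and compiled out of optimized builds by default. The sketch below is an illustration of the technique only, not the article author's code: the `Log` type, its fields, and its methods are hypothetical.

```rust
/// A toy log where `committed` counts how many entries are committed.
pub struct Log {
    entries: Vec<u64>,
    committed: usize,
}

impl Log {
    pub fn new(entries: Vec<u64>) -> Self {
        Log { entries, committed: 0 }
    }

    pub fn committed(&self) -> usize {
        self.committed
    }

    /// Invariant: the commit index never points past the end of the log.
    fn check_invariant(&self) {
        debug_assert!(self.committed <= self.entries.len(), "invariant violated");
    }

    /// Advances the commit index to `upto`; it never moves backwards.
    pub fn commit(&mut self, upto: usize) {
        // Precondition: cannot commit entries that do not exist.
        debug_assert!(upto <= self.entries.len(), "precondition violated");
        let before = self.committed;

        self.committed = self.committed.max(upto);

        // Postcondition: the commit index is monotonic.
        debug_assert!(self.committed >= before, "postcondition violated");
        self.check_invariant();
    }
}
```

Compared with runtime validation, which returns an error to the caller for bad input, these checks describe conditions that should never be false if the code is correct, so a violation panics during testing instead of being handled.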

andai 12 hours ago | | |

Is this basically what Dijkstra was saying? I've been thinking how his approach was considered impractical, but may eventually become necessary for security/stability reasons the way things are going. (Seems like there's a new zero-day on the HN front page every day now.)

nilirl 13 hours ago | | |

Ah, I missed the reference. Thanks a lot!

kikimora 12 hours ago |

This is a great example of AI slop and a big problem with AI coding.

The original RSL library has 36 KLoC across C++ source and header files. Rust is supposed to be more expressive and concise. Yet AI generated 130k LoC. I guess nobody understands how this code works and nobody can tell if it actually works.

jmpeax 11 hours ago | |

All unit tests can pass if you don't assert anything. Just have to make sure to read through all 130k lines of code to check.
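
A minimal sketch of the failure mode described above: a check that exercises the code but asserts nothing will pass even when the code is wrong. All names here are hypothetical, and the checks are written as plain functions returning `bool` (instead of `#[test]` functions) so the difference is easy to see.

```rust
/// Deliberately buggy: should return the larger value but returns the smaller.
pub fn buggy_max(a: i32, b: i32) -> i32 {
    if a < b { a } else { b }
}

/// A "test" that runs the code but checks nothing, so it passes no matter
/// what `buggy_max` returns.
pub fn vacuous_check() -> bool {
    let _ = buggy_max(1, 2);
    true
}

/// A check that compares against the expected result and so catches the bug.
pub fn asserting_check() -> bool {
    buggy_max(1, 2) == 2
}
```

Here `vacuous_check()` reports success while `asserting_check()` reports failure for the same buggy function, which is why a raw count of passing tests says little on its own.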

wren6991 7 hours ago |

I've found Rust's safety guarantees to be less useful for slop-generated code because LLMs can always fight their way through the borrow checker by spamming enough Arc<Mutex<Arc<Mutex<...>>>> and clone() everywhere. Rust only gives you safety properties, not liveness. Interior mutability is a fantastic tool for turning safety failures into liveness failures. Remember kids: deadlock is a safe outcome.

It works for humans because when we get a borrow-check failure, we take a step back and think about the global shape of our code and ownership. LLMs path straight to the goal. Problem: code doesn't compile. Solution: more clone()
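
The "deadlock is a safe outcome" point can be shown in a few lines of safe Rust. This is a minimal sketch with a hypothetical function name; it uses `try_lock` instead of a second `lock()` so that it reports the problem rather than hanging.

```rust
use std::sync::{Arc, Mutex, TryLockError};

/// Returns true if taking the lock a second time on this thread would block.
pub fn second_lock_would_block(shared: &Arc<Mutex<Vec<u32>>>) -> bool {
    // First guard: held until the end of this function.
    let _guard = shared.lock().unwrap();

    // A clone of the Arc points at the same mutex. The borrow checker sees
    // an independent handle and accepts the code without complaint.
    let alias = Arc::clone(shared);

    // Calling `alias.lock()` here would never return (it may deadlock or
    // panic), yet it compiles as safe code. `try_lock` lets the sketch
    // observe that the lock is already held without hanging.
    let would_block = matches!(alias.try_lock(), Err(TryLockError::WouldBlock));
    would_block
}
```

Nothing in this function is `unsafe` and there is no data race, which is exactly the safety property Rust promises; whether the program ever makes progress is a liveness question the compiler does not check.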

10g1k 12 hours ago |

Lessons. There's no such thing as learnings.

criddell 12 hours ago | |

Learnings is irritating to me. The way kids use the word aesthetic is irritating too. I wonder if I might be that old man shaking his fist at the clouds, but I have gotten over begs the question, and literally, so maybe not yet...

zahlman 10 hours ago | |

I understand the instinct, but that's a bit too prescriptive for me.

https://en.wiktionary.org/wiki/learnings

tskj 12 hours ago | |

A lesson would be a specific learning activity happening at a specific place and time, administered by a person more knowledgeable than you; like a teacher or mentor "giving a lesson".

If you're fine with the generalized form "learned a lesson", then surely "learnings" is fine too. There's no point in trying to police a completely normal and sensible use of language.

esafak 11 hours ago | | |

So when you cause an incident because you did not pay attention and "learn your lesson" who's the mentor?

chemex 12 hours ago |

How are you keeping the requirement, design, and tasks docs in sync as the code evolves? I'm curious if anyone's landed on a good workflow for this.

valcron1000 10 hours ago |

Where can we read the code?

faangguyindia 13 hours ago |

Rust code generation consumes a lot of tokens.

Go is a much better target. I've observed rails/ruby code is also much easier for AI to spit out.

And Haskell flies with AI

jgilias 13 hours ago | |

Yes, but it comes with much better “built-in” guardrails to rein in the autocomplete. Especially if compared to something runtime-surprise-prone-if-lovable like Ruby.

faangguyindia 11 hours ago | | |

This is why I suggested Go.

Rust doesn't add anything over Go for LLM coding.

bharxhav 12 hours ago |

Rust is about abstractions more than code. You can ask AI to "Optimize/Test/Clarify", but at the end of the day you should be willing to blindly agree to its output or spend more time reviewing someone else's code.