Grit: Rewriting Git in Rust with agents (blog.gitbutler.com)
Hmm. That's going to be interesting.
I'd be fascinated to see what happens if it does. Both in the analyses that we'd get of what the LLM did to the codebase and on the legal decisions on what the copyrightable creative elements in code actually are.
If I was the author though... there would be no way that I would be volunteering to be a test case like this. Also seems just rude for no reason.
[US jurisdiction]: Anything in the result written by the LLM cannot be copyrighted by anyone.
Anything in the result written by a human can be, and if it was all emitted by the LLM then that portion originally written by a human carries its own copyright.
As a work of an LLM, the entirety presumably cannot be copyrighted at all. Portions written by humans presumably carry their original copyright.
Previously I described it as "Models give you what you ask for, not what you want". Now with Fable they don't even give you what you want so idk.
There is no way anyone would ever use this for its CLI - it will almost certainly always be slower and worse in every way, even if I get it stable (which it's currently not). You can use libgit2 (a project I also helped kickstart), or Gitoxide (a project GitButler also currently helps drive) - they are faster and better in nearly every way, but they are not feature complete.
This isn't for the person using Git. This is for someone trying to build a tool that wants to use parts of Git, which is different.
Rust is some ugly poo.
Probably doable - I remember most of Natural Selection 2 was Lua and it's more than a decade old at this point.
Link: https://unknownworlds.com/en/news/spark-engine-questions-and...
And yet this performs dramatically worse.
A slower, untested, incomplete git implementation, all for the low low price of $10-$15,000.
And don’t forget it wasted a bunch of human time in the process.
So if someone mentioned somewhere else there is already a Rust port a group is doing somewhere. How much could they have accomplished with this much money and time in software development resources?
Ok. AI can seemingly port stuff if you don’t test it thoroughly. I think that’s already been proven. At this point I’m seeing less and less value from these kinds of things. I’m sure it was fun for the author, but how does it help other people?
If the first stereotype of Rust programmers is announcing that a project is in Rust before any other desirable software property (e.g. stable, performant, etc), the second stereotype is that Rust programmers love rewriting stuff in Rust, just for the sake of Rust.
(The 2.a. corollary is that they love rewriting GPL projects specifically and downgrading them to MIT/Apache)
Recently Casey Muratori said in an adjacent context that the Microsoft AI push may be related to the fact that they have a long-standing and elaborate codebase. A large historic software company could have advantages in training models. They could provide extra value with their IP.
Now their IP is potentially in their models and accessible to anyone. If they actually train models on their IP, anyone could implement their APIs and slap a GPL license on it.
At that point, things will get very interesting.
[1]. https://github.com/ianm199/lua-rs/tree/main Lua
[2]. https://github.com/ianm199/valdr Valkey/ Redis
[3]. https://github.com/ianm199/nginx-rs-port nginx
Happy to answer any questions on the approach! When I started a few weeks ago the harnesses on their own were not good enough to get very far without a "meta harness" of sorts but that is changing largely with Claude Workloads and Mythos. A lot of the work is developing some custom tooling to move these along faster.
But in terms of learning I'm learning relatively little about how to type Rust into an editor but a lot about how to set up agentic loops that can autonomously get tests to pass and improve performance.
For example if you just tell a frontier model (gpt5.5 or Claude Code 4.8) to make some portion of the tests pass they will take forever and just bang their heads against it. I developed a framework to mimic a lot of these tests in nginx... but in minimal, non-blocking ways so you can run many in parallel with short feedback loops.
Similar for performance - how to make tons of performance benchmarks and expose maximum telemetry for agents to go and analyze the hot paths etc.
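As a rough illustration of the "benchmarks plus telemetry" idea, here is a minimal Rust sketch of a timing harness that emits one machine-readable line per benchmark. The names `BenchReport`, `bench`, and `to_line` are hypothetical and are not taken from the commenter's tooling; a real setup would likely add profiling data and many more counters.

```rust
use std::time::{Duration, Instant};

/// Timing summary for one named benchmark (hypothetical type).
#[derive(Debug)]
struct BenchReport {
    name: String,
    iterations: u32,
    min: Duration,
    max: Duration,
    mean: Duration,
}

/// Run `work` `iterations` times and record per-iteration wall-clock time.
fn bench<F: FnMut()>(name: &str, iterations: u32, mut work: F) -> BenchReport {
    assert!(iterations > 0, "need at least one iteration");
    let mut min = Duration::MAX;
    let mut max = Duration::ZERO;
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        let start = Instant::now();
        work();
        let elapsed = start.elapsed();
        min = min.min(elapsed);
        max = max.max(elapsed);
        total += elapsed;
    }
    BenchReport {
        name: name.to_string(),
        iterations,
        min,
        max,
        mean: total / iterations,
    }
}

/// One key=value line per benchmark, easy for a script or agent to parse.
fn to_line(r: &BenchReport) -> String {
    format!(
        "bench={} iters={} min_ns={} mean_ns={} max_ns={}",
        r.name,
        r.iterations,
        r.min.as_nanos(),
        r.mean.as_nanos(),
        r.max.as_nanos()
    )
}
```

Calling `to_line(&bench("parse_headers", 1000, || { /* code under test */ }))` would give a single line that can be appended to a log and compared across runs; the benchmark name here is only an example.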
Agree with first half of this sentence, we should all have fun with experiments.
> It was never based on a linkable and reentrant library, but instead on a "Unix" philosophy of chaining together simpler commands, which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
Ahhh now we have philosophical disagreement in the only place in the entire article that says "why". Unix is a feature, it's arguably more important in current time: https://aperocky.com/blog/post.html?slug=unix-philosophy-age...
> It was never based on a linkable and reentrant library, but instead on a "Unix" philosophy of chaining together simpler commands, which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
Git operates on the filesystem level, the Unix behavior is just getting buried. You cannot rewrite Git into a linkable library and decide it's now not Unix. Its entire behavior is Unix, which is why it's awesome.
Libgit2 is meant to address this and I was heavily involved in the development of that project 15 years ago. It's great but it's not feature complete and its development is also completely separate from Git development, so it's out of sync and constantly struggling to keep up.
I downloaded v0.3.99 for Linux x86_64 and stripped the binary. It ends up at 31 MB. The .text section is 25 MB.
I'm surprised by the large size. On my system /usr/bin/git is 4.7 MB, although git is split up into multiple programs. I'm not comparing apples to apples, but this is weird.
If anyone digs into the binary size, please share what you find.
I haven't dug into this at all yet, nor have I tried to optimize the size (or really, anything else).
However, the library part will be less than half of this - a lot of code is spent on the CLI specific stuff and would not be part of the library, which is mostly what I care about for the purposes of this project. The CLI part is just to try to prove the point that it actually does what Git does. The library part is what might be useful in that nothing else exists that does all of the things that it does (provide a reentrant linkable library that is feature complete with Git).
Similarly, is there any momentum left for Cloudflare's EmDash? I can barely find any discussion after April.
> it has been nearly entirely written by agents and has not been used for realsies. It's probably currently unusably slow or completely broken in ways that are not exercised in the test suite.
Right now it's someone else's experiment that is still in the "might or might not pan out" stage.
There are a bunch of projects using the similar (not vibe coded, less fully featured) gitoxide project - there is demand for git-as-a-library.
The author of Gitoxide is also working on GitButler (who worked on this project) and we're pushing both projects forward and actively using and developing Gitoxide as well. This is simply a different and hopefully complementary approach to the same problem.
Why not just make better Python bindings to libgit?
It's an organic success, hard to replicate. If at all, CF can only make people migrate with massive effort. Marketing effort, selling lots of snake oil in the process. WP won't just hop on the hot new thing, WP is the definition of the opposite. It works for them. Why change?
Git is the same on the other side. It requires maintenance and improvements, surgical and correct. No git maintainer has time to learn a gigantic new codebase and they will stick with what works for them. For git users there are no advantages. So similarly it would require a long time effort to push the project, building trust that it is somehow better, probably requiring Linus to say "it's great".
You don't get to choose a license and then add extra terms to it when you don't feel like it's up to scratch. That's something explicitly not allowed by the GPL license.
Isn’t having to stay under the GPL a very big part of the GPL license?
Also, I worked on the Ruby Grit pretty extensively during the early days of GitHub, so hopefully I earned the right to carry on the mantle. :)
I want to get it to the point where we can replace fork/exec'ing to an unknown Git binary or having said binary be an external dependency for GitButler. The networking stuff (push/fetch) is currently an external dep for both GitButler and Jujutsu (and pretty much every other Git-based tool in the world). I'm pretty sure I can get the project good enough at these networking ops (including all the hairy credential stuff) to be able to not need those fork/exec calls.
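To make the fork/exec pattern concrete, here is a minimal Rust sketch of how a tool typically shells out to an external Git binary. It assumes a `git` executable is on `PATH`; the helper names `git_command` and `run_git` are hypothetical and are not taken from GitButler, Jujutsu, or Grit.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, Output};

/// Build (but do not run) a `git` invocation scoped to `repo`.
/// `git -C <dir>` makes Git behave as if it was started in that directory.
fn git_command(repo: &Path, args: &[&str]) -> Command {
    let mut cmd = Command::new("git");
    cmd.arg("-C").arg(repo).args(args);
    cmd
}

/// Spawn the child process and wait for it to finish.
/// Every call pays the process-spawn cost, and the caller only sees
/// an exit status plus raw stdout/stderr bytes.
fn run_git(repo: &Path, args: &[&str]) -> io::Result<Output> {
    git_command(repo, args).output()
}
```

A call such as `run_git(Path::new("."), &["fetch", "origin"])` depends on whichever Git version happens to be installed, which is the "unknown Git binary" problem described above. A linkable library would replace this with an in-process function call.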
> Currently both Gitoxide and libgit2's networking functionality is either partial, slow or non-existant. Both GitButler and Jujutsu rely on forking out to Git in order to push or pull data. A big reason for this is the incredibly complicated credential logic involved, but all of this is (theoretically) currently covered in Grit.
We clearly learned from how Git does operations and emulated it in order to function interoperably, the same way that Gitoxide and libgit2 have, and released it under a license that would be the most valuable for people wanting to use a linkable library, the same way that Gitoxide and libgit2 have.
You decide whether you have followed it or not. The other party will decide if they agree. If in dispute, you go to a judge and they decide also.
it's just in this case it's the author. we'll have to wait and see who decides to challenge it
In fact, I would rather it stay C for 15 more years.
Don't bother.
It's probably not for you. It's slower, more obtuse, more bloated, less capable, exponentially less scalable at any size. Canonical Git is better in every way, except being a linkable library.
Even in the arena of linkable libraries that can do Git stuff, Gitoxide (Rust) and libgit2 (C, with Rust bindings via the git2 crate) are both better, they're just not feature complete. That is the only point of this project.
Why not 100%?
> It's not actually passing every single test, though that is on purpose. I did mark some parts of the testing suite as "skipped" because I don't think it's worth recreating them in a library like this
> 41,715 / 42,001 tests passing (99.3%)
So it is not the entire suite then, but somehow that was worth burning ~$8,000 worth of tokens?
From the article
> It's not actually passing every single test, though that is on purpose. I did mark some parts of the testing suite as "skipped" because I don't think it's worth recreating them in a library like this - email related stuff, i18n, perforce/svn importers, some of the midx/bitmap stuff - things of that nature. However, for everything that I'm sure is relevant to nearly anyone reading this, the Grit library/CLI can now fully pass the Git test suite.
Reimplementation is a particularly juicy target because it's easy to test. Imagine someone writing a better browser than Chrome from scratch in just a year.
Because of this moats around business due to difficulty of implementation are effectively gone.
I don't care if any git I use has email features. IIUC, even most of the people that use git with email don't directly use the email features, they use the patch set features like `git am`. I expect `git am` to work, I don't expect git to actually do email.
Well, it's sort of for Rust. GitButler is written in Rust and Jujutsu is written in Rust and we're both depending on fork/exec'ing to an unknown Git binary with no linkable library and no control over the subprocess to do a range of networking stuff. Neither Gitoxide nor libgit2 is capable of this either, as much as I love and support those projects.
This project is entirely about providing a feature complete (even if sloppy) library implementation of Git, which does not otherwise exist.
Thanks.
The point is to provide a feature-complete reentrant linkable library. Even if it's an ugly and slow one, this is still the only thing that exists that covers those points - Gitoxide and libgit2 are both awesome but they are not feature complete.
> it made me wonder about the feasibility of using that same approach to accomplish something I've been dreaming about for 15 years now,
> which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
> What if we used the same basic idea that Anthropic used on their from-scratch C compiler? Start a brand new implementation, design it as a Rust library, then throw a swarm of agents at the problem
> Having parts of Git as discrete, embeddable slices of library also enables things like building custom Git servers or client functionality in Rust.
> The full build of all Git functionality in Rust is currently around 27M, but since a large part of it is a library, it could clearly be easily split up into domains of functionality - subcrates that do specific things. Perhaps you could simply use the subset you need.
The first part of this sentence (where in the GPL) is unreached if the second part of it is unmet (relicense code or derivatives) which I contend it likely is. You're begging the question.
However:
> The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work
earlier:
> A “covered work” means either the unmodified Program or a work based on the Program.
It's that element that would be difficult to prove "work based on the Program"
"here's a test suite, write code in rust that makes that suite pass" is reasonably supported by the article. That would likely not be a derivative work.
That's not actually the case at hand here - the agents were given the original source to reference: https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...
But for the sake of argument: The test suite itself is copyrighted. To the extent the resulting work is a derivative of the test suite it is possibly infringing. For example you might expect that the agent would derive variable names, function names, and the structure, sequence and organization of the code from the test suite. It might even copy comments wholesale. Those are copyrightable things. (Which is of course just the first step in analyzing if it is infringement, there would be interesting fair use, de minimis copying, etc. arguments following a conclusion that any of those were copyrighted. A product produced this way definitely could be infringing given the right facts though).
My use of the word "similar" does not imply here that I think it's obvious that they are "similar" in any copyrightable elements - whether they are or not is one of the interesting questions I think this case would have to resolve.
Incidentally you're also allowed to make similar creative elements so long as they aren't copies and you did so independently... which could actually come up in a case like this (imagine the LLM produced a similar function to some function in the original... but the original wasn't in the context window at the time. Not at all unlikely with code where there often is only one or two natural ways to write something).
Basically I use these "kits" to prove that the behavior is working as expected with mocked data/ interfaces and then only after these kits pass I'll run the real test suite files as confirmation. So these let you iterate a lot faster than the official test suite because it is very slow.
These are bootstrapped from the real tests.
The other commenter was being a bit dismissive but this is the kind of thing I'm taking away as a real useful pattern to do verification of behavior at scale.
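For readers wondering what such a "kit" might look like, here is a minimal Rust sketch under stated assumptions: the behavior under test talks to its dependency through a trait, the kit swaps in an in-memory fake, and each case runs on its own thread for a short feedback loop. Every name here (`Backend`, `MockBackend`, `handle`, `KitCase`, `run_kit`) is hypothetical and not taken from the commenter's nginx port.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical seam: whatever the real code would fetch over the network.
trait Backend {
    fn fetch(&self, path: &str) -> Option<String>;
}

/// In-memory fake used by the kit instead of a real server.
struct MockBackend {
    pages: HashMap<String, String>,
}

impl Backend for MockBackend {
    fn fetch(&self, path: &str) -> Option<String> {
        self.pages.get(path).cloned()
    }
}

/// Behavior under test: map a request path to a status code and body.
fn handle(backend: &dyn Backend, path: &str) -> (u16, String) {
    match backend.fetch(path) {
        Some(body) => (200, body),
        None => (404, String::from("not found")),
    }
}

/// One kit case: a request path and the status the real test expects.
struct KitCase {
    path: &'static str,
    expected_status: u16,
}

/// Run every case on its own thread and return the paths that failed.
fn run_kit(cases: Vec<KitCase>) -> Vec<&'static str> {
    let handles: Vec<_> = cases
        .into_iter()
        .map(|case| {
            thread::spawn(move || {
                // Each case builds its own mocked data, so cases never block each other.
                let mut pages = HashMap::new();
                pages.insert("/index.html".to_string(), "hello".to_string());
                let backend = MockBackend { pages };
                let (status, _body) = handle(&backend, case.path);
                (case.path, status == case.expected_status)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("kit thread panicked"))
        .filter(|(_, passed)| !*passed)
        .map(|(path, _)| path)
        .collect()
}
```

An empty result from `run_kit` would mean the fast checks pass, at which point the slower official test files can be run as confirmation.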
That is true, however did you actually do any research into nginx? Is it particularly prone to memory bugs?
I honestly don't know the answer but you seem to be coming from a place of C bad, therefore nginx super vulnerable?
In my experience with other web servers the vast majority of security bugs are string handling related (path/header injection), which your rewrite will not protect you from.
The project was inspired by that. Also unlike most other projects, nginx is often directly exposed to the internet, which makes it more vulnerable than, e.g., Redis/Valkey or something that would generally be running within a company's network.
"C Bad" is a bit reductionist... but I think there is some truth to the take "Until you have the evidence, don’t bother with hypothetical notions that someone can write 10 million lines of C without ubiquitous memory-unsafety vulnerabilities – it’s just Flat Earth Theory for software engineers" [1]
NSA and other government orgs are also pushing people to stop using C [2] for important software.
[1]. https://alexgaynor.net/2020/may/27/science-on-memory-unsafet...
[2]. https://linuxsecurity.com/news/government/nsa-s-plea-stop-us...
I have no idea why you are making me spell this out, I thought it was pretty obvious.
I could have missed them. I didn’t read everything. I did some quick searches.
But the fact they’re not obvious is kind of troubling. And the fact that they didn’t give the LLM only the tests and documentation, withholding the source to prevent it from looking, would hurt any case they had for clean-room privileges in my eyes, ignoring my other comment with concerns about using the tests at all.
IMO, IANAL, etc.
And we’ll ignore the question of what the fact the LLM has certainly seen the git code during training means.
But the test suite would have to stay under the original license. And if you use a GPL test suite as the kernel to develop a program from, can you license it non-GPL? I’d question that personally. Same acronyms above apply.
> A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an “aggregate” if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.
So assuming that sum(a, b) is non-infringing and not combined to form a larger program (i.e. the tests aren't compiled into the grit code), then the GPL explicitly doesn't apply to this use
But if you take all the individual tests used to test git as a whole, that seems far more unique. Seems like at that point you’re really having to duplicate the actual git internals, and that seems like it should be covered.
No one really knows what the endgame of software security looks like.
So some people should try the port to rust angle, some should focus on hardening the C, some should explore more exotic options like formally provable languages etc
Substitutability probably doesn't apply here in the way you're implying, and if it did it would likely be hampered by the 9th Circuit's findings about transformation in Sony v. Connectix. Arguments here would likely look at Rust not having a stable ABI, and hence not being inherently substitutable as a library (grit-lib); it is less clear as an executable (grit-cli) on that side.
Basics of copyright law - the fundamental thing being protected is the expression... is a Rust program's expression the same expression as a C program's? I'd say generally not.
yeah fair - the "The canonical Git source code we're targeting to replicate the functionality of is in the git/ subdirectory." part makes this hard to argue against.
> To the extent the resulting work is a derivative of the test suite it is possibly infringing
It's this bit that I have a problem with. If I run the test, it fails and reports a failure. Now I write code and run the test again. What is the theory there that code that I wrote infringes?
Simplify this down:
Assume the following is copyrighted:
    fn test_sum() {
        assert_eq!(sum(1, 1), 2);
    }
Does writing the following code:
    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }
infringe on the test copyright?
    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }
Doesn't infringe upon copyright, period, because there's no creative element in that work.
Imagine a more substantial example though. Perhaps you have a test that checks that some file written in a binary format is correct, and gives names (creative elements) to each field of the format that it prints when you mess up the field, and has comments describing why the bytes are laid out like they are (the comments being copyrightable even if the facts they describe aren't), and the LLM copies those field names and comments verbatim... Now it's quite likely that the LLM's work is a derivative of the test suite.
There's likely a threshold at some point. It's helpful to look at a minimal case and then continue from there though.
I'm curious if there's case law that supports your assertions here?
Feel free to extrapolate to the threshold where it's not and at that point apply.
> you’re really having to duplicate the actual git internals
Copyright covers the expression, not the method. So the Rust function:
    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }
is distinct from the C function:
    int sum(int a, int b)
    {
        return a + b;
    }
> “So long as the specific code used to implement a method is different, anyone is free under the Copyright Act to write his or her own code to carry out exactly the same function or specification...”
Here, given that this is Rust and the original expression is C, the implementations cannot be the same by definition.