Did Claude increase bugs in rsync?

Did Claude increase bugs in rsync?(alexispurslane.github.io)

153 points by logicprog 7 hours ago | 149 comments

aesthesia 1 hour ago |

I don't have a dog in this fight, but a few points that look a little suspicious:

- The release with the highest number of attributed bugs is the release _right before_ the first release with Claude-coauthored commits, released in January; is there a chance that unattributed LLM-authored commits made it into this release?

- The release attribution methodology is not great, since it will tend to attribute bugs introduced in a minor version update to the longest-lived patch release of that minor version. I doubt that 3.4.1 actually introduced a lot of bugs, but since it was released a day after 3.4.0, bugs that were introduced in that release get attributed to 3.4.1.

- Relatedly, more recent releases have had less time to have bugs filed against them, so there may be a bit of a bias toward evaluating recent releases as less buggy.

OptionOfT 1 hour ago | |

You can use LLMs in multiple ways, from very hands on to make local changes to completely hands-off.

I've seen plenty of code that was LLM generated but the commit message itself did not have the co-author attached to it. This only seems to happen when someone's interface to the codebase is completely though Claude/Codex/..., and those are usually the most verbose commits, and yet they say the least, because they just summarize the code changes, not the why.

On the other hand I've seen developers using Claude as a tool. They have VSCode open and a terminal window with Claude and go back and forth, ensuring they write correct code, and leave the plumbing to Claude.

So maybe the author of the code started off small and it grew over time?

logicprog 1 hour ago | |

Your first and second points seem to contradict each other because if all of the bugs for 3.4.1 should be attributed to 3.4.0, that pushes the timetable back even further that unattributed LLM commits would have to have been being committed to the project, which just makes your point even more absurd.

Which brings me to my overall response, which is that there is absolutely no evidence, and nothing even intimating this hypothesis, that LLM commits were secretly being added to earlier releases before they were attributed, and that's why the rate of bugs is higher. There's no reason to think that it's an unreasonable thing to think, and there's no evidence for that whatsoever unless you beg the question and assume that higher bug counts must automatically indicate AI involvement, which is just circular reasoning. You're essentially just making up a hypothesis out of thin air to preserve your point.

Regarding your third point, that one's fair, but I've done the analysis and I can put it up if you want, as to how long it usually takes to find bugs and how far through the release cycle we are for each version.

aesthesia 1 hour ago | | |

Sorry, I should have said this explicitly in the original comment: I think you're likely _correct_ that there isn't a clear increase in the rate of bugs attributable to LLM-authored code in rsync. Your analysis provides evidence in this direction; these are just the things that made me go "hmm". They're not accusations or claims that the conclusion is invalid. But they're definitely things to be curious about.

Regarding unlabeled LLM-authored commits, I don't think it's unreasonable in general to think that an open-source project might have had unlabeled LLM-authored commits at some point before 2026. Looking more closely at rsync's recent commit history, I think it's less likely in this case. There's just a low number of commits in general, _until_ large batches of Claude-authored commits start showing up early this year. But this then raises some questions about the bugs-per-commit metric; it does correct for something like "size of release", but also obscures a significant shift in commit velocity that may be downstream of adding LLM development tools to the workflow.

Like I said, I don't have a dog in this fight, and I try not to approach sorts of questions from a position of explicit advocacy. I do think it's an interesting question, though, and we should try to understand what the data is actually telling us.

jonquark 1 hour ago | | |

Isn't the metric that you've used "bugs per commit ~ per new line of code" going to miss the issue?

All code is technical debt.

If rsync releases used to have 500 lines changed and 5 bugs in and AI-powered rsync releases have 50000 lines and 500 bugs, it's the same bugs/line but much worse experience for the user?

I've not looked into the details of this case and I do use AI assistance coding at work but in my experience, the problem is that it's too easy to write lots of code and therefore hard to review the huge volumes of code and this analysis will ignore that?

edit: actually your table shows there weren't unusually large numbers of commits in this release, so perhaps my initial skepticism shows a bias I have?

thorum 1 hour ago |

Unfortunately for the people mad about this, I predict the only thing they will accomplish by pressuring the rsync maintainers, is to discourage everyone else from responsibly disclosing their use of AI. You’re just going to make people disable Claude attribution on their commits to avoid drama.

mwkaufma 12 minutes ago |

Smokescreen of highly-contingent analysis and appeals to authority over a premotivated-conclusion.

logicprog 3 minutes ago | |

- Appeals to... what authority, exactly?

- All analysis is contingent.

- How do you know the conclusion was premotivated, and does it matter if the analysis, which is attempting to be as objective and extremely reproducible as possible, holds up?

- The whole point is that there's no actual evidence for what you are claiming, so why does it being highly-contingent cause a problem for me, when that just further shows there's no evidence for what the anti-AI crowd is saying?

- Why do the anti-AI crowd get to state wide, absolute, objective claims with cherry-picked anecdotes as their only evidence, but the pro-AI crowd is not allowed to respond the same way, and when we then go out of our way to respond in a far more thorough, rigorous, and objective way than you ever did, that's just more evidence for our guilt? It's a Kafka trap. You can't win.

scsh 7 hours ago |

> It does not control for commit complexity, security intensity, or bug severity. It does not distinguish between a one-line typo fix and a CVE patch. It is a blunt instrument. But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

If by fairest you mean to say that this analysis and response is sufficient, then I'm sorry but I have to disagree. We really need to understand if the nature of the bugs are worse from a user's perspective. Even if the rate stayed unchanged, if the result is the perceived quality of the software declined then I would personally consider that worse, especially if I were a project maintainer.

That's not meant to be wholly dismissive either. But in general, I don't think quantitative analysis alone is enough to fully answer this type of question.

MostlyStable 35 minutes ago | |

That which is asserted without evidence can be dismissed without evidence. This is more evidence, and of greater rigor, than was used to make the assertions. That's good enough for me. If someone wants to actually do the work to support the original claims with better evidence, great. I'd love to see it. Until then, I'm going to not worry about this issue.

skeledrew 6 hours ago | |

But it is fair. Up to this point I have yet to see anyone say they did an analysis of the code and found X regressions of Y severity. All they say is "there are more bugs because LLM". This analysis, which you can verify yourself if you wish, says "the bugs [number of] are pretty average even with LLM", which is a direct response to that. If you'd like a more nuanced analysis you're welcome to do one and share the result, if you're so inclined.

mikaeluman 1 hour ago |

Not going to critique this survey. Must have taken a lot of time and required a lot of patience. Great work!

I think it will be up to some group in academia to make a real full blown study across several repositories.

There must be tons to learn on how LLMs have changed software development and perhaps the cleanest separation will simply be going by what repositories declare e.g. "No LLM involved" vs those that proudly do the opposite or are neutral.

Bugs is not the only variable of interest here. I am guessing someone is already doing this as we discuss it here...

logicprog 6 hours ago |

Okay, I really have to point out to everyone: the numbers and report cards are TEMPLATED IN BY A SCRIPT. Hallucinations are a moot point. https://github.com/alexispurslane/rsync-analysis/blob/main/s...

tptacek 46 minutes ago |

This is a neat post and I'm glad it got written and this is a little bit off-topic but:

Hey, 'logicprog, your writing is fine!

Use LLMs to critique your writing, check its structure, vet your choice of topic sentences, check flow from graf to graf and section to section, look for passive voice and overused words. LLMs are fantastic for that. But don't use a single word an LLM suggests in your actual writing. If it suggests something really fucking good, too bad, those words are disqualified. It's an easy red line to adhere to, easier than it sounds, and it'll keep your writing human.

(You ended up somewhere around here anyways, but that was after you posted something with LLM-written language because you weren't confident enough in your own writing. The things you do "worse" than an LLM are what make you you; be protective of them!)

logicprog 28 minutes ago | |

Thank you!

geraneum 7 hours ago |

> But the critics' accusation is also blunt: "Claude is making things worse." A blunt instrument is the fairest response.

So the criticism was bad, and that somehow makes it ok to use a bad metric?

logicprog 7 hours ago | |

That's not what I'm saying. What I'm saying is that if the criticism is referring to a broad set of metrics like bugs per release and number of commits that were made by Claude, then it's correct to look at precisely those things because that's what the claim is about.

abirch 1 hour ago | |

AI + Interest != Expertise

I come to hn because I get very nuanced, informed information and glorious puns.

faitswulff 7 hours ago |

> The analysis uses a single metric: bugs per 10 commits (bugs/10c).

Bugs per commit as a metric papers over severity, both in terms of security severity as well as the effect on the user. A mislabeled button has the same weight as the entire app crashing in this framework.

logicprog 56 minutes ago |

Another update: did an automated severity analysis on each bug report (~2000 of them!) using an LLM at temp=0 with a very strict rubric (and I checked to make sure that it rated things in a consistent, stable way using it). The rubric, LLM used, and some example ratings are included in the methodology section. For now, the information was just stored per-bug in the DuckDB and used to filter out non-bug bugs, to get a clearer signal. I'm going to try to use it to see if the post-Claude bugs were more severe in any way next.

KronisLV 36 minutes ago |

Pretty cool site!

> v3.4.3 has been out long enough that its rate (5.00) is already comparable to historical releases. The "wait and see" argument is an appeal to an unknowable future that shifts the burden of proof away from the critics. If more bugs surface, they will enter the distribution like every other release. There is no reason to expect a regime break.

I mean, as someone who uses LLMs, it might be a good idea to consider how one might limit the amount of bugs that will appear in the future at least a little bit: parallel iterative code review loops would probably be the easiest and most applicable to LLMs, though I guess test coverage and other code analysis tools help too.

PunchyHamster 13 minutes ago |

The fact last few commits were attributed to claude doesn't mean previous ones didn't use it.

Also if you write a paper where you get statistical conclusions out of whole 2 datapoints you'd be laughed out of the room

rovr138 7 hours ago |

I'm just curious about testing.

Is this a configuration that's not common and thus not tested?

If people think they can do better, I want to see their forks and them keeping up with it.

https://github.com/RsyncProject/rsync/graphs/contributors?fr...

Polarity 7 hours ago |

so the answer is: no. actaully less bugs. thanks

gjvc 58 minutes ago | |

"fewer"

overgard 1 hour ago |

The TLDR seems to be: needs more data.

yobid20 26 minutes ago |

needs a tldr; im not reading all that. maybe claude can summarize it for me.

wookmaster 7 hours ago |

Claude is just a tool ? The developers who merged that code and didn't properly test increased the bugs.

everdrive 7 hours ago | |

"Did cars increase traveling deaths?"

"Cars are just a tool. The drivers who piloted the vehicles and weren't careful enough [are responsible for the deaths.]"

roywiggins 7 hours ago | |

If something's a bad tool that misleads people into doing bad work, it would be good to know that.

runarberg 15 minutes ago | |

Feels like something a bad (and potentially dangerous) tool would say.

Angostura 7 hours ago | |

This tool is claimed to be able to find and fix bugs.

ebiederm 5 hours ago | |

Please read the article.

The unsolicited security reports are the issue.

pushcx 5 hours ago |

    What followed was extraordinary: 329 comments and counting, ranging from thoughtful concern to outright harassment.
    The thread did not stop at words. One user posted My Little Pony drawings of themselves strangling the "project janitor that pushed vibecoded commits":
    It spread to Hacker News and Lobsters, generating hundreds more comments.

This is false, it did not appear on Lobsters. Here is the function in the codebase that prohibits this kind of brigading: https://github.com/lobsters/lobsters/blob/main/app/models/st...

Please correct your article.

tptacek 1 hour ago | |

It is neat that Lobsters has this feature (and HN should too), and I'm glad you took a beat to explain it. I think you didn't need the last sentence, though.

logicprog 1 hour ago | |

I have done so! that was a misremembering on my part. first mention of Lobsters is now here:

> On Lobste.rs, in response to the Medium essay Tridge himself posted in response, finally some users like boramalper begin to actually ask for evidence one way or another:

nairboon 7 hours ago |

Is this an analysis made by/with Claude?

overgard 58 minutes ago | |

FWIW, I asked ChatGPT to review the article just for my amusement. It's conclusion was:

"My honest assessment is that this is a competent calculation performed on a badly confounded measurement, followed by conclusions substantially stronger than the calculation warrants. It is useful as a rebuttal to “the Claude releases are obviously unprecedented disasters,” but not as evidence that Claude was harmless."

quentindanjou 7 hours ago | |

It very obviously is. "The Outlier Nobody Noticed" -_-"

dang 1 hour ago |

[stub for offtopicness]

[see https://news.ycombinator.com/item?id=48416020 for how all this happened in the first place]

jrflowers 3 minutes ago |

Tl;dr:

Yes, it did. Here is some math showing that you shouldn’t care about that.

MagicMoonlight 1 hour ago |

Typical AI slop post. It’s pretty hilarious that he just added spaces before the emdashes and claims it’s human written.

If I’m hiring and I see this kind of slop, I ain’t hiring you.

Etheryte 1 hour ago | |

Emdashes don't really tell you much anything these days tbh. Many languages use them regularly and those people often bring the habit with them when they write in English — me included. Plus I would imagine every major model has tuned them way down at this point due to the backlash.

logicprog 1 hour ago | |

I rewrote all the AI prose several hours ago with purely my own. I like em-dashes, and specifically use them with spaces as a habit. I don't know what to tell you.

the_real_cher 7 hours ago |

Is there a non vibe coded fork of rsync?

throwaway7356 7 hours ago | |

Yes: https://news.ycombinator.com/item?id=48390931

So far it reintroduced several security issues and replaced the README.md.

gadrev 7 hours ago |

Ok.

  $ apt-cache policy rsync | grep Installed
    Installed: 3.4.1+ds1-7ubuntu0.2
  $ sudo apt-mark hold rsync     
    rsync set on hold.

imurray 6 hours ago | |

That version has security fixes from the same day as the latest rsync release: https://ubuntu.com/security/notices/USN-8283-1

As usual, Ubuntu backported fixes and didn't upgrade to a new version. Whether or not they also backported regressions in edge cases that afflict the latest rsync, I don't know. Pinning the Ubuntu package may prevent getting further regressions, but is preventing you getting any future such backported security fixes.

logicprog 7 hours ago | |

Did you face any actual bugs or regressions? Or are you doing this just because of the bandwagon that's going around right now? Because until you can actually present an argument for why this release is worse than any of the others, which is precisely the subject of my post, then this is not an argument against my post at all. This is just a self-referential appeal to authority.

gadrev 6 hours ago | | |

Nah, I skimmed TFA but then I went into the linked GH issues thread, and that's the one that scared me a bit. I just want to hold it for a while and not run into some of the things I'm reading since I'm on the latest ubuntu. Just a precaution.

I didn't have the time to actually think about any "arguments" at all tbh it's just a knee jerk reaction as I get ready to log off for the weekend. Not actually looking to argument for or against your post at all lol.

So my systems recently updated to rsync 3.4.3, and as soon as that happened my backup system - which does incremental backups using multiple --compare-dest= arguments - started to fail on anything but a full backup.