The reason svn is broken is its "rep-sharing" feature, i.e. file content deduplication. It uses a SQLite database to share the representation of files based on their raw SHA1 checksum - for details see http://svn.apache.org/repos/asf/subversion/trunk/subversion/...
You can mitigate this vulnerability by setting enable-rep-sharing = false in fsfs.conf - see documentation in that file or in the source at http://svn.apache.org/viewvc/subversion/trunk/subversion/lib...
This feature was introduced in svn 1.6, released in 2009, and made more aggressive in svn 1.8, released in 2013: https://subversion.apache.org/docs/release-notes/
SVN exposes the SHA1 checksum as part of its external API, but its deduplication could easily have been built on a more secure foundation. Their decision to double down on SHA1 in 2013 was foolish.
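For anyone auditing a server for the fsfs.conf mitigation mentioned above, here is a rough Python sketch that reads the file's text and reports whether rep-sharing is still on. The section name `[rep-sharing]` and the "enabled when unset" default are my assumptions from memory, so check them against the comments in your own fsfs.conf. The function name is made up for illustration.

```python
import configparser


def rep_sharing_enabled(fsfs_conf_text: str) -> bool:
    """Return True if rep-sharing appears to be enabled in the given fsfs.conf text."""
    # interpolation=None so that any '%' characters in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(fsfs_conf_text)
    # Assumption: the option lives in [rep-sharing] and defaults to enabled
    # when the line is absent or commented out.
    return parser.getboolean("rep-sharing", "enable-rep-sharing", fallback=True)


sample = """
[rep-sharing]
enable-rep-sharing = false
"""
print(rep_sharing_enabled(sample))  # False
```

In practice you would pass in the contents of the repository's fsfs.conf instead of the inline sample.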
I rather believe it's a minor bug, and that once it is fixed they can keep using SHA1 as before, without the denial of service when somebody tries this. For example, if somebody tries to commit two files with the same SHA1 but different MD5, SVN can reject the second one before accepting it. Or, if two different files with the same SHA1 were both accepted and only one content is stored, SVN can still continue to work. You can't get the second file back unless you, for example, wrapped it in some archive format before committing it, but that's your problem; SVN would still work for everything else.
In short, it sounds like a denial of service at the moment, but I think that DOS can be avoided without changing the hash algorithm.
However, I'm sure that SVN is not the only code base that has never been tested with two different files that have the same SHA1.
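To make the "reject the second one" idea concrete, here is a minimal Python sketch of a deduplicating store that refuses new content whose key matches different stored bytes. It is not how SVN's rep-cache works; `DedupStore` and `CollisionError` are hypothetical names, and it compares the full bytes rather than the MD5 suggested above, which is stricter but makes the same point.

```python
import hashlib
from typing import Callable, Dict


class CollisionError(Exception):
    """Raised when new content shares a key with different stored content."""


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class DedupStore:
    """Content store keyed by a hash, with a full comparison before sharing."""

    def __init__(self, key_fn: Callable[[bytes], str] = sha1_hex):
        self._key_fn = key_fn
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = self._key_fn(data)
        existing = self._blobs.get(key)
        if existing is None:
            self._blobs[key] = data       # first time this key is seen
        elif existing != data:
            raise CollisionError(key)     # same key, different bytes: refuse
        return key                        # identical content is shared

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

The key function is injectable only so the collision path can be exercised without real colliding files; with the default it keys on SHA1.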
Andreas Stieger (SUSE, SVN) has written a pre-commit hook script which rejects commits of shattered.io style PDFs
https://svn.apache.org/viewvc/subversion/trunk/tools/hook-sc...
This is the first mitigation available. If you are responsible for an SVN server at risk, please make use of this hook.
If somebody could make a similar hook for Windows and post it here or to dev@subversion.apache.org that would be highly appreciated.
(edit: switched script link to HTTPS)
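For people who cannot run the shell hook, the general idea can be sketched in Python as a denylist check. This is not a port of the script linked above, and I have not verified how that script detects the files. The sketch simply rejects content whose SHA-1 equals the digest of the two published PDFs as quoted elsewhere in this thread, so it would not catch other files built from the same collision. `should_reject` is a hypothetical name.

```python
import hashlib

# SHA-1 shared by shattered-1.pdf and shattered-2.pdf, as quoted in this thread.
KNOWN_COLLIDING_SHA1 = frozenset({"38762cf7f55934b34d179ae6a4c80cadccbb7f0a"})


def should_reject(data: bytes, denylist=KNOWN_COLLIDING_SHA1) -> bool:
    """Return True if the content's SHA-1 is on the denylist."""
    return hashlib.sha1(data).hexdigest() in denylist


print(should_reject(b"ordinary file contents"))  # False
```

A real hook would still need to pull each added or modified file out of the pending transaction and feed it to a check like this.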
Of course, I first tested this on our main production repository at work because...oh, wait, I didn't because what were you thinking?!
My guess is that Git wouldn't be 'hosed' like SVN, since it currently doesn't have a secondary hash to detect the corruption. It would simply restore the wrong file without noticing anything was amiss.
Why hasn't Git switched to SHA-2? People have been warning that SHA-1 is vulnerable for over a decade, but that vulnerability was dismissed with a lot of hand-waving over at Git. Is it a very difficult technical problem to switch, or just a problem of backward compatibility for existing repos (i.e., it would be expensive to change everything over)?
I really tried with SVN (wanted something better than CVS) for quite a long time.
I much prefer that git's designed to let me do such things and provides tools for doing so, but you can totally rewire svn repos with vi and a bunch of swearing if necessary.
(and I was using svk for a merge tool at the time so I did have the option to burn it down and rebuild from scratch; unhosing svn repos wasn't quite unpleasant enough for me to want to do so)
Then again, I started off doing more ops than dev and have also happily hand-edited mysql replication logs to unfuck things after a partial failover, so I may have more of a masochistic streak than you do :)
The bugs you can expect from software that assumed no hash collisions are going to be pretty arbitrary. There was that stack overflow post about what happens with Git with collisions and it didn't seem great either, it's just that what gets hashed happens not to collide in this case.
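On "what gets hashed happens not to collide": Git does not hash the raw file. For a blob it hashes a short header followed by the content, which a few lines of Python can reproduce. The function name is mine; the empty-blob ID in the comment is the well-known one.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 of a Git blob: 'blob <size>' + NUL byte + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the header goes in front, two files with the same raw SHA-1 generally get different blob IDs, unless the collision was constructed with that header in mind.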
https://lists.webkit.org/pipermail/webkit-dev/2017-February/...
> For the record: the commits have been deleted, but the SVN is still hosed.
> Subversion 1.8 avoids downloading pristine content that is already present in the cache, based on the content's SHA1 or MD5 checksum.
https://subversion.apache.org/docs/release-notes/1.8.html#pr...
I'm guessing shattered-1.pdf and shattered-2.pdf have identical hashes but distinct contents. It's not clear to me why this results in a "checksum mismatch."
Checksum mismatch: LayoutTests/http/tests/cache/disk-cache/resources/shattered-2.pdf
expected: 5bd9d8cabc46041579a311230539b8d1
got: ee4aa52b139d925f8d8884402b0a750c
EDIT: see https://news.ycombinator.com/item?id=13725312 for the answer
$ sha1sum shattered*
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-2.pdf
$ md5sum shattered*
ee4aa52b139d925f8d8884402b0a750c shattered-1.pdf
5bd9d8cabc46041579a311230539b8d1 shattered-2.pdf
As you can see.
[1] https://news.ycombinator.com/item?id=13715146
[2] https://www.iacr.org/archive/crypto2004/31520306/multicollis...
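One plausible reading of the "Checksum mismatch" output above is a cache keyed by one hash and verified by another. Here is a toy Python simulation of that failure mode. It is a guess at the mechanism, not Subversion's code. Since real SHA-1 collisions can't be pasted into a comment, `weak_key` keeps only the first byte of the SHA-1 as a stand-in, so a collision can be found by pigeonhole. All names are hypothetical.

```python
import hashlib


def weak_key(data: bytes) -> str:
    # Stand-in for SHA-1: only the first byte, so collisions are easy to find.
    return hashlib.sha1(data).hexdigest()[:2]


def find_collision():
    """Return two different inputs that share a weak_key."""
    seen = {}
    for i in range(257):  # 257 distinct inputs, 256 possible keys
        data = str(i).encode()
        key = weak_key(data)
        if key in seen:
            return seen[key], data
        seen[key] = data
    raise AssertionError("unreachable: pigeonhole guarantees a collision")


def fetch(cache, key, expected_md5):
    """Look up content by key, then verify it against the expected MD5."""
    data = cache[key]
    got = hashlib.md5(data).hexdigest()
    if got != expected_md5:
        raise ValueError(f"Checksum mismatch: expected {expected_md5}, got {got}")
    return data


first, second = find_collision()
cache = {weak_key(first): first}  # the first file is cached under the shared key
try:
    # Asking for the second file hits the cache entry of the first one.
    fetch(cache, weak_key(second), hashlib.md5(second).hexdigest())
except ValueError as err:
    print(err)
```

The "expected" value is the MD5 of the file that was requested and "got" is the MD5 of the file that was cached first, which lines up with the md5sum output for shattered-2.pdf and shattered-1.pdf.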
A hash is not a unique identifier. Period. It's only useful as a quick filter before you do a full comparison.
It's amazing how resistant people are to using hashes safely. They willfully ignore the birthday paradox and say LA LA LA 1/(2^128)=0 and because they haven't lost all their data yet they tell themselves that their shoddy practices are OK.
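For the birthday paradox point, the usual approximation is easy to put in Python. It assumes an ideal hash with uniformly random outputs, so it only covers accidental collisions, not deliberate attacks on a weakened function. The function name is mine.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n random hash values.

    Uses the standard birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny probabilities


print(collision_probability(2 ** 64, 128))  # roughly 0.39
print(collision_probability(10 ** 9, 128))  # tiny, but not zero
```

The takeaway is that the relevant number is about the square of the item count over the hash space, not 1/(2^128) per item.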
Edit: Mitigate it rather, not solve it.
A bit of both. Git has an ongoing effort to replace everywhere in the source code that passes around SHA1s as fixed-size arrays of bytes with a data structure. That'll make it possible to replace the hash. But even with that work, git will still need to support reading and writing repositories in the existing format, and numerous interoperability measures.
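To illustrate what "a data structure" buys over a fixed-size byte array, here is a small Python sketch of an object ID that carries its algorithm along with the digest. Git itself is written in C and its real structure will look different; `ObjectId` and its fields are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    algo: str      # algorithm name as accepted by hashlib.new, e.g. "sha1"
    digest: bytes  # raw digest; length depends on the algorithm

    @classmethod
    def of(cls, algo: str, payload: bytes) -> "ObjectId":
        return cls(algo, hashlib.new(algo, payload).digest())

    def hex(self) -> str:
        return self.digest.hex()


old = ObjectId.of("sha1", b"example")
new = ObjectId.of("sha256", b"example")
print(len(old.digest), len(new.digest))  # 20 32
```

Code that passes `ObjectId` values around no longer bakes in a 20-byte length, which is the part that makes swapping the hash possible at all.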
Back then Linus shot this down in his typical abrasive fashion:
> It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want.
The key phrase being "looks halfway sane". Git doesn't just look at the hash. It looks at the object structure too (headers) and that makes it highly resistant to weaknesses in the crypto alone. His point essentially is you should design to expect crypto/hash vulnerabilities, and that's a smart stance, as they are discovered every few years.
Linus was not talking about the object headers, but about the object contents. It's harder to make the colliding objects look like sane C code, without some strange noise in the middle (which wouldn't be accepted by the project maintainers).
Yes, it's a "C project"-centric view, but consider the date: it was the early days of git. The main way of receiving changes was emailed patches, not pull requests. Binary junk would have a hard time getting in. And even if it did get in, the earliest copy of the object wins, as long as the maintainers added "--ignore-existing" to the rsync command in their pull scripts (yeah, this thread seems to be from before the git fetch protocol), as mentioned earlier in the thread.
> You are _literally_ arguing for the equivalent of "what if a meteorite hit my plane while it was in flight - maybe I should add three inches of high-tension armored steel around the plane, so that my passengers would be protected".
> That's not engineering. That's five-year-olds discussing building their imaginary forts ("I want gun-turrets and a mechanical horse one mile high, and my command center is 5 miles under-ground and totally encased in 5 meters of lead").
> If we want to have any kind of confidence that the hash is really unbreakable, we should make it not just longer than 160 bits, we should make sure that it's two or more hashes, and that they are based on totally different principles.
> And we should all digitally sign every single object too, and we should use 4096-bit PGP keys and unguessable passphrases that are at least 20 words in length. And we should then build a bunker 5 miles underground, encased in lead, so that somebody cannot flip a few bits with a ray-gun, and make us believe that the sha1's match when they don't. Oh, and we need to all wear aluminum propeller beanies to make sure that they don't use that ray-gun to make us do the modification _ourselves_.
> So please stop with the theoretical sha1 attacks. It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want. Even the "breakage" doesn't actually do that. And if it ever _does_ become true, it will quite possibly be thanks to some technology that breaks other hashes too.
> I worry about accidental hashes, and in 160 bits of good hashing, that just isn't an issue.
I don't mean this to say that you are being inaccurate, just that his current position seems a little different now:
"Again, I'm not arguing that people shouldn't work on extending git to a new (and bigger) hash. I think that's a no-brainer, and we do want to have a path to eventually move towards SHA3-256 or whatever" http://marc.info/?l=git&m=148787457024610&w=2
I was just answering the question "Why hasn't Git switched...People have been warning that SHA-1 is vulnerable for over a decade"
Linus' 12-year-old opinions are the relevant thing for why it hadn't changed. A decade from now, things may be different.
> Do we want to migrate to another hash? Yes.
Our interstellar successors, if any, will probably have found something better than Git to use.
EDIT: I should be clear that I'm not making the usual silly claim that we don't need to worry about hashes being broken because they take forever to brute force. I'm saying that hashes will be broken, but not by brute forcing the entire hash space. A decade or so of cryptographic research will save you eons of compute time.
The 2^80 space cost of doing a birthday attack is a lot more notable, but it's not infeasible either. Yearly hard drive production is somewhere around 2^70 bytes. You're not pulling that attack off in the year 2017, but with a big budget you could probably get there in a few decades.
But checking 2^80 hashes and writing them to long term storage is still ridiculous. That budget should still go to hiring cryptographers, not buying hard drives.
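A quick back-of-the-envelope check on the storage figures above, in Python. It assumes the table holds nothing but raw 20-byte SHA-1 digests and takes the 2^70 bytes per year figure at face value; both are simplifications, and the function name is mine.

```python
DIGEST_BYTES = 20              # a SHA-1 digest is 160 bits
HASHES = 2 ** 80               # birthday-attack table size discussed above
YEARLY_DRIVE_BYTES = 2 ** 70   # rough yearly drive production quoted above


def years_of_drive_production(hashes: int = HASHES,
                              digest_bytes: int = DIGEST_BYTES,
                              yearly_bytes: int = YEARLY_DRIVE_BYTES) -> float:
    """How many years of world drive output the digest table would consume."""
    return hashes * digest_bytes / yearly_bytes


print(years_of_drive_production())  # 20480.0
```

So at a flat 2^70 bytes per year the table alone is about twenty thousand years of output, and "a few decades" only works if production keeps growing exponentially.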