The beginning of Git supporting other hash algorithms(github.com)

427 points by _qxtl 9 years ago | 125 comments

bk2204 9 years ago |

I'm the person who's been working on this conversion for some time. This series of commits is actually the sixth, and there will be several more coming. (I just posted the seventh to the list, and I have two more mostly complete.)

The current transition plan is being discussed here: https://public-inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a...

rurban 9 years ago | |

I do like your hashname/nohash idea. If we could come up with a simple compression negotiation protocol also: zlib -> zstd. But this will be much harder, as hashes are internal only, and compression is in the protocol.

Kudos to brian m. carlson for convincing Linus to use SHA3-256 over SHA-256. This is really the only sane option we have.

weinzierl 9 years ago | | |

I don't understand what you mean by "hashes are internal only". Aren't the SHA-1s everywhere right now? I mean not only in the protocol but also in the UI, and from there they have even spread into bug trackers, documentation and so forth.

lisper 9 years ago | | |

> this is really the only sane option we have

Why?

weinzierl 9 years ago | |

Did you make any measurements or back-of-the-envelope calculations of what the real-world performance impact of this change is?

I don't expect anything horrible, but still curious.

EDIT: After skimming OP I found a few answers.

The message from the Keccak Team [1] is especially interesting. The summary is that we don't have to worry about performance degradation from the hash calculation itself. There is a palette of functions which are considered to have a "security level [...] appropriate for your application" and are considerably faster than SHA-1.

[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...

hsivonen 9 years ago | | |

If git changed to BLAKE2b, I'd expect a perf improvement over SHA-1.
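For anyone who wants numbers from their own machine, here is a rough micro-benchmark sketch using only Python's hashlib. The 32-byte output length for SHAKE128 and BLAKE2b is an assumption picked to match SHA-256, and Python-level timings only hint at what a C implementation inside Git would see.

```python
import hashlib
import time

# Candidate functions named in this thread. The 32-byte output sizes for
# SHAKE128 and BLAKE2b are assumptions chosen to match SHA-256.
CANDIDATES = {
    "sha1": lambda data: hashlib.sha1(data).digest(),
    "sha256": lambda data: hashlib.sha256(data).digest(),
    "sha3_256": lambda data: hashlib.sha3_256(data).digest(),
    "shake128": lambda data: hashlib.shake_128(data).digest(32),
    "blake2b": lambda data: hashlib.blake2b(data, digest_size=32).digest(),
}


def benchmark(size=1 << 20, rounds=5):
    """Return the best wall-clock seconds needed to hash `size` zero bytes."""
    data = bytes(size)
    results = {}
    for name, fn in CANDIDATES.items():
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            fn(data)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


if __name__ == "__main__":
    # Fastest first; absolute numbers depend on CPU and the OpenSSL build.
    for name, seconds in sorted(benchmark().items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {seconds * 1e3:8.3f} ms per MiB")
```

Results vary a lot with hardware SHA extensions, so treat the ranking as a local data point rather than a general claim.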

sorenbs 9 years ago | |

Out of curiosity: when did you start to take the first serious steps in this direction?

bk2204 9 years ago | | |

From the commit history, 2015 (commit 5f7817c85d4b5f65626c8f49249a6c91292b8513).

I proposed the idea of improved compile-time checking and maintainability, as there wasn't originally much interest in a new hash function, but the maintainability improvements were something people could go for.

I hadn't spent as much time working on it as I am now, so it moved slowly. Other people also helped by converting parts of the code that they were working on (like parts of the refs subsystem).

drostie 9 years ago | |

I'm not quite so familiar with the Git internals, how do you deal with the problem of having different non-leaf nodes scattered through the directory tree?

This might be a non-issue based on how Git stores the tree, but I can imagine one simple model where each directory would be a sort of "collection object", a binary encoding of a list of (filename, hash) pairs in filename order, and therefore the directory gets a hash of its own. But that means that when you're communicating with a SHA-1 repository you don't just need to rename this object; its contents also need to be changed pre-rename, and then you need to store every internal node twice. I'm not seeing that in your summary.

Is it just that Git doesn't have any internal nodes in the directory tree per se because the "filename" is a full POSIX path with subdirs? Or what?

evmar 9 years ago | | |

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects has descriptions of the objects. Both trees and commits are hashes over data that includes hashes of other objects so they must be different. The doc discusses converting them at transmission time, search for [convert to sha256] in it.
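To make that concrete, here is a minimal sketch of how a tree embeds the raw hashes of its children, following the object layout in the linked Pro Git chapter. Using the same framing with SHA-256 is an assumption about the transition plan, and the helper names are made up for illustration.

```python
import hashlib


def object_id(algo, kind, body):
    # Git hashes a "<kind> <size>" header, a NUL byte, then the body.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.new(algo, header + body).digest()


def tree_body(algo, entries):
    # Each entry is "<mode> <name>", a NUL byte, then the child's RAW hash.
    # Sorting by name is enough for a flat list of files; real Git has
    # extra ordering rules for subdirectories that are skipped here.
    body = b""
    for mode, name, content in sorted(entries, key=lambda e: e[1]):
        child = object_id(algo, "blob", content)
        body += f"{mode} {name}".encode() + b"\0" + child
    return body


entries = [("100644", "hello.txt", b"hello\n")]
sha1_tree = tree_body("sha1", entries)
sha256_tree = tree_body("sha256", entries)

# The tree bytes themselves differ (20-byte vs 32-byte child hashes),
# so a tree cannot simply be renamed; it has to be rewritten.
print(len(sha1_tree), len(sha256_tree))
print(object_id("sha1", "tree", sha1_tree).hex())
print(object_id("sha256", "tree", sha256_tree).hex())
```

Because the difference is in the serialized body and not just in the name, every tree and commit needs converting, which is the point drostie was asking about.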

snakeanus 9 years ago | |

>b. A SHA256 repository can communicate with SHA-1 Git servers and clients (push/fetch).

Wouldn't fetching from a sha-1 repository degrade security? I think it would be better to show a warning (similar to how openssh does with 1024 bit dsa keys) every time you try to fetch from a SHA-1 git repo. Same for pushing a signed commit to a sha-1 repository.

bhhaskin 9 years ago | | |

The sha1 hash isn't used for security. You should be signing your commits if security is a concern.

lvh 9 years ago |

From a cryptographer's perspective, everything around SHA-3 is a little weird. We ended up with something that's pretty slow even though we had faster things, for which general consensus was that they were just as strong. Similarly, consensus was that some SHA-3 candidates made it as far as they did because they are drastically different from previous designs. Picking a major standard takes a while, and immediately preceding it we saw scary advances in attacks on traditional Merkle-Damgard hashes like SHA-0, SHA-1. Not SHA-2, but it's pretty similar, so the parallels are obvious.

Now that we have SHA-3, we ended up with a gazillion Keccak variants and Keccak-likes. The authors of Keccak have suggested that Git may instead want to consider e.g. SHAKE128. [0]

[0]: https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...

It's a bit unfortunate that this is really a cryptographic choice, and it seems to mostly be made by non-cryptographers. Furthermore, the people making that choice seem to be deeply unhappy about having to make it.

This makes me unhappy, because I wish making cryptographic choices got much easier over time, not harder. While SHA-2 was the most recent SHA, picking the correct hash function was easy: SHA-2. Sure, people built broken constructions (like prefix-MAC or whatever) with SHA-2, but that was just SHA-2 being abused, not SHA-2 being weak.

A lot of those footguns are removed with SHA-3, so I guess safe crypto choices are getting easier to make. On the other hand, the "obvious" choice, being made by aforementioned unhappy maintainers, is slow in a way that probably matters for some use cases. On top of that, not even the designers think it's an obvious choice, I think most cryptographers don't think it's the best tool we have, and we have a design that we're less sure how to parametrize. There are easy and safe ways to parametrize SHA-3 to e.g. fix flaws like Fossil's artifact confusion -- but BLAKE2b's are faster and more obvious. And it's slow. Somehow, I can't be terribly pleased with that.
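As an illustration of the BLAKE2b parametrization mentioned above, here is a small sketch using the personalization parameter that Python's hashlib exposes as `person`. The type labels and the `typed_digest` helper are hypothetical; this is not how Fossil or Git actually do it.

```python
import hashlib


def typed_digest(kind: bytes, payload: bytes) -> str:
    # `person` is BLAKE2b's built-in personalization string (at most 16
    # bytes). Using the object type there separates the hash domains, so
    # identical bytes stored as different types get unrelated digests.
    return hashlib.blake2b(payload, digest_size=32, person=kind).hexdigest()


payload = b"same bytes, different meaning"
as_blob = typed_digest(b"blob", payload)
as_manifest = typed_digest(b"manifest", payload)

print(as_blob)
print(as_manifest)
print(as_blob != as_manifest)
```

The design point is that the type tag lives in the hash function's own parameter block rather than in a prefix glued onto the data.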

lvh 9 years ago |

FWIW, Fossil released a version with backwards compatibility and configurable graceful upgrades a week ago: https://www.fossil-scm.org/index.html/doc/trunk/www/changes....

wolf550e 9 years ago | |

Dmitry Chestnykh wrote a little about problems with the documented security claims of Fossil SCM 3 days ago:

https://twitter.com/dchest/status/842489752892968960

https://twitter.com/dchest/status/842498609652383744

david-given 9 years ago | | |

Given that both claims are unreferenced and using deliberately provocative language, I'd say he wrote very little...

corbet 9 years ago |

This work actually began in 2014... https://lwn.net/Articles/715716/

VMG 9 years ago |

Is there some explainer on what the support will look like in the end? I'm curious to know how multiple hash algorithms will be supported in parallel.

pyed 9 years ago | |

Probably newer versions will commit only using a new hash algorithm, while completely able to deal with the old one

ebbv 9 years ago | | |

Can it really be that simple though? If you are using a newer version of Git on your repo which is committing only with the newer hash and I try to clone your repo with an older version I will be unable to do so. I guess maybe that's acceptable though?

AlexCoventry 9 years ago | | |

I don't have a good solution to this, but that sounds like it risks the same sort of crypto downgrade vulnerabilities which TLS cipher negotiation enabled.

benhoyt 9 years ago |

I immediately looked at the length of this commit's hash to see if it was longer than 40 hex chars -- but no, it's just an SHA-1. It would have been cool if somehow the hash of this commit that added new hashes was a new hash.

Slightly similar: for a while I've wanted to recreate just enough of git's functionality to commit and push to GitHub. My guess is the commit part would be pretty trivial (as git's object and tree model is so simple) but the push/network/remote part a bunch harder.

gkya 9 years ago |

The commit on git.kernel.org: https://git.kernel.org/pub/scm/git/git.git/commit/?id=e1fae9...

zoren 9 years ago |

Someone please remind me why the hash is not a type definition so the representation would only have to be changed in one place.

ossmaster 9 years ago |

So could be my ignorance of this project in detail, but where are the tests for this?

smileysteve 9 years ago | |

The t/ directory.

https://github.com/git/git/blob/master/t/README

kozak 9 years ago |

Do they anticipate that one day we'll have to move from SHA256 to something else again? It's only a matter of time. Hash functions have a lifecycle. The transition has to be done in a way that will also make the next transition more straightforward.

chmod775 9 years ago | |

Reading even one changed line tells us that they replaced hardcoded char arrays for SHA1 with a generic struct that could be used as a container for any hash.

Some functions that previously operated on those char arrays have been changed to deal with the more generic struct instead.

angry_octet 9 years ago | | |

I consider it unlikely that it will change again, but somehow it is unsatisfying that it doesn't have a hash version, e.g. in the first nibble of the hash. If they had done that we could have avoided the unpleasantness long ago.
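A purely hypothetical sketch of what a version nibble could look like: one leading hex digit selects the algorithm and the rest is the digest. The registry values below are invented for illustration and are not part of Git.

```python
import hashlib

# Hypothetical registry: one hex digit (a nibble) names the algorithm.
VERSIONS = {"1": "sha1", "2": "sha256", "3": "sha3_256"}


def versioned_id(version: str, data: bytes) -> str:
    # Prefix the hex digest with its version digit.
    return version + hashlib.new(VERSIONS[version], data).hexdigest()


def verify(tagged: str, data: bytes) -> bool:
    # Pick the algorithm from the first digit, then recompute and compare.
    version, digest = tagged[:1], tagged[1:]
    algo = VERSIONS.get(version)
    return algo is not None and hashlib.new(algo, data).hexdigest() == digest


old_style = versioned_id("1", b"hello\n")
new_style = versioned_id("2", b"hello\n")
print(old_style, verify(old_style, b"hello\n"))
print(new_style, verify(new_style, b"hello\n"))
```

A reader of such an ID never has to guess the algorithm from the length, at the cost of IDs that no longer match the bare digest.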

anilgulecha 9 years ago | |

A note on 'lifecycle': that's not how it works -- the age of use/lifecycle is not a function of the bit-length in hash, or inevitability of the current standard being broken.

Technically MD5 (128 bits) and SHA1 (160 bits) lengths are sufficient for hashes, but they had cryptographic weaknesses -- the functions had cryptanalytic attacks, which reduced brute force from the complete keyspace to something of a much smaller magnitude. These weaknesses are what has led to the deprecation of MD5 and SHA1.

It is definitely possible that new cryptanalytic attacks could be shown on SHA256/512, but none have so far been publicly demonstrated. Hence the confidence in them.

amluto 9 years ago | | |

> Technically MD5(128bits) and SHA1(160bits) lengths are sufficient for hashes, but they had cryptographic weaknesses

Not true. A 128-bit hash gets collisions after ~2^64 tries. A big cluster can find targeted 128-bit collisions. To attack something like git, the entire attack can be done offline.

The big MD5 X.509 break needed cryptanalysis to make it feasible because the attack needed to happen in real time.
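The birthday bound is easy to see at toy scale. This sketch finds a collision on SHA-256 truncated to 24 bits, which should take on the order of 2^12 tries rather than 2^24. The truncation and the counter-based inputs are assumptions chosen so it runs in a moment; nothing here attacks a real hash.

```python
import hashlib


def find_collision(bits=24):
    """Find two inputs whose SHA-256 digests share their first `bits` bits.

    `bits` must be a multiple of 8 in this simplified version.
    """
    nbytes = bits // 8
    seen = {}  # truncated digest -> first message that produced it
    counter = 0
    while True:
        msg = str(counter).encode()
        tag = hashlib.sha256(msg).digest()[:nbytes]
        if tag in seen:
            # Distinct counters give distinct messages, so this is a
            # genuine collision on the truncated hash.
            return seen[tag], msg, counter + 1
        seen[tag] = msg
        counter += 1


first, second, tries = find_collision(24)
print(first, second, tries)
```

Scaling the same idea up, an n-bit hash gives collisions after roughly 2^(n/2) work, which is where the ~2^64 figure for 128 bits comes from.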

kozak 9 years ago | | |

Yep, I'm not talking about the bit length: 256 bits "should be enough for everybody". But the algorithms that generate those 256 bits will change.

btrask 9 years ago |

This is the chance to get rid of the object prefixes (i.e. "blob" plus file length) that prevent the generated hashes from being compatible with hashes generated by other software.
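For reference, a short sketch of the prefix in question: Git hashes the bytes "blob <length>", a NUL byte, and then the content, so the result never matches a plain SHA-1 of the same file. The function names are just for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    # What `git hash-object` computes for a file: header + NUL + content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    # What sha1sum and most other tools compute for the same bytes.
    return hashlib.sha1(content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(plain_sha1(b""))   # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

The first value is the familiar ID of the empty blob in any Git repository; the second is the standard SHA-1 of empty input.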

koolba 9 years ago |

Since the majority of us are running x64 machines, will the hash be a truncated SHA-512/256 or will it be SHA-256? The former is significantly faster on x64 machines.

joatmon-snoo 9 years ago | |

The RFC is still under discussion (there are a few plans going around) but the strong contender right now is SHA3-256, no truncation.

snakeanus 9 years ago | |

>Since the majority of us are running x64 machines

We don't.

koolba 9 years ago | | |

I didn't say all. I said the majority. If you think I'm wrong, show me a statistic that shows the most common platform for developers using git isn't x86-64.

kazinator 9 years ago |

What problem does this solve? Are collisions common?

krallja 9 years ago | |

Until a few weeks ago, SHA-1 collisions had never been demonstrated.

kazinator 9 years ago | | |

But, in any case, that's in the cryptographic realm.

Git hashes aren't digital signatures for cryptographic authenticity.

pwdisswordfish 9 years ago |

struct object_id was introduced in this commit, in 2015:

https://git.kernel.org/pub/scm/git/git.git/commit/?id=5f7817...

So this change doesn't do much for now. Good to see, though.

bk2204 9 years ago | |

Yes, this is correct. The struct object_id changes don't actually change the hash. What they do, however, is allow us to remove a lot of the hard-coded instances of 20 and 40 (SHA-1 length in bytes and hex, respectively) in the codebase.

The remaining instances of those values become constants or variables (which I'm also doing as part of the series), and it then becomes much easier to add a new hash function, since we've enumerated all the places we need to update (and can do so with a simple sed one-liner).

The biggest impediment to adding a new hash function has been dealing with the hard-coded constants everywhere.
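A loose Python analogue of the idea, for readers who don't want to dig through the C: the lengths live in one descriptor instead of being written as 20 and 40 throughout the code. The names `HashAlgo`, `ObjectId` and `parse_oid` are hypothetical and only mirror the spirit of struct object_id, not its actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashAlgo:
    name: str
    rawsz: int  # digest length in bytes (20 for SHA-1)

    @property
    def hexsz(self) -> int:
        # Hex length is derived, so "40" never needs to appear anywhere.
        return self.rawsz * 2


SHA1 = HashAlgo("sha1", 20)
SHA256 = HashAlgo("sha256", 32)


@dataclass(frozen=True)
class ObjectId:
    algo: HashAlgo
    raw: bytes

    def __post_init__(self):
        if len(self.raw) != self.algo.rawsz:
            raise ValueError(f"expected {self.algo.rawsz} bytes for {self.algo.name}")

    def hex(self) -> str:
        return self.raw.hex()


def parse_oid(algo: HashAlgo, text: str) -> ObjectId:
    # Callers ask the algorithm for the expected length instead of
    # comparing against a literal.
    if len(text) != algo.hexsz:
        raise ValueError(f"expected {algo.hexsz} hex digits for {algo.name}")
    return ObjectId(algo, bytes.fromhex(text))


empty_blob = parse_oid(SHA1, "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
print(empty_blob.algo.name, empty_blob.hex())
```

With this shape, adding another algorithm is one new descriptor rather than a hunt for magic numbers.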

myst 9 years ago | | |

Says something about quality of the codebase.