The current transition plan is being discussed here: https://public-inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a...
kudos to brian m carlson to convince linus to use sha3-256 over sha256. this is really the only sane option we have.
Why?
I don't expect anything horrible, but still curious.
EDIT: After skimming OP I found a few answers.
The message from the The Keccak Team [1] is especially interesting. Summary is that we don't have to worry about performance degradation because of the hash calculation itself. There is a palette of functions which are considered to have a "security level [...] appropriate for your application" and are considerably faster than SHA1.
[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...
I proposed the idea of improved compile-time checking and maintainability, as there wasn't originally much interest in a new hash function, but the maintainability improvements were something people could go for.
I hadn't spent as much time working on it as I am now, so it moved slowly. Other people also helped by converting parts of the code that they were working on (like parts of the refs subsystem).
This might be a non-issue based on how Git stores the tree, but I can imagine one simple model where each directory would be a sort of "collection object", a binary encoding of a list of (filename, hash) pairs in filename order, and therefore the directory gets a hash of its own. But that means that when you're communicating with a SHA-1 repository you don't just need to rename this object; its contents also need to be changed pre-rename, and then you need to store every internal node twice. I'm not seeing that in your summary.
Is it just that Git doesn't have any internal nodes in the directory tree per se because the "filename" is a full POSIX path with subdirs? Or what?
Wouldn't fetching from a sha-1 repository degrade security? I think it would be better to show a warning (similar to how openssh does with 1024 bit dsa keys) every time you try to fetch from a SHA-1 git repo. Same for pushing a signed commit to a sha-1 repository.
Bow that we have SHA-3, we ended up with a gazillion Keccak variants and Keccak-likes. The authors of Keccak have suggested that Git may instead want to consider e.g. SHAKE128. [0]
[0]: https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...
It's a bit unfortunate that this is really a cryptographic choice, and it seems to mostly be made by non-cryptographers. Furthermore, the people making that choice seem to be deeply unhappy about having to make it.
This makes me unhappy, because I wish making cryptographic choices got much easier over time, not harder. While SHA-2 was the most recent SHA, picking the correct hash function was easy: SHA-2. Sure, people built broken constructions (like prefix-MAC or whatever) with SHA-2, but that was just SHA-2 being abused, not SHA-2 being weak.
A lot of those footguns are removed with SHA-3, so I guess safe crypto choices are getting easier to make. On the other hand, the "obvious" choice, being made by aforementioned unhappy maintainers, is slow in a way that probably matters for some use cases. On the other hand, not even the designers think it's an obvious choice, I think most cryptographers don't think it's the best tool we have, and we have a design that we're less sure how to parametrize. There are easy and safe ways to parametrize SHA-3 to e.g. fix flaws like Fossil's artifact confusion -- but BLAKE2b's are faster and more obvious. And it's slow. Somehow, I can't be terribly pleased with that.
Slightly similar: for a while I've wanted to recreate just enough of git's functionality to commit and push to GitHub. My guess is the commit part would be pretty trivial (as git's object and tree model is so simple) but the push/network/remote part a bunch harder.
Some functions that previously operated on those char arrays have been changed to deal with the more generic struct instead.
Technically MD5(128bits) and SHA1(160bits) lengths are sufficient for hashes, but they had cryptographic weaknesses -- the functions had cryptanalytic attacks, which reduced bruteforce from the complete keyspace to something of a much smaller magnitude. These weaknesses are what has lead to the deprecation of MD5 and SHA1.
It is definitely possible that new crypt-analytic attacks could be shown on SHA256/512, but none have so far been publicly provided. Hence the confidence in them.
Not true. A 128-bit hash gets collisions after ~2^64 tries. A big cluster can find targeted 128-bit collisions. To attack something like git, the entire attack can be done offline.
The big MD5 X.509 break needed cryptanalysis to make it day I because the attack needed to happen in real time.
https://git.kernel.org/pub/scm/git/git.git/commit/?id=5f7817...
So this change doesn't do much for now. Good to see, though.
The remaining instances of those values become constants or variables (which I'm also doing as part of the series), and it then becomes much easier to add a new hash function, since we've enumerated all the places we need to update (and can do so with a simple sed one-liner).
The biggest impediment to adding a new hash function has been dealing with the hard-coded constants everywhere.
Also your Git binary, if compiled with only the One True Hash™, wouldn't be able to work with older repos at all because the hashes it's calculating are now different.
(Edit: Another benefit of generalizing this is so that if/when, in the future, the new hash algorithm must be abandoned due to weaknesses, Git tooling will have been already introduced to the notion that hashes can be different and should hopefully be a less involved migration the next time around)
In my experience, generalizing ahead of need more often than not causes problems, and I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
SHA-2 and RIPEMD.
> And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
That's the problem: the software industry is still suffering from MD5 getting cracked [0]! Cryptographic agility is a baseline requirement for security primitives.
> In my experience, generalizing ahead of need more often than not causes problems
I agree and Linus has valid complaints about security recommendations during the 25-year history of Linux: most of the security recommendations kill performance and are only partial fixes, so why bother?
But Linus is also engaging in premature optimization: computers are ~30 billion times faster than when he first starting programming Linux. Yes, SHA-2 is relatively slow, they could have at least not hardcoded SHA-1 into the codebase and protocol.
> I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
You clearly haven't done any safety related engineering. That's the thing about cryptography: millions of dollars and human lives are at stake. Despite the smartest people in the world working on these problems, cryptographic primitives/protocols are regularly broken. Due to Quantum computing, every common cryptographic primitive we use today will need to be replaced or upgraded at some point.
Thankfully, you don't need to worry about the engineering of a given cryptographic primitive as long as you can swap it out with a new one. But when you hardcode a specific hash function and length into your protocol/codebase you are now assuming the role of a cryptographer.
SHA-2.
> And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
Well, the real question is why someone picked SHA-1 over SHA-2 in 2005 when attacks that reduced its strength were already being demonstrated.
I still recall freshly the hoopla over Bitkeeper licensing that lead to Torvalds creating Git.
To derisively say "remind me why not X" at a diff that does X ... I am amused.
It is often hard to generalize when N=1. Now that the N=1 use case is established and we are moving towards N=2, it is painfully obvious to all that a better abstraction is needed.
Typedef or no, we would still need a full audit of the code to find spots where people "inlined" the expansion.
IMO, Linus should have done better here -- no crypto hash lasts forever, but this code is far cleaner than useless layers of abstraction.
(Hint: that's why GPG signing commits is an option.)
It'd be like disabling TLS 1.0 and 1.1 on your server; a repo owner could just choose to do that. I guess the point stands if, like TLS downgrades, on the whole people don't specifically choose to do it and there are lots of vulnerable repos out there for a long time. Then it falls on GitHub/etc. to force repos to migrate fully.
Design deficiency
(This is unrelated to the choice of hash.)
Fossil stores blobs as-is. A file containing "hello world" will be stored as "hello world" and referenced as HASH("hello world").
Commits are stored as plain-text manifests, which are also referenced as HASH(manifest_contents). To distinguish between different types of artifacts (commit, file, wiki page, etc.), Fossil checks the contents of the blob.
See https://www.fossil-scm.org/index.html/doc/trunk/www/fileform... for detailed description.
This made possible the following attack:
* Clone repository.
* Modify some files, commit.
* Deconstruct repository.
* Attach the deconstructed artifacts with changes to a ticket in the original repository or to a wiki.
By doing this, you could make commits to the target directory by attaching files to tickets or wiki, and these commits were only visible to people who cloned the repo until rebuilding (then they would be visible to everyone).
This attack was prevented by compressing every attached file with gzip, making it impossible to attach a file that would be recognized as a commit, because gzip adds its own header.
I think this design is deficient: instead, each blob should have a type indicator — that is, file artifacts should have some prefix. This is how Git works: each object has a prefix indicating its type. Also, Plain 9 had a filesystem called... also Fossil! — which was based upon Venti content-addressable storage, which stored typed blobs.
Unfortunately, changing this will break compatibility, and since Fossil artifact format was built to last for ages, I don't think it will be changed.
SHA-1 claims
What made me rant about Fossil after congratulating them on switching to SHA3-256 is that they made false claims regarding their use of SHA-1 in the same documentation which shows these clams are false:
Quoting https://www.fossil-scm.org/index.html/doc/trunk/www/hashpoli...:
The SHA1 hash algorithm is used only to create names for artifacts in Fossil (and in Git, Mercurial, and Monotone). It is not used for security. Nevertheless, when the Shattered attack found two different PDF files with the same SHA1 hash, many users learned that "SHA1 is broken". They see that Fossil (and Git, Mercurial, and Monotone) use SHA1 and they therefore conclude that "Fossil is broken". This is not true, but it is a public relations problem. So the decision was made to migrate Fossil away from SHA1.
If you search the docs, you discover that they use SHA-1 for security:
* To store passwords (https://www.fossil-scm.org/index.html/doc/trunk/www/password...)
* In the client-server authentication protocol in an adhoc MAC construction (https://www.fossil-scm.org/index.html/doc/trunk/www/sync.wik...)
Speaking of passwords, the automatically generated passwords are too short: I just created a repo with Fossil v2.1 and got "efc6f5" as initial password. It's 6 hex characters, or just 3 bytes — trivial to crack.
Finally, I as I said, I really like Fossil even though I don't use it anymore for open source projects (I still use it for some private projects) and have a great respect to its author and other contributors. But in my opinion, it needs at least a fundamental but simple change in the storage format to introduce object types.
If something is unclear or you have questions, I'm happy to answer.
(My concern was that the twitter basically contained nothing of any content: no technical details, no link to a blog post, nothing which can be checked or verified... which makes it indistinguishable from insinuation; and so I have to dismiss what you said out of hand.)
(Regarding provocative language: I don't think publicly calling someone out as a liar, in so many words, particularly on a medium like Twitter which doesn't really allow for an effective response, is particularly effective in producing a useful result...)
Anyway:
Re the manifest issue: to paraphrase, to check I'm understanding you correctly: because each manifest refers to its predecessor, and not vice versa, adding any blob which looks like a manifest implicitly adds that manifest to the tree. Normally Fossil trusts authenticated users to add blobs to the tree, because they're authenticated, but ticket attachments can be added by anyone, which effectively means that you can bypass the authentication and commits can be done by anyone. Is that correct?
In which case, yeah, I agree; that's very bad. I can't spot any holes in your reasoning. It is possible to positively identify attachments by looking at their parent manifest, as each one should be pointed at by an A record, so I suppose you could disallow manifests if they're referenced like this, but my gut tells me that's going to be horribly fragile... you're right, adding a type prefix is obviously the right way to go.
If you create a manifest and check it in as a normal file, so it's referenced by an F record, is it still treated as a manifest? If not, could this machinery be extended to attachments as well?
You did bring this up on the mailing list, right?
Users with commit privileges are granted more trust and do have the ability to forge manifests. But as before, there is an audit trail and rogue manifests (and the users that insert them) can be detected and dealt with after the fact.
Structural artifacts have a very specific and pedantic format. You can forge a structural artifact, but you will never generate one by accident during normal software development activities.
The git tag and signing verify logic assume the sha-1 hashes for integrity.
For instance, a mere four byte CRC-32 can reasonably assure integrity in some situations, like when used on sufficiently small payload frames; yet it is not useful as a digest for certifying authenticity.
SHA-1 is suitable for integrity.
> Yes, this is correct. The struct object_id changes don't actually change the hash. What they do, however, is allow us to remove a lot of the hard-coded instances of 20 and 40 (SHA-1 length in bytes and hex, respectively) in the codebase.
My prior comment was explaining why jffry's complaint is nonsensical (a typedef does not prevent moving from a single hash model to a multiple simultaneous hash model).
> You want to have a model that basically reads old data, but that very aggressively approaches "new data only" in order to avoid the situation where you have basically the exact same tree state, just _represented_ differently.
> That way everything "converges" towards the new format: the only way you can stay on the old format is if you only have old-format objects, and once you have a new-format object all your objects are going to be new format - except for the history.
As soon as there is one new-hash commit in a repo, all users of it will have to upgrade their git client - and that git client will (probably?) default to writing new-hash commits.
$ sudo $PKG_MGR upgrade gitI suspect a lot of the tools you mentioned also already treat hashes as strings, not as 160 bit numeric types. The entire front-end JS for GitHub, for example, just uses strings. That's what I'd do if I were writing IDE integrations and such too.
Secondly, the new format will likely still be a 160-bit numeric type, just calculated using a different hash algorithm (e.g. it might be the first 160 bits of the SHA256 result). The tools you mentioned likely don't have to calculate said hashes, they just display them. The entire GitHub front-end, for instance, just displays whatever is given to it; commit hashes are input data to it, not output data.
md5 was first "broken" in 1995. As of 10 years ago, a collision attack took only a few seconds. Plus, there are a _number_ of other attacks on the hash.
64 bits of security is good enough against most non-nation state actors.
Obviously, MD5 (and sha-1) aren't anywhere near perfect hashes. And obviously, you need to look at more than length when judging a hash.
Basically my point was that md5's hash length isn't a big problem.
And he was wrong as openpgp signatures on commits and tags are a thing.
Not sure when that feature was introduced however, I doubt that it existed in the first version of git. That being said he should have changed the hash function the moment that feature was introduced.
I agree that this isn't likely to happen by accident, but Fossil servers are usually public facing, so we have to worry about malice as well.
My point was that the choice that was made was considered good enough for the purposes for which it was intended. In the context of the OP's comment, criticizing git for not making different code design choices doesn't mean that Linus was wrong, it may mean that the OP doesn't know and/or understand all the considerations Linus had. And Linus has said many times that the security of the hash is not the primary consideration in his design.
Git's choice of SHA-1 was not at the time predicated on having the single most unbreakable hash in existence, the hash's use is not for security purposes, and to talk with such incredulity about Linus' choice may be to misunderstand git's design requirements.
There are no practical pre-image attacks for either of them yet. (2^102 for MD4, 2^123 for MD5)
I've run into performance problems with things like MathJaX, which includes thousands (or tens of thousands?) of files as a backup method for rendering equations. (I understand each file has a single character in some typeface.)
64 bits of ideal security is about half the industry accepted security strength in bits for a hash function.
A secure design is essential for trusting this functionality. My trust in Git has always been tempered by the weakness of SHA1.
A GPG signature is no stronger than its object ref.
Have you seen how many frameworks believe "auto-pull and compile deps by hash from github" is reasonable? They are assuming this isn't a massive attack vector. They are trying to build on a core feature that Git claims to have.
Recent events moved this from probably foolish to provably so.
The comments are brought up usually to explain why Linus didn't think much of it at the time, whereas they actually demonstrate the shift of thinking around what Git is meant to provide. Security is definitely a goal now, and the hash function is the critical piece of security infrastructure.
One can check what is used with e.g.
$ git cat-file -p $some_tag | gpg --list-packets | grep "digest algo"
The output is of the form digest algo n, begin of digest xx yy
Where n can be: 1: MD5
2: SHA1
8: SHA256
10: SHA512
(See RFC 4880, 9.4 for all values)I don't think it changes anything though, because of git's integrity. Stop me if I'm getting this wrong but, if you wanted to attack a signed git commit through the gpg signature's hash, you would have to modify the commit object itself... which yields a different commit hash in order to be valid. You'd have to get absurdly lucky to have a signature collision that contains a (valid) commit hash collision.
The rest of the code is sent through mailing lists as patches, so the hash is irrelevant.
SHA1 here protects against "random" corruption (which is more than some VCS do), but not an attacker. At no point one is able to send trusted contributors bad commit objects.
Now, the use people have of git is very different from the kernel (or git) style, so their threat model is different, and SHA1 may become a security function.
(Of course, CLI and frontend tools could still truncate display output to 40 hex characters, but internally full size hashes will be used.)
I'm guessing it's part lack of skill in design, part bad software development tools (uEMACS and makefiles or something), and part just being against c++ et al.
> Of course, I'd also suggest that whoever was the genius who thought it was a good idea to read things ONE FCKING BYTE AT A TIME with system calls for each byte should be retroactively aborted. Who the fck does idiotic things like that? How did they noty die as babies, considering that they were likely too stupid to find a tit to suck on?
He deserves to eat this shit sandwich.
> I wonder how many non-cryptographers knew about SHA-2 back in 2003-2004.
Any systems engineer should have known about SHA-2. SHA-1 only provides 80-bits of security, so everyone else assumed that it would need to be replaced.
I agree that he should've used SHA-2 or better yet, have made the hash algorithm modular, but what does your quote add to the discussion?
Not much, thanks for the gentle reminder :)
object $sha1
type commit
tag $name
tagger $user $timestamp $tz
$text
If you wanted to attack a signed git commit through the gpg signature's hash, you would have to do a second preimage attack on that text with a different commit sha1.OTOH, if you wanted to attack a signed git commit through the git commit sha1, you would have to do a second preimage attack on that commit text, which is of the form:
commit $length\0
tree $sha1
parent $parent_sha1
author $author $author_timestamp $author_tz
committer $committer $committer_timestamp $committer_tz
$text
See where I'm going? it's the same kind of attack.Another way to attack it would be to do a second pre-image attack on the pointed tree, which is harder because there is not really free-form text available in a tree object.
Yet another way to attack it would be to do a second pre-image attack on one of the blobs pointed to by a tree, where the format is of the form:
blob $length\0$content
I don't think this is significantly easier than any of the second pre-image attacks mentioned above.So, in fact, in any case, to attack a gpg signed git tag, you need a second pre-image attack on the hash. If git uses something better than SHA-1, but GPG still uses SHA-1, the weakest link becomes, ironically, GPG.
That being said, second pre-image attacks are pretty much impractical for most hashes at the moment, even older ones that have been deemed broken for many years (like MD5 or even MD4 (TTBOMK)).
That is, even if git were using MD4, you couldn't replace an existing commit, tree or blob with something that has the same MD4.
Edit:
In fact, here's a challenge:
Let's assume that git can use any kind of hash instead of SHA1. Let's assume I have a repository with a single commit with a single tree that contains a single source file.
The source file is:
$ cat hackme.c
#include <stdio.h>
int main() {
printf("Hack me, world!\n");
return 0;
}
So that we all talk about the same thing, here is the raw sha1 for this source: $ sha1sum hackme.c
cffc02c09faf2e9a83ecbb976e1304759868cf1c hackme.c
And its git SHA1: $ git hash-object hackme.c
36134c8c8e9fdf705441dcc1f71736064afc7c44
Here is how you can create this SHA1 without git: $ (echo -e -n blob $(stat -c %s hackme.c)\\x0; cat hackme.c) | sha1sum
36134c8c8e9fdf705441dcc1f71736064afc7c44 -
or $ (echo -e -n blob $(stat -c %s hackme.c)\\x0; cat hackme.c) | openssl sha1
(stdin)= 36134c8c8e9fdf705441dcc1f71736064afc7c44
And for git variants that would be using MD5: $ (echo -e -n blob $(stat -c %s hackme.c)\\x0; cat hackme.c) | openssl md5
(stdin)= 1b56dbc6613ff340b324ca973aec67f9
Or MD4: $ (echo -e -n blob $(stat -c %s hackme.c)\\x0; cat hackme.c) | openssl md4
(stdin)= 0eaabfc1a32629dce98c476f591c3f60
The challenge is this: attack the hypothetical repository using the hash of your choosing[1] ; replace that source with something that is valid C because people using the content of the repository will be compiling the source. Obviously, you'll need the hash to match for "blob $length\0$content" where $length is the length of $content, in bytes, and $content is your replacement C source code.1. let's say, pick any from the list on http://valerieaurora.org/hash.html
I posit you'll spend a lot of time and resources (and money) on the problem, (exponentially more so than Google did with SHAttered) except for Snefru.
But for the commit it's different, because the $text in your example affects the hash of the commit itself. And my understanding is that if you sign the commit, you're signing both the contents and the hash of the content. Am I incorrect?