Git archive checksums may change (github.blog)
Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?
files uploaded to GH Packages are not modified by GitHub.
only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.
if you upload a package to GH Packages or upload a release asset to a GitHub release, those are never modified, and you can rely on those hashes.
GitHub chooses to do this. It's GitHub's choice to generate Source Code files on demand rather than when the release is made. It's a way of reducing their disk usage at the cost of this kind of potential problem.
The problem is they also presented it as if it was a stable reference. If people knew it was not stable they would have done what the Bazel devs are now talking about doing, which is also uploading the source code at release time, as an artifact (which is how it works on Nexus).
Keep it simple, just vendor your deps.
GitHub has pretty much a one-click (or one API call) workflow to create properly versioned and archived tarballs. Just because lots of people try to skirt proper version management doesn't mean you should commit the world into your repo.
How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.
https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...
how? the docs state that the hashes of these files are not guaranteed to be stable.
the decision to generate those files on demand is a good one, provided that the behavior is documented, and it is.
others in this thread figured it out before this particular issue arose and made the necessary changes to their workflows so that their downloads would have stable hashes.
1. You work in a company, you are in a team, you want some reasonable code review process in place. Now you want to check in a 3rd party dependency, "let's vendor it!" so you send out a PR with ... 10,000 - 100,000 lines of code. Your reviewer has no reasonable way to know if a) the dependency was downloaded from a reputable source, b) the code was not modified maliciously, or c) there was some local patch / local change either voluntarily or accidentally added (maybe you tried running configure/make locally, and didn't realize that one .h file was generated on your machine). A diligent reviewer would have to re-download the source tarball from a reputable source (is the url in the commit message? A README? better hope there is!), unpack it locally, generate the set of files and all hashes, and compare with your PR. And ensure that the PR / vendored dependency comes with a README or METADATA file so the download URL and LICENSE are recorded for posterity.
2. Now you need to update the dependency. Either it's a new directory (so you vendor both versions), or you have to delete all files that are gone. The PR review will be worse, as it will show a diff, except the reviewer won't review it, except to repeat the steps in 1. And that is without considering patches applied in the meantime, as the code was simply checked into the repository, and anyone could easily change it.
3. For anything but small/tiny projects, the vendoring will take up most of the download / checkout time of your repository.
If you use git for vendoring, the problem is not significantly better: if you care about the integrity of the vendored code, you need to verify the final tree, or the log / hash / set of commits.
Compare to using a simple file with a 1) url, 2) secure hash, 3) list of patches to apply. Reviewing and ensuring correctness is trivial, upgrading is trivial, PRs are trivial.
To avoid problems like the github problem here, a simple proxy or local cache is enough, a tool that takes the hash (or a hash of a url) and reads it from disk, is good enough. And detects corruption.
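A minimal sketch of the kind of local cache described above, assuming entries are stored on disk under their SHA-256 and re-hashed on every read. The `HashCache` class and its method names are hypothetical, not taken from any existing tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class HashCache:
    """Store blobs on disk under their SHA-256 and re-verify them on every read."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes, expected_sha256: str) -> Path:
        # Refuse to cache anything that does not match the pinned hash.
        actual = sha256_hex(data)
        if actual != expected_sha256:
            raise ValueError(f"hash mismatch: wanted {expected_sha256}, got {actual}")
        path = self.root / expected_sha256
        path.write_bytes(data)
        return path

    def get(self, expected_sha256: str) -> bytes:
        # Re-hash on read so on-disk corruption is detected, not silently served.
        data = (self.root / expected_sha256).read_bytes()
        if sha256_hex(data) != expected_sha256:
            raise ValueError(f"cache entry {expected_sha256} is corrupted")
        return data


# Usage: the payload stands in for a tarball fetched once from upstream.
cache = HashCache(Path(tempfile.mkdtemp()))
payload = b"pretend this is a release tarball"
digest = sha256_hex(payload)
cache.put(payload, digest)
restored = cache.get(digest)
```

Once the bytes are in such a cache, later builds no longer depend on the upstream server regenerating an identical archive.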
2. Same as in 1.
3. It’s not an issue in my experience. A much bigger issue is large JSON files captured for snapshot testing or just big binary files. If your repo is so small that its deps are its majority, then it really shouldn’t take all that much time (or you use too many/too big deps, but I doubt you can beat Chromium which has Skia-sized deps).
> Compare to using a simple file with a 1) url, 2) secure hash, 3) list of patches to apply. Reviewing and ensuring correctness is trivial, upgrading is trivial, PRs are trivial.
Using a url doesn’t remove the need for reviewing the code of your dependencies. If you don’t, you’re essentially running “curl | sh” at scale. Checksums without code review don’t mean much.
Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...
I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.
GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.
#15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
#0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
#0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
#0 8.164 npm ERR! code EINTEGRITY
#0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
#0 8.176
#0 8.177 npm ERR! A complete log of this run can be found in:
#0 8.177 npm ERR! /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
#15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1
This was working earlier today and the docker build/package.json haven't changed.
```
Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
-- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
[DEBUG] To include the environment variables in debug output, pass --debug-env
[DEBUG] Feature flag 'binarycaching' unset
[DEBUG] Feature flag 'manifests' = off
[DEBUG] Feature flag 'compilertracking' unset
[DEBUG] Feature flag 'registries' unset
[DEBUG] Feature flag 'versions' unset
[DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
[DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
Error: Failed to download from mirror set:
File does not have the expected hash:
url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]
```
The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula
What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...
However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
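To make that concrete, here is a small sketch using Python's standard `gzip` module. The sample content is made up, and the two compression levels merely stand in for "a different compressor implementation"; this is not the change git actually made.

```python
import gzip
import hashlib

# Made-up payload standing in for the tar stream of a source tree.
content = b"".join(b"line %d of an example source file\n" % i for i in range(2000))

# Same input, two different compressor settings (mtime=0 keeps the header fixed).
fast = gzip.compress(content, compresslevel=1, mtime=0)
best = gzip.compress(content, compresslevel=9, mtime=0)

same_archive = hashlib.sha256(fast).digest() == hashlib.sha256(best).digest()
same_content = gzip.decompress(fast) == gzip.decompress(best) == content

print("archive checksums match:", same_archive)
print("decompressed content matches:", same_content)
```

The archive checksums differ while the decompressed bytes are identical, which is the situation downstream checksum pins ran into.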
They are probably generated on-demand (and cached) from the Git repository, not prebuilt.
[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.
For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.
GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.
Oh wait, what hash? :clown:
> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
It's crazy how such a seemingly innocuous change, like this, could lead to such widespread loss in productivity across the globe.
The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.
But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...
I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.
https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...
> uses 2% less CPU time. That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
> What are the benefits? Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.
It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but should every user who runs the command locally lose 17% more time?
Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.
[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
https://bugzilla.tianocore.org/show_bug.cgi?id=3099
At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit), I guess they must have stabilized it at some point. Here is an old issue before GitHub had a community issue tracker:
https://github.com/isaacs/github/issues/1483
I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.
[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...
==> Validating source files with b2sums...
labwc-0.6.1.tar.gz ... FAILED
==> ERROR: One or more files did not pass the validity check!
Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?
"didn't read every commit in new version of git, realized after the fact"
[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...
[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...
[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0
On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.
It's the downstream tooling (i.e. all the builds and package managers) that needs to clean its act up.
POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as spdx https://tools.spdx.org/app/about/
Where I work is also required to start documenting our supply chain as part of the (new, replacing PCI-DSS) PCI-SFF certification requirements, which requires end-to-end verification of artifacts that are deployed within PCI scope.
So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.
However, I do think it's a bad idea to enforce the content of compressed archives to be deterministic. tar has never specified an ordering of its contents. Compression algorithms are parameterized for time and space, so their output should not be deterministic either. Both of these principles apply to zip as well. But we now have a situation where we are depending on both the archive format and the compression algorithm to produce a deterministic output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.
Tar/zipball archives on the same ref never have a stable hash.
Forever problem 1:
No sha256/512/3 hashes of said tar/zipballs.
Forever problem 2:
No metalinks for those.
Forever problem 3:
Not IPv6. Some of our network is IPv6 only.
Forever problem 4:
Hitting secondary rate limiting because I can browse fast.
You can try it online here:
and relies on checksumming ephemeral artefacts for integrity.
GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.
You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.
Microsoft was once renowned for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.
I know that the Bazel team reached out to GitHub in the past to get a confirmation that this behaviour could be relied on, and only after that was confirmed did they set that as recommendation across their ecosystem.
The hash that pops out of 'git archive' has nothing whatsoever to do with the commit hash and was historically stable more or less by accident: git feeds all files to 'tar' in tree order (which is fixed) and (unless you specify otherwise) always uses gzip with the same options. Since they no longer use gzip but an internal call to zlib, compression output will look different but will still contain the same tar inside.
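A rough illustration of that point, assuming a tar built in a fixed order with fixed metadata (this uses Python's `tarfile`, not git's own archive writer, and the file names are invented): the compressed bytes change with the compressor settings, while a digest of the inner tar stream does not.

```python
import gzip
import hashlib
import io
import tarfile


def build_tar(files: dict) -> bytes:
    """Write files in sorted order with fixed metadata so the tar bytes are repeatable."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0  # no timestamps leaking into the archive
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def inner_tar_sha256(tgz: bytes) -> str:
    """Hash the decompressed tar stream rather than the compressed file."""
    return hashlib.sha256(gzip.decompress(tgz)).hexdigest()


# Hypothetical two-file source tree.
files = {"src/main.c": b"int main(void) { return 0; }\n", "README": b"example\n"}
tar_bytes = build_tar(files)

archive_a = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
archive_b = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

print("compressed files identical:", archive_a == archive_b)
print("inner tar identical:", inner_tar_sha256(archive_a) == inner_tar_sha256(archive_b))
```

Hashing the inner stream means decompressing before verifying, which is the trade-off raised elsewhere in the thread.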
That people have relied on this archive hash being stable is an indication of a major problem imho, because it might mean that people in their heads project integrity guarantees from the commit hash (which has such guarantees) onto the archive hash (which doesn't have those guarantees). I would suggest randomizing the archive hash on purpose by introducing randomness somewhere, so that people no longer rely on it.
The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.
There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.
the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.
In the real world it will take millions of dollars of eng labor just to update the hashes to fix everything that's currently broken and millions more to actually implement something better and move everyone over to it.
This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.
I think everyone knows these files are generated on the fly, but it comes from old habits.
Did cache hits save you? Did cache misses break your builds?
looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).
Looks like the author is the maintainer of "Git for Windows", and similar, which I can imagine makes for a reasonable argument for reducing dependencies. zlib is already a library dependency, just use that instead of needing people to bundle up a gzip binary along with git, too.
https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....
Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?
That's without even mentioning the absurdity of saving 2% CPU but still using zlib.
"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"
Not to mention, forcing people to use GitHub releases instead of just tags (which excludes every mirror of somewhere else)
- you use autoconf (or any other tool(s) that require generating code into the source archive); or
- you have submodules (to which `git archive` is completely blind).
Note that `git-archive-all`[1] can help as long as your submodules don't do things like `[attr]custom-attr` in their `.gitattributes` as it is only allowed in the top-level `.gitattributes` file and cannot be added to the tree otherwise.
Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of
sha256sum $(find . -type f | sort) | sha256sum
I am not here advocating that everyone switch to this basic directory hash either, because it's not a solution to the more general problem that many systems are solving, namely validating _any_ downloaded file, not just file archives.
There are widespread, standard tools to run a SHA256 over a downloaded file, and those tools work on _any_ downloaded file. Essentially every programming language ships with or has easily accessible libraries to do the same. In contrast, there are not widespread, standard tools or libraries for the "NAR Hash" nor the Go "directory hash". Even if there were, such tools would need to be able to parse every kind of file that people might be downloading as part of a build, not just tar files.
It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.
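For readers who want to see the shape of such a directory hash, here is a sketch that mirrors the shell pipeline quoted above: hash each file, list the digests in sorted path order, then hash the listing. It is an approximation for illustration only; it is not guaranteed to produce the same digest as the shell command or as Go's own implementation, and the `directory_hash` and `make_tree` names are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_hash(root: Path) -> str:
    """Hash a sorted 'digest  path' listing of every regular file under root."""
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    lines = []
    for rel in rel_paths:
        digest = hashlib.sha256((root / rel).read_bytes()).hexdigest()
        lines.append(f"{digest}  ./{rel}\n")
    return hashlib.sha256("".join(lines).encode()).hexdigest()


def make_tree(files: list) -> Path:
    """Create a temporary directory containing the given (name, bytes) pairs."""
    root = Path(tempfile.mkdtemp())
    for name, data in files:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return root


# Two trees with the same content, written in a different order.
tree_a = make_tree([("a.txt", b"alpha"), ("sub/b.txt", b"beta")])
tree_b = make_tree([("sub/b.txt", b"beta"), ("a.txt", b"alpha")])

print(directory_hash(tree_a) == directory_hash(tree_b))
```

The result depends only on file paths and contents, so it is unaffected by how the files were packed or compressed in transit.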
If you adopt Nix fully, the .narinfo file served by cache.nixos.org (a Nix substituter) is signed and contains both the NAR hash and the hash of the NAR archive file. Additionally, NAR packs and unpacks deterministically, and you can read the implementation in the Nix thesis.
A .narinfo file looks like this:
```
StorePath: /nix/store/xvp2wr01fi27j0ycxqmdg6q4frsiv82s-libnotify-0.8.1
URL: nar/0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9.nar.xz
Compression: xz
FileHash: sha256:0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9
FileSize: 24324
NarHash: sha256:02bh3qjxgph5g9di3q553k87w4kbc4drmflkfz9knqbp9jip98c5
NarSize: 101776
References: 7ncncvnr864iangwbvbgbanx1r6wpf79-gdk-pixbuf-2.42.10 i4dqcpppyyq5yqcvw95mv5s11yfyy8pf-glib-2.74.3 xvp2wr01fi27j0ycxqmdg6q4frsiv82s-libnotify-0.8.1 yzjgl0h6a3qh1mby405428f16xww37h0-glibc-2.35-224
Deriver: 2vjs6q5j5vqckcwsvmh5lajvx3p7arkj-libnotify-0.8.1.drv
Sig: cache.nixos.org-1:IqCAJROaqNx4TthRv9V47/dM7KP4sR+bBWBfL+9xSqQHAezcfczYdJhKj8nl5l+iFnj8O4uTIJMWNOcwVq8+AA==
```
```
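As a rough sketch of how the two hashes sit side by side, the snippet below parses a few of the fields shown above as plain `Key: value` lines. The `parse_narinfo` helper is hypothetical and does no signature checking; it only separates the hash of the compressed file (`FileHash`) from the hash of the unpacked NAR (`NarHash`).

```python
def parse_narinfo(text: str) -> dict:
    """Split 'Key: value' lines into a dict; References becomes a list."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(": ")
        fields[key] = value
    fields["References"] = fields.get("References", "").split()
    return fields


# A subset of the fields from the example above.
sample = """\
StorePath: /nix/store/xvp2wr01fi27j0ycxqmdg6q4frsiv82s-libnotify-0.8.1
URL: nar/0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9.nar.xz
Compression: xz
FileHash: sha256:0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9
FileSize: 24324
NarHash: sha256:02bh3qjxgph5g9di3q553k87w4kbc4drmflkfz9knqbp9jip98c5
NarSize: 101776
"""

info = parse_narinfo(sample)
print("compressed file:", info["FileHash"], info["FileSize"], "bytes")
print("unpacked NAR:   ", info["NarHash"], info["NarSize"], "bytes")
```

Because both hashes are recorded, a change of compression would alter `FileHash` without touching `NarHash`.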
The case where Nix is not adopted fully is the one I have in mind.
Unfortunately for this kind of service you need to actively fiddle with the bytes to prevent people from relying on an implementation detail like this and prevent them from digging you into a too big to fail api stability hole.
Isn’t that the only humane course given all that depends on this?
This accurately describes my beef with golang
That people use it comes from how releases were usually published (independent of any version control system) as tgz/zip archives on some project website or ftp server. Websites and ftp servers were often mirrored to e.g. ISP or university mirrors because bandwidth was scarce and CDNs were expensive/absent. To make sure that your release download from the university of somestrangeplace ftp matches the official release, you would compare the archive hash from the official project website with the hash of the archive you downloaded (bonus points for a GPG signature on the archive).
This then got automated by build/install/package tools to check the package downloaded from some mirror against the hash from the package description. Then GitHub happened, where GitHub replaced the mirror servers, serving autogenerated 'git archive' output instead of static files. And that's where things went wrong here...
the number of times the Microsoft-haters are just straight-up factually wrong in their justifications for their complaints is way too high for me to trust them ever again in my life.
I'd imagine the motivation for this change in particular is multiplatform use; not every platform has gzip in its path.
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
This case is different as breakage probably affected github/microsoft themselves
If I have a box with an apple in it, I don't care about the box, I care about the apple inside. If it's not an apple, I don't want to eat it.
For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
Well, the norm has been that maintainers generated and distributed a source archive, and that archive being immutable. That workflow is still perfectly fine with GitHub and not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
I always thought zip archives from this feature were generated on the fly, maybe cached, because I don't expect GitHub to store a zip archive for every commit of every repository.
I'm actually surprised many important projects are relying on a stable output from this feature, and that this output was actually stable.
That sounds like prejudice. Just as a test, I cloned the git repo, which took 29 seconds, then took its hash with `guix hash`, which took 0.387s.
I think that if you can't handle a 0.4s delay in a build, you have bigger problems.
A) How do you catch tarballs that have extra files injected that aren't part of your manifest
B) What does the performance of this look like? Certainly for traditional HDDs this is going to kill performance, but even for SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.
B would just be a normal git checkout, which already validates that all the objects are reachable and git tags (and commits for that matter) can be signed, and since the sha1 hash is signed as well it validates that the entire tree of commits has not been tampered with. So as long you trust git to not lie about what it is writing to disk, you have a valid checkout of that tag.
And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?
The other method would be having Manifest file with checksum of every file inside the tar and compare that in-flight, could be simple "read from tar, compare to hash, write to disk" (with maybe some tmpfiles for the bigger ones)
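A sketch of that manifest idea, which also covers question A (extra files injected into the tarball). It assumes the manifest is a mapping from member name to SHA-256 hex digest and, for simplicity, reads the archive from memory instead of a network stream. The function and file names are made up.

```python
import hashlib
import io
import tarfile


def verify_tar_against_manifest(tar_bytes: bytes, manifest: dict) -> None:
    """Raise ValueError if a member is missing, unexpected, or has the wrong SHA-256."""
    seen = set()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isdir():
                continue
            if not member.isfile():
                # Symlinks, devices, etc. are rejected rather than silently skipped.
                raise ValueError(f"unsupported member type: {member.name}")
            if member.name not in manifest:
                raise ValueError(f"unexpected file in archive: {member.name}")
            data = tar.extractfile(member).read()
            if hashlib.sha256(data).hexdigest() != manifest[member.name]:
                raise ValueError(f"checksum mismatch: {member.name}")
            seen.add(member.name)
    missing = set(manifest) - seen
    if missing:
        raise ValueError(f"files missing from archive: {sorted(missing)}")


def make_tar(files: dict) -> bytes:
    """Build an uncompressed tar in memory (test helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


files = {"README": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n"}
manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
verify_tar_against_manifest(make_tar(files), manifest)  # raises nothing

injected = {**files, "src/backdoor.c": b"/* not in the manifest */\n"}
try:
    verify_tar_against_manifest(make_tar(injected), manifest)
except ValueError as err:
    print(err)
```

Note that this still parses the tar before anything has been verified, so the manifest itself has to come from a trusted, separately checked source.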
Firstly SHA is not a secure hash.
Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
What am I missing?
It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and derivatives were absolutely intended to provide secure collision resistance).
I think you're mixing things up here. Github didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) "git archive" output. So your tarballs have the same unpacked data but different hashes under all algorithms, secure or not.
> Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.
Indeed. So you take and record a SHA-256 of the archive file you are tagging such that no one can feasibly do that!
Again, what's happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It's got nothing to do with security or cryptography.
[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.
This is incorrect, but even if it were true, you could use whatever your hash of choice is instead. Gentoo for example can use whatever hash you like, such as blake2, and the default Gentoo repo captures both the sha512 and blake2 digests in the manifest.
Sha1 is still used for security purposes anyways, even though it really shouldn't be!
Signing git commits still relies on sha1 for security purposes, which I think many people don't realize.
Commit signing only signs the commit object itself; other objects such as the trees, blobs and tags are not involved directly in the signature. The commit object contains sha1 hashes of its parents and of a root tree. Since trees contain hashes of all of their items, it creates a recursive chain of hashes of the entire contents of the repo at that point in time!
So signed commits rely entirely on the security of sha1 for now!
You may already have known all of this about git signing, but I thought it might be interesting to mention.
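The hash chain described above can be sketched with `hashlib` alone. This is a simplified model: it assumes a single-level tree of regular files and a made-up author identity, and it only shows how a change to one blob propagates up to the commit ID that a signature would cover.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>\\0' followed by the object body, as git does."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """Serialize {name: blob_id} as a flat tree of regular (100644) files."""
    out = b""
    for name in sorted(entries):
        out += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return out


def commit_body(tree_id: str, message: str) -> bytes:
    who = "Example Author <author@example.com> 0 +0000"  # hypothetical identity
    return f"tree {tree_id}\nauthor {who}\ncommitter {who}\n\n{message}\n".encode()


def commit_id_for(file_content: bytes) -> str:
    """blob -> tree -> commit: each level embeds the hash of the level below."""
    blob_id = git_object_id("blob", file_content)
    tree_id = git_object_id("tree", tree_body({"README": blob_id}))
    return git_object_id("commit", commit_body(tree_id, "initial commit"))


original = commit_id_for(b"hello\n")
tampered = commit_id_for(b"hello!\n")
print(original)
print(tampered)
```

The two commit IDs differ even though only a blob changed, so the signature on a commit indirectly covers the whole tree, for as long as SHA-1 collisions stay out of reach.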
2) The checksum assures you that the file you have is the same your upstream looked at
2) If I and the upstream are both looking at a file that was generated by Github then the Sha may match, but that doesn't prove we weren't both owned by Github.
Perhaps what I am missing is that this isn't part of a reproducible build scenario. There's no attempt to ensure that the file Github had built is the one I would build with the same starting point.
It would be perfectly fine if you could prevent GitHub from linking the autogenerated archives from the releases or at least distinguish them in a way that makes it clear that they are not immutable maintainer-generated archives.
The nice thing about checksumming the tarball is that once you’ve done so, it doesn’t matter whether you trust GitHub or the HTTPS layer or not.
GitHub and its HTTPS cert provide no protection against a compromised project re‐tagging a repo with malicious source, or even deleting and re‐uploading a stable release tarball with something malicious.
For software distribution this actually sometimes goes the other way - debian/ubuntu uses http (no s) for their packages, because the content itself is signed by the distribution and this way you can easily cache it at multiple levels.
If you can't trust the archive published by the owner themselves, you are already screwed; a stable hash will just make sure that you trust harder that you are, indeed, downloading contaminated code.
I'm not sure most people here understand how checksums/hashes work, what they protect you against, and what they don't.
It isn't only that people don't know what checksums, hashes, and signatures do, it is also problematic that they blindly trust or ignore middlemen. Most supply chain "security" is security theater, almost never is something vetted end-to-end.
“Complicated” is indisputable. Cloning a repository is absolutely complicated. Fetching a single file over HTTPS is as simple as it gets, these days.
Just executing the ./configure will take more than that.
Huh? What I fully believe is that downloading a source tarball over HTTPS, verifying its checksum, and extracting it will take less time than cloning the repository from Git, then verifying the checksum of all files—which you said would take 29 seconds plus 0.4s.