Git archive checksums may change (github.blog)

245 points by mcovalt 3 years ago | 240 comments

vtbassmatt 3 years ago |

Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

rsc 3 years ago | |

Thanks for the quick rollback.

I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.

GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.

matthewcroughan 3 years ago | | |

I've just had a thought. When GitHub do update the hashing for better compression, everyone relying on the tar hash will update their hashes. This is the ultimate opportunity to change the tar contents, affect the supply chain, introduce vulnerabilities, and have everyone trust you. Something like Nix which computes the NAR Hash (the result of the tar contents) will not be affected by this, since it only cares about the content. I think this is much better than worrying about an unlikely tar vulnerability. In a system that only trusts the tar hashes, the original source is not able to take advantage of better compression over time, without massive risk of supply chain attack. If you think you can hand me a tarball that can run arbitrary code, for any version of tar that has ever existed, please give it to me so I can experiment with exploits, and I'll buy you a drink of your choice at FOSDEM if you're there!

bentley 3 years ago | | |

I would also appreciate stronger advertising of the ability to turn a Git tag into a GitHub release and upload stable source code files to it. Maybe even a button in the GitHub releases interface to “generate source tarball and attach as stable tarball to this release.”

matthewcroughan 3 years ago | | |

https://floxdev.com/blog/hash-collision

vtbassmatt 3 years ago | |

We updated our Git version which made this change for the reasons explained. At the time we didn't foresee the impact. We're quickly rolling back the change now, as it's clear we need to look at this more closely to see if we can make the changes in a less disruptive way. Thanks for letting us know.

phphphphp 3 years ago | | |

Consumers often mistake “hasn’t changed” for a commitment to never change: any sufficiently large product will be littered with these kinds of implicit commitments made by the product to consumers that nobody has visibility into. You’re unfortunate that we were all relying on this commitment you’ve never made, but the quick reversion is the best we can hope for. People will theorise how this could have been avoided but c’est la vie; it’s an easy mistake that you’ve responded well to.

mdouglass 3 years ago | |

We are seeing an npm install failure inside our docker builds pointing at a github URL with a SHA change. Is this possibly related?

  #15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
  #0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
  #0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
  #0 8.164 npm ERR! code EINTEGRITY
  #0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
  #0 8.176 
  #0 8.177 npm ERR! A complete log of this run can be found in:
  #0 8.177 npm ERR!     /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
  #15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1

This was working earlier today and the docker build/package.json haven't changed.
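
For anyone unfamiliar with the `sha512-...` strings in the log above: they follow the Subresource Integrity format, the algorithm name followed by the base64-encoded raw digest of the tarball bytes. A minimal sketch of that comparison in Python, assuming nothing about npm's internals (the function names `sri_sha512` and `check_integrity` are hypothetical):

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Return an SRI-style integrity string: 'sha512-' + base64(raw digest)."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def check_integrity(data: bytes, wanted: str) -> bool:
    """Compare the pinned integrity string against the bytes actually received."""
    return sri_sha512(data) == wanted


# Illustrative bytes only; a real lockfile pins the hash of the downloaded tarball.
original = b"tarball bytes as first served"
pinned = sri_sha512(original)

# Same bytes: the check passes.
print(check_integrity(original, pinned))
# Any change to the bytes, even a pure re-compression, makes the check fail.
print(check_integrity(b"re-compressed tarball bytes", pinned))
```

The check sees only the bytes of the archive, so it cannot tell a harmless re-compression from a tampered tarball.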

andrewguenther 3 years ago | | |

Yes, this is the exact issue being described

voidbip 3 years ago | | |

Just want to second this. Still seeing an issue in our build right now that seems related.

  Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
  -- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
  [DEBUG] To include the environment variables in debug output, pass --debug-env
  [DEBUG] Feature flag 'binarycaching' unset
  [DEBUG] Feature flag 'manifests' = off
  [DEBUG] Feature flag 'compilertracking' unset
  [DEBUG] Feature flag 'registries' unset
  [DEBUG] Feature flag 'versions' unset
  [DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
  [DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
  Error: Failed to download from mirror set: File does not have the expected hash:
    url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
    File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
    Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
    Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]

kris-nova 3 years ago | |

Thanks for the update! There is only 1 internet to watch and learn from. We are all in this together. <3

denom 3 years ago | |

In my particular use-case, I'm using a set of local dev tools hosted as a homebrew tap.

The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula

What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...

lozenge 3 years ago | | |

They never really stored them, they were always generated by some code (maybe with a cache layer in front). The code changed in a way that changed the bytes in the tar.gz without affecting their contents-when-extracted.

crote 3 years ago | | |

The trick here is that a Github release is in essence simply a tag of a specific commit. There is no need to build archives in advance, as they can be dynamically generated from the git repo.

However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
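
A small illustration of that point, as a sketch rather than a reproduction of what GitHub or git actually did: here two gzip compression levels stand in for "two different compressor implementations", and `payload` is a made-up stand-in for an uncompressed tar stream.

```python
import gzip
import hashlib

# Stand-in for an uncompressed tar stream; any reasonably large, compressible input works.
payload = b"".join(b"line %d: the same source file contents\n" % i for i in range(5000))

# mtime=0 keeps the gzip header timestamp fixed, so only the compressor settings differ.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

# Checksums of the two archives, as a package manager would compute them.
print(hashlib.sha256(fast).hexdigest())
print(hashlib.sha256(best).hexdigest())

# Whether the archives differ byte for byte.
print(fast != best)
# Whether both still decompress to the identical content.
print(gzip.decompress(fast) == gzip.decompress(best) == payload)
```

Both outputs are valid gzip streams for the same input, so a checksum pinned against one of them says nothing about the other.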

Denvercoder9 3 years ago | | |

> Does this mean the entire set of zip/tar.gz archives has been rebuilt?

They are probably generated on-demand (and cached) from the Git repository, not prebuilt.

scyrybdis 3 years ago | | |

I think the zip/tar.gz archives are being created on the fly when you download them, probably with a caching layer in front.

vlovich123 3 years ago |

Hyrum's Law strikes again. It kind of doesn't matter what you document. If you weren't randomizing your checksum previously [1], you can't just spring this on the community and blame it for the fallout. I'm more shocked that there's resistance from the GitHub team saying "but we documented this isn't stable". Default stance for the team should be rollback & reevaluate an alternate path forward when the scope is this wide (e.g. only generating the new tarballs for future commits going forward).

[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.

lucb1e 3 years ago |

I didn't even know I should be depending on compression, file ordering, created-at file metadata, etc. being stable when pressing 'download repository as zip' (if I understand correctly what this is about, since the article doesn't really say). Perhaps it could be stable due to caching for a while after you first press it, but when it gets re-generated? I'm very surprised this was reproducible to begin with, given how much trouble other projects have with that.

For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.

slaymaker1907 3 years ago | |

I once had a small issue with a deployment at work because of ordering issues within a zip file. That order is important with Spring since that determines which classes are initialized first.

groestl 3 years ago | | |

One of the first things I check with every jvm packaging/deployment tool I investigate: does it preserve classpath ordering. Some offenders think -jar * is enough.

rfoo 3 years ago | |

> gpg seems to be what all of them use

GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.

Oh wait, what hash? :clown:

leoh 3 years ago | |

Many tools set mtime to zero to avoid checksum drift
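
A sketch of what that looks like when building a tarball yourself, using only Python's standard library. The helper name `deterministic_targz` is hypothetical, and the fixed mode and owner fields are assumptions about what one would want to normalize:

```python
import gzip
import hashlib
import io
import tarfile
from typing import Mapping


def deterministic_targz(files: Mapping[str, bytes]) -> bytes:
    """Build a .tar.gz whose bytes depend only on the file names and contents."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):               # fixed member order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0                       # no per-file timestamps
            info.mode = 0o644                    # fixed permissions
            info.uid = info.gid = 0              # no owner ids
            info.uname = info.gname = ""         # no owner names
            tar.addfile(info, io.BytesIO(files[name]))
    # mtime=0 also zeroes the timestamp stored in the gzip header.
    return gzip.compress(raw.getvalue(), compresslevel=9, mtime=0)


# Same files supplied in a different order produce the same archive bytes.
a = deterministic_targz({"b.txt": b"two", "a.txt": b"one"})
b = deterministic_targz({"a.txt": b"one", "b.txt": b"two"})
print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())
```

This only removes metadata drift. The compressed bytes still depend on the compressor implementation and settings, which is the part that changed in this incident.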

philipwhiuk 3 years ago | |

There are lots of methods to solve this problem - I imagine this was just easiest at the time given it appeared to work. Bazel devs on the list are discussing the best approach going forward - a simple change is to upload a fixed copy as a release artifact.

frankjr 3 years ago |

GitHub will need to revert this change. They've just crippled pretty much every "from source" package manager out there.

WayToDoor 3 years ago |

https://github.com/orgs/community/discussions/45830#discussi...

> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

skobovm 3 years ago |

I wonder what monetary loss in productivity was due to this change. We noticed this issue a bit before noon, tracked it down to GH, sent out company-wide comms notifying others of the problem, filed tickets with GH, had to modify numerous repos across multiple teams, and now it's 3pm and I'm here reading about it.

It's crazy how such a seemingly innocuous change, like this, could lead to such widespread loss in productivity across the globe.

misnome 3 years ago | |

Our conda-forge package builds broke. We had someone declare to us that tag downloads were never stable, just releases. This seems to be the opposite of the known truth about the previous status quo - but does go some way to demonstrating how little the state of the actual guarantees for this system were understood.

wildfire 3 years ago |

See https://github.com/orgs/community/discussions/45830 for the fallout.

kelnos 3 years ago |

The thing I don't get is how this ever worked.

The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.

But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...

FeepingCreature 3 years ago | |

gzip is 28 years old. I don't think the output changes anymore.

account42 3 years ago | | |

There is no reason to believe that it won't. Even after 28 years, there could be improvements merged for the compressor. Or perhaps especially after 28 years - we have a lot more memory now but it is slower when compared to our CPUs than it used to be so there is most likely room for tuning. Similar for patches that make use of newer CPU instructions - why would you expect them to take care to produce the exact same output rather than just the best compression possible for a perf budget.

jzelinskie 3 years ago |

Does anyone have the motivation for why the git project wants to use their own implementation of gzip? Did this implementation already exist and was being used for something else?

I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.

nemetroid 3 years ago | |

They're still using zlib to do the heavy lifting. It's not a large patch.

https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...

capableweb 3 years ago | | |

> So the internal implementation takes 17% longer on the Linux repo, but

> uses 2% less CPU time. That's because the external gzip can run in

> parallel on its own processor, while the internal one works sequentially

> and avoids the inter-process communication overhead.

> What are the benefits? Only an internal sequential implementation can

> offer this eco mode, and it allows avoiding the gzip(1) requirement.

It seems like they changed it because it uses less CPU, which makes sense from a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but should every user who runs the command locally lose 17% more time?

semiquaver 3 years ago | |

“Their own” implementation is just zlib, already in use throughout git since the dawn of the project for other purposes like blob storage [1].

Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.

[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

groestl 3 years ago | |

I think "Drop the dependency on gzip" for something like Git trumps a bit more exposure (which can be mitigated with thorough reviews).

Aissen 3 years ago |

It was publicly known that Github was breaking automatic git archives consistency for many years. Here is a bug on a project to stop relying on fake github archives (as opposed to stable git-archive(1)):

https://bugzilla.tianocore.org/show_bug.cgi?id=3099

At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit), I guess they must have stabilized it at some point. Here is an old issue before GitHub had a community issue tracker:

https://github.com/isaacs/github/issues/1483

I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.

doubleunplussed 3 years ago |

Ah, this will presumably break some Arch Linux AUR packages. Preparing for bug reports.

elesiuta 3 years ago | |

I always anticipated something like this could happen and it bothered me enough to create my own workflow [1] to archive, hash, and attach it to each release automatically for my AUR package. I can see how most people wouldn't notice/bother with such a small detail though, so I am not at all surprised by the fallout this caused.

[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...

frankjr 3 years ago | |

Yep, it has already broken labwc for me.

    ==> Validating source files with b2sums...
        labwc-0.6.1.tar.gz ... FAILED
    ==> ERROR: One or more files did not pass the validity check!

lopkeny12ko 3 years ago |

I can't fathom how no one internally at Microsoft-Github realized how widespread the breakage would be before rolling this out to all public users.

Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?

ilyt 3 years ago | |

I can

"didn't read every commit in new version of git, realized after the fact"

medellin 3 years ago |

I'm thinking of all the Bazel build rules that are about to break from my last company. Someone will have a fun day updating hundreds of hashes.

ErikCorry 3 years ago | |

Do they let Github generate the archives as one of the build rules instead of performing the archival and compression locally and uploading the result?

medellin 3 years ago | | |

Correct. Silly stuff like this happens when you don’t have systems in place that make it easy to store your own artifacts. Additionally a lot of people just want to get things done as quick as possible even if you have the tools in place.

jart 3 years ago | |

If they're using multiple URLs like a good Bazel user then they shouldn't be impacted.

thirtyseven 3 years ago | | |

The setup instructions for almost [1] every [2] major [3] rule set [4] only provide one (GitHub) url in the Starlark blob you're supposed to copy and paste, so hard to blame users here.

[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...

[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...

[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0

[4] https://github.com/bazelbuild/rules_scala

medellin 3 years ago | | |

They did where applicable, but I know that not all of them had multiple URLs.

UncleOxidant 3 years ago |

Lol... I was being burned by this just about an hour ago. Cloned a repo, did a build of the project (which uses Bazel to fetch dependencies) and it reported errors due to mismatch in expected checksums.

hamandcheese 3 years ago |

The fact that this is causing problems seems like a flaw in Bazel, imo. Nix, for example, calculates a hash of the contents of a tarball, rather than a hash of the tarball itself.
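
A rough sketch of the idea of hashing contents instead of the archive. To be clear about assumptions: this is not Nix's actual NAR serialization, the helpers `make_targz` and `content_hash` are hypothetical, and the sketch only covers regular files' names and bytes.

```python
import gzip
import hashlib
import io
import tarfile


def make_targz(files, level):
    """Hypothetical helper: pack files into a .tar.gz at the given gzip level."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return gzip.compress(raw.getvalue(), compresslevel=level, mtime=0)


def content_hash(archive: bytes) -> str:
    """Hash the extracted names and bytes, ignoring compression and member order."""
    h = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if not member.isfile():
                continue  # this sketch only covers regular files
            data = tar.extractfile(member).read()
            # Length-prefix each field so different name/data splits cannot collide.
            for field in (member.name.encode("utf-8"), data):
                h.update(len(field).to_bytes(8, "big"))
                h.update(field)
    return h.hexdigest()


files = {"README": b"hello\n", "src/main.c": b"int main(void) { return 0; }\n" * 200}
fast = make_targz(files, level=1)
best = make_targz(files, level=9)

# Compare the hashes of the archives themselves.
print(hashlib.sha256(fast).hexdigest() == hashlib.sha256(best).hexdigest())
# Compare the hashes of what the archives contain.
print(content_hash(fast) == content_hash(best))
```

The trade-off is that the archive has to be decompressed and parsed before anything has been verified.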

rfoo 3 years ago | |

Yep, Nix not affected at all is pretty impressive.

On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.

Foxboron 3 years ago | | |

They don't really do any source authentication at all. There is no strategy for checking gpg/minisign/whatever signatures and fetching keys to validate these things.

ArchOversight 3 years ago |

I remember a similar breakage happening before due to internal git changes, and thought it was common knowledge to upload your own signed tarballs for releases.

rektide 3 years ago |

Now please give us compression options beyond gzip? :) Some zstd & lz4 please?

metrognome 3 years ago |

I wonder if this incident will encourage our industry to build more robust forms of artifact integrity verification, or if we will instead codify the status quo of "we guarantee repos to be archived deterministically." To me, the latter seems like a more troubling precedent.

bentley 3 years ago | |

We’ve regressed from the previous norm of open source projects providing stable source tarballs with fixed checksums, sometimes even with cryptographic signatures.

reindeerer 3 years ago | | |

That norm still exists, and it's offered by Github in form of Github Releases feature as well.

It's the downstream tooling ( i.e. all the builds and package managers ) that need to clean their act up.

rswail 3 years ago | |

This is being driven in industry by the push by US FedGov (via NIST) to have supply chain verification after the recent hacks.

POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as spdx https://tools.spdx.org/app/about/

Where I work is also required to start documenting our supply chain as part of the (new, replacing PCI-DSS) PCI-SSF certification requirements, which requires end-to-end verification of artifacts that are deployed within PCI scope.

So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.

metrognome 3 years ago | | |

Oh, I'm not arguing that using checksums, SHA for example, for integrity verification is a bad idea. That's what they're designed for, after all.

However, I do think it's a bad idea to enforce the content of compressed archives to be deterministic. tar has never specified an ordering of its contents. Compression algorithms are parameterized for time and space, so their output should not be deterministic either. Both of these principles apply to zip as well. But we now have a situation where we are depending on both the archive format and the compression algorithm to produce a deterministic output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.

swarfield 3 years ago |

https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

1letterunixname 3 years ago |

Forever problem 0:

Tar/zipball archives on the same ref never have a stable hash.

Forever problem 1:

No sha256/512/3 hashes of said tar/zipballs.

Forever problem 2:

No metalinks for those.

Forever problem 3:

Not IPv6. Some of our network is IPv6 only.

Forever problem 4:

Hitting secondary rate limiting because I can browse fast.

fomine3 3 years ago |

I wasn't aware that git archive is reproducible

pabs3 3 years ago |

I note that diffoscope is useful for verifying which parts of git/other archives have changed:

https://diffoscope.org/

You can try it online here:

https://try.diffoscope.org/

swarfield 3 years ago |

They have broken almost every open source project that builds external deps. Also broke homebrew apparently.

capableweb 3 years ago | |

Good test that the tooling actually works when the checksums are incorrect :) If your "build from source" tool/workflow DIDN'T break, I'd be worried.

groestl 3 years ago | |

> every open source project that builds external deps

and relies on checksumming ephemeral artefacts for integrity.

catiopatio 3 years ago | | |

Source archives have never, in the entire history of open source, been considered ephemeral.

GitHub unilaterally made that decision for their own convenience, and violated a decades-long universal community norm in the process.

pxc 3 years ago | | |

Such tools should definitely checksum package sources lol

robomc 3 years ago |

Think this also broke github codespaces (the downloading of devcontainer "features").

jakeogh 3 years ago |

Github support, please check out: https://news.ycombinator.com/item?id=34606345

philipwhiuk 3 years ago |

Yet another reason why GitHub is not a good Artifactory/Nexus replacement.

Anyone remember the craziness when Homebrew had problems with using GitHub for the same thing?

naikrovek 3 years ago | |

this is a git behavior, not a GitHub behavior.

files uploaded to GH Packages are not modified by GitHub.

only the "Source Code (.zip)" and "Source Code (.tgz)" files that are part of releases and tags are affected because git generates them on demand, and git does not guarantee hash stability.

if you upload a package to GH Packages or upload a release asset to a GitHub releases those are never modified, and you can rely on those hashes.

philipwhiuk 3 years ago | | |

No, it's not.

GitHub chooses to do this. It's GitHub's choice to generate Source Code files on demand rather than when the release is made. It's a way of reducing their disk usage at the cost of this kind of potential problem.

The problem is they also presented it as if it was a stable reference. If people knew it was not stable they would have done what the Bazel devs are now talking about doing, which is also uploading the source code at release time, as an artifact (which is how it works on Nexus).

blcknight 3 years ago |

Oh god I spent like an hour debugging why gpg wouldn’t recognize the signature of RVM (Ruby version manager)

forgotpwd16 3 years ago |

Can anyone explain what happened? Thing changed, things broke, and things changed back in less than an hour.

zoobab 3 years ago |

Github devs cannot point to their git commit, because Github is not open source.

yakubin 3 years ago |

Now I’m having a laugh at all those times someone tried to explain to me that vendoring dependencies doesn’t make sense, when you have package managers which verify checksums of the things downloaded from GitHub/wherever. A good laugh.

Keep it simple, just vendor your deps.

reindeerer 3 years ago | |

This is a false choice. "Vendoring" is much more of a mess than this is, and second, there's no reason to rely on these on the fly tarballs for anything, when proper versioned software releases exist.

Github has pretty much a one-click ( or one API call ) workflow to create properly versioned and archived tarballs. Just because lots of people try to skirt proper version management doesn't mean you should commit the world into your repo

DoctorNick 3 years ago | |

With what? The abomination that is `git submodules`?

yakubin 3 years ago | | |

No. Just copy files into the repo. Any way you like. In a GUI, in a terminal: it doesn’t require a dedicated tool. Although cargo in Rust e.g. provides a subcommand for it (cargo vendor). Alternatively you can host the tarballs somewhere you control in static storage, be it a static web server, object storage or whatever.

How it’s done in Chromium: <https://source.chromium.org/chromium/chromium/src/+/main:thi...>.

SuperSandro2000 3 years ago |

That's why Nix unpacks the archives first and then hashes them.

gray_-_wolf 3 years ago |

Did people not know this? Honest question. I did run into this few times already before this change, so I assumed this would be wide-spread knowledge and mirrored everything.

skobovm 3 years ago | |

How would anyone (outside of GH) have known this? The checksums have been stable for years, and this issue resulted from an internal update to the version of Git being used. It also was not publicized, until this ex post facto blog post

anecdotal1 3 years ago | | |

They have not been stable

https://github.com/freebsd/freebsd-ports/commit/a43ec88422ee...

mhitza 3 years ago | |

https://xkcd.com/1053/

daniealapt 3 years ago |

Any change breaks a workflow - https://xkcd.com/1172/

capableweb 3 years ago | |

True, a small percentage will always be impacted by even the tiniest change. But this was not that: checksums all over the place started breaking, as lots of FOSS is hosted on GitHub and lots of infrastructure depends on checksums remaining the same, otherwise they error out (correctly).