Elfshaker: Version control system fine-tuned for binaries (github.com)

599 points by jim90 4 years ago | 113 comments

nh2 4 years ago |

I experimented with something similar with a Linux distribution's package binary cache.

Using `bup` (deduplicating backup tool using git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.

Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.

Details: https://github.com/NixOS/nixpkgs/issues/89380

peterwaller-arm 4 years ago | |

Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. I think this could in principle be fixed in bup one day, perhaps.

nh2 4 years ago | | |

I have also used bup for a long time, but found that for very large server backups I hit performance problems (both in time and memory usage).

I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but is younger and thus lacks some features.

Here is somebody's benchmark of `bupstash` (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/

The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.

I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or also other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that work with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.

ybkshaw 4 years ago | | |

Thank you for having such a good description on the project! Sometimes the links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on but not yours.

Siira 4 years ago | | |

Is elfshaker any good for backing up non-text data?

iTokio 4 years ago | |

interesting design document: https://raw.githubusercontent.com/bup/bup/master/DESIGN

mhx77 4 years ago |

Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs

I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.

peterwaller-arm 4 years ago | |

Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.

pdimitar 4 years ago | |

That's super impressive, I will definitely give it a go. Thanks for sharing!

thristian 4 years ago |

This seems very much like the Git repository format, with loose objects being collected into compressed pack files - except I think Git has smarter heuristics about which files are likely to compress well together. It would be interesting to see a comparison between this tool and Git used to store the same collection of similar files.

peterwaller-arm 4 years ago | |

An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.

We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.

On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.

johnyzee 4 years ago | | |

As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, which is suboptimal.

3np 4 years ago | | |

Do you think it would be feasible to do a git-lfs replacement based on elfshaker?

Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)

veselink1 4 years ago |

An author here, we've opened a Q&A discussion on GitHub: https://github.com/elfshaker/elfshaker/discussions/58.

wlll 4 years ago |

Related, and impressive: https://github.com/elfshaker/manyclangs

> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.

> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.

mal10c 4 years ago |

This project reminded me of something I've been looking for for a while - although it's not exactly what I'm looking for...

I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and worst case, the engineers can't move forward.

In that light, I'd love to switch to another option. Preferably something decentralized just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."

All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?

I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770

kvnhn 4 years ago | |

Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified and decoupled Git-LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.

[1]: https://github.com/kevin-hanselman/dud

henvic 4 years ago |

Interesting. I wonder if this can also be [ab]used to, say, deliver deltas of programs, so that you can have faster updates, but maybe it doesn't make sense.

https://en.wikipedia.org/wiki/Binary_delta_compression

peterwaller-arm 4 years ago | |

Author here, I don't think it would apply well to that scenario. elfshaker is good for manyclangs where we ship 2,000 revisions in one file (pack), so the cost of an individual revision is amortized. If one build of llvm+clang costs you some ~400 MiB, a single elfshaker pack containing 2,000 builds has an amortized cost of around 40 KiB/build. But this amazing win only happens because you are shipping 2,000 builds at once. If you wanted to ship a single delta, you couldn't compress against all the other builds.
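To make the amortization concrete, here is a back-of-the-envelope sketch. The ~400 MiB build size and the 2,000-build count come from the comment above; the 80 MiB pack size is an assumption, picked from the ~50-100 MiB range quoted elsewhere in the thread because it reproduces the ~40 KiB figure.

```python
def amortized_kib_per_build(pack_mib: float, builds: int) -> float:
    """Average share of the pack attributable to one build, in KiB."""
    return pack_mib * 1024 / builds


# Assumption: an 80 MiB pack (the thread quotes roughly 50-100 MiB per pack).
per_build = amortized_kib_per_build(80, 2000)
print(f"{per_build:.2f} KiB per build")

# Compare against shipping one ~400 MiB build on its own.
standalone_kib = 400 * 1024
print(f"about {standalone_kib / per_build:.0f}x smaller than a standalone build")
```

The saving only exists across the whole pack: a single build shipped alone has nothing to be amortized against.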

necovek 4 years ago | | |

How fast would it be to get a delta between any two of the 2,000 builds in a single elfshaker pack?

If that's reasonably fast, perhaps an approach like that could work: server stores the entire pack, but upon user request extracts a delta between user's version and target binary.

Still, the devil is in the details of building all revisions of all software a single distribution has.
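One way to picture the server-side idea is a file-level delta between two extracted builds. This is only a sketch under assumptions: the manifest layout, the function names, and the use of SHA-256 are all hypothetical and are not elfshaker's pack format or API.

```python
import hashlib
from typing import Dict, List, Tuple


def manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each path to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def file_delta(old: Dict[str, str], new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to send, paths to delete) needed to turn `old` into `new`."""
    send = sorted(p for p, digest in new.items() if old.get(p) != digest)
    delete = sorted(p for p in old if p not in new)
    return send, delete


# Two hypothetical builds that differ in one object file, plus one add and one removal.
build_a = manifest({"lib/a.o": b"aaa", "lib/b.o": b"bbb", "lib/c.o": b"ccc"})
build_b = manifest({"lib/a.o": b"aaa", "lib/b.o": b"BBB", "lib/d.o": b"ddd"})
print(file_delta(build_a, build_b))
```

A delta like this only skips unchanged files; changed files are still sent whole, so it does not capture the cross-build compression that makes the pack small.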

henvic 4 years ago | | |

Thank you for the insight!

tedunangst 4 years ago | |

This is how openbsd binary patches for the kernel work. The changed object files are shipped; the end system relinks a new kernel.

carlmr 4 years ago |

I find the description a bit confusing; is there an example where we can see the usage?

w0m 4 years ago | |

My top-level understanding is that it's a VCS (like Git) specialized for binaries, with commands baked in to prevent the slowdown that often comes with large git repositories.

throw_away 4 years ago | | |

Specifically, it's for ELF binaries built in such a way that adding a new function or new data does not break however they cache existing functions/data.

I wonder if this concept could be extended to other binary types that git has problems with, were you able to know/control more about the underlying binary format.

peterwaller-arm 4 years ago | |

One of the authors here, thanks for the feedback. We've tried to improve it here: https://github.com/elfshaker/elfshaker/pull/59

mxuribe 4 years ago | |

Same here. There is a usage guide, which helped a tiny bit: https://github.com/elfshaker/elfshaker/blob/main/docs/users/...

Honestly, I sort of looked at it for conventional backup strategy...as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.

peterwaller-arm 4 years ago | | |

Author here. We'd love this to be a thing, but this is young software, so we don't recommend relying on this as a single way of doing a backup for now. Bear in mind that our main use case is for things that you can reproduce in principle (builds of a commit history, see manyclangs).

xdfgh1112 4 years ago | | |

For backup you probably want something like Borg to handle deduplication of identical content between backups.

wyldfire 4 years ago | |

There is an associated presentation on manyclangs at LLVM dev meeting. I think they presented yesterday?

Unfortunately it won't be uploaded until later but it will show up on the llvm YouTube channel:

https://www.youtube.com/c/LLVMPROJ

ot 4 years ago | |

I would guess it’s a way to quickly bisect on compiler versions.
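A minimal sketch of what such a bisection could look like, assuming builds are ordered by commit and that every build from the first regression onward is bad. The `is_bad` callback is hypothetical; in practice it would extract a build from the pack and run a test against it.

```python
from typing import Callable, List, Sequence


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> int:
    """Binary search for the index of the first bad build.

    Assumes all good builds precede all bad ones and the last build is bad.
    """
    lo, hi = 0, len(builds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid          # first bad build is at mid or earlier
        else:
            lo = mid + 1      # first bad build is after mid
    return lo


# Hypothetical pack of 2,000 builds where the regression landed at build 1234.
builds = [f"build-{i:04d}" for i in range(2000)]
tested: List[str] = []


def is_bad(name: str) -> bool:
    tested.append(name)  # record how many builds had to be extracted and run
    return int(name.split("-")[1]) >= 1234


idx = first_bad(builds, is_bad)
print(builds[idx], "found after testing", len(tested), "builds")
```

With 2,000 builds the search needs at most 11 extractions, so at the few seconds per build quoted for manyclangs the whole bisection stays under a minute, excluding the test itself.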

dilap 4 years ago |

Huh, interesting, could you maybe use this as an in-repo alternative to something like git-lfs?

peterwaller-arm 4 years ago | |

Author here, I don't currently know how this compares to git-lfs. It is possible git-lfs would perform quite well on the same inputs as elfshaker works on. If git-lfs does already work well for your use case I'd recommend using that rather than elfshaker, as it is more established.

dilap 4 years ago | | |

Thanks for the response! I was more just curious about future possibilities vs immediate practical use.

git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.

If you have a lot of binary assets like artwork or huge excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and github will get angry at you for having too large a repo.

But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once w/ the initial clone, and also of course just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)

So you could imagine where something like having elfshaker embedded in the repo and integrated as a checkout filter could potentially be a useful alternative. Basically an efficient way to store binaries directly in the repo.

(Maybe it would be too small a band of use cases to be practical though? Obviously if you have lots of distinct art assets, that's just going to be big, no matter what...)

0942v8653 4 years ago |

Does it do any architecture-specific processing, i.e. BCJ filter? Or is there a generic version of this? The performance seems quite good.

peterwaller-arm 4 years ago | |

Author here. No architecture-specific processing currently. Most of the magic happens in zstandard (hat tip to this amazing project).

Please see our new applicability section which explains the result in a bit more detail:

https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...

In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!

beagle3 4 years ago | | |

Thanks, great project!

In this comment, you say "20% compression is pretty good". AFAIK, usually "X% compression" means the measure of the reduction in size, not the measure of the remaining. Thus, 0.01% compression sounds almost useless, very different from the 10,000x written next to it.
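The two conventions are easy to mix up, so here is a small sketch that computes both from the same pair of sizes. The function name and the example sizes are illustrative only; the sizes are chosen to match the 10,000x figure.

```python
def describe(original_size: float, compressed_size: float) -> dict:
    """Express one compression result in three common ways."""
    remaining_pct = 100 * compressed_size / original_size
    return {
        "ratio": original_size / compressed_size,   # "10,000x"
        "remaining_pct": remaining_pct,             # size left, as % of original
        "reduction_pct": 100 - remaining_pct,       # size saved, as % of original
    }


result = describe(original_size=10_000, compressed_size=1)
print(result)
```

Under the "reduction" reading, a 10,000x result is 99.99% compression; 0.01% is what remains.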

mrich 4 years ago |

I'm guessing this does not yield that high compression for release builds, where code can be optimized across translation units? Likewise when a commit changes a header that is included in many cpps?

peterwaller-arm 4 years ago | |

Author here. The executables shipped in manyclangs are release builds! The catch is that manyclangs stores object files pre-link. Executables are materialized by relinking after they are extracted with elfshaker.

The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).

As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.

zeotroph 4 years ago | | |

Thank you for the explanation, so the pre-link storage is one of the magical ingredients, maybe mention this as well in the README?

Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?

mrich 4 years ago | | |

Thanks. I had a use case in mind where LTO is enabled. Unfortunately the LTO step is quite expensive so relinking does not seem like a viable option. If I find some time I'll give it a try though.

lxe 4 years ago |

> There are many files,

> Most of them don't change very often so there are a lot of duplicate files,

> When they do change, the deltas of the [binaries] are not huge.

We need this but for node_modules

ithkuil 4 years ago | |

The novel trick here is splitting up huge binary files and treating them as if they were many small files.

node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols etc.)

aabbcc1241 4 years ago | |

Check out pnpm: it stores each version of a package only once, and sets up your project's node_modules with symbolic links to the exact cached version.

cerved 4 years ago | |

for what reason?

lxpz 4 years ago |

This should be integrated with Cargo to reduce the size of the target directories which are becoming ridiculously large.

peterwaller-arm 4 years ago | |

Author here. I'm unsure whether this would apply very well to cargo or not. If it has lots of pre-link object files, then maybe.

londons_explore 4 years ago |

I'd like to see a version of this built into things like IPFS.

It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.

hcs 4 years ago | |

It should be possible to do this in IPFS already if you use the go-ipfs --chunker option with a content-sensitive chunking algorithm like rabin or buzhash [1]. With this there's a good chance that a file with small changes from something already on IPFS will have some chunks that hash identically, so they'll be shared.

[1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
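To illustrate why content-defined chunking lets shifted data still share chunks, here is a toy chunker. It uses a simple gear-style rolling hash as a stand-in; it is not the rabin or buzhash implementation from go-ipfs, and the table seed, mask, and sizes are arbitrary choices for the demo.

```python
import hashlib
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte value
MASK = (1 << 12) - 1  # cut when the low 12 bits are zero: ~4 KiB average chunk


def chunk(data: bytes) -> list:
    """Split data at content-defined boundaries using a gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left pushes older bytes out of the 32-bit state, so the
        # hash (and thus each cut point) depends only on the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def digests(chunks: list) -> set:
    """Content addresses of the chunks, as a block store would see them."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


original = bytes(_rng.getrandbits(8) for _ in range(200_000))
edited = b"7 bytes" + original  # small insertion at the very front

a, b = digests(chunk(original)), digests(chunk(edited))
print(f"{len(a & b)} of {len(a)} chunks are shared after the edit")
```

With fixed-size blocks the same 7-byte insertion would shift every block boundary and nothing would be shared; here only the chunks near the edit change.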

londons_explore 4 years ago | | |

But that isn't quite as good as something like this that can 'understand' diffs in files, rather than simply relying on the fact a bunch of bytes in a row might be the same.

jankotek 4 years ago |

Does it make sense to turn it into a FUSE filesystem, with transparent deduplication?

peterwaller-arm 4 years ago | |

Author here. Maybe, it's a fun idea. I have toyed with providing a fuse filesystem for access to a pack but my time for completing this is limited at the moment.

nh2 4 years ago | | |

Many packfile-deduplicating backup tools (bup, kopia, borg, restic) can mount the deduplicated storage as FUSE.

It might make sense to check how they do it.

I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.

Did you know of their existence when making elfshaker?

Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...

ghoul2 4 years ago |

If I already have, let's say, a 100MB pack file containing (say) 200 builds of clang and then I import the 201st build into that pack file - is it possible to send across a small delta of this new, updated pack file to someone else who already had the older pack file (with 200 builds) such that they can apply the delta to the old pack and get the new pack containing 201 builds?

splittingTimes 4 years ago |

In general, how do you handle repos that have a combination of source files and a lot of binary data side by side?

bogwog 4 years ago |

Does this work well with image files? (PNG, JPEG, etc)

peterwaller-arm 4 years ago | |

Author here, it works particularly well for our presented use case because it has these properties:

* There are many files,

* Most of them don't change very often,

* When they do change, the deltas of the binaries are not huge.

So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio roughly equivalent to what you'd get by concatenating the two revisions of the file and compressing them with Zstandard.
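A rough way to see this effect for yourself is to compress two revisions together and compare. The sketch below uses `zlib` from the Python standard library as a stand-in for Zstandard, with random bytes standing in for file contents; the blobs are kept small because zlib only looks back 32 KiB.

```python
import random
import zlib

rng = random.Random(0)

# Revision 1: incompressible on its own, like already-compressed image data.
rev1 = bytes(rng.getrandbits(8) for _ in range(8_000))
# Revision 2: the same bytes with a small local edit (a small binary delta).
rev2 = rev1[:4_000] + b"patched!" + rev1[4_000:]
# Revision 3: completely different bytes, e.g. an image that was re-encoded.
rev3 = bytes(rng.getrandbits(8) for _ in range(8_000))


def packed_size(*blobs: bytes) -> int:
    """Size of the blobs concatenated and compressed as a single stream."""
    return len(zlib.compress(b"".join(blobs), 9))


similar = packed_size(rev1, rev2)
unrelated = packed_size(rev1, rev3)
print(f"rev1 + small edit:      {similar} bytes")
print(f"rev1 + unrelated bytes: {unrelated} bytes")
```

When the second revision is a small edit, it costs almost nothing extra; when it shares no byte runs with the first, the pair compresses no better than two independent files.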

IceWreck 4 years ago | | |

Please add these points under a usecase heading in your README.

shp0ngle 4 years ago | | |

Ahhh that’s the key insight I have been missing, and that should be higher somewhere.

Thanks

i_like_waiting 4 years ago |

Thanks, this seems like it could be a good solution for storing daily DB backups. I didn't know I needed it, but it seems like I do.

peterwaller-arm 4 years ago | |

Author here, this software is young, please don't use it for backups!

But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.

wpietri 4 years ago | | |

In this age of rampant puffery, it's so... soothing to see somebody be positive and frank about the limits of their creation. Thanks for this and all your comments here!

the_duke 4 years ago | |

Borg, bup or restic are relatively popular incremental backup tools that deduplicate with chunking.

phil294 4 years ago | |

Have a look at Borg, it handles incremental backups very well

tttsxhub 4 years ago |

Why does it depend on the CPU architecture?

peterwaller-arm 4 years ago | |

(Disclosure: I work for Arm, opinions are my own)

Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.

manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.

You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.

[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...

999900000999 4 years ago |

Would love to see this work with Unity projects.

Right now git lfs takes up so much space when storing files locally.

erichocean 4 years ago |

Seems like the Nix people would be interested in enabling this kind of thing for Nix packages…

yincrash 4 years ago |

Could this be useful for packing xcode's deriveddata folder for caching in ci builds?

xpe 4 years ago |

Never shake a baby elf!

cyounkins 4 years ago |

Cool! I wonder how this would compare to ZFS deduplication.

veselink1 4 years ago | |

An author here. elfshaker uses per-file deduplication. When building manyclangs packs, we observed that the deduplicated content is about 10 GiB in size. After compression with `elfshaker pack`, that comes down to ~100 MiB.

There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.
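Per-file deduplication can be sketched as a content-addressed store in which each distinct file body is kept once and every snapshot is just an index of paths to digests. The class and method names and the use of SHA-256 are assumptions for illustration, not elfshaker's actual data structures, and the later compression step that takes ~10 GiB down to ~100 MiB is not shown.

```python
import hashlib
from typing import Dict


class Store:
    """Toy per-file deduplicating store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}              # digest -> content, stored once
        self.snapshots: Dict[str, Dict[str, str]] = {}   # snapshot -> {path: digest}

    def snapshot(self, name: str, files: Dict[str, bytes]) -> None:
        index = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # skip content already stored
            index[path] = digest
        self.snapshots[name] = index

    def extract(self, name: str) -> Dict[str, bytes]:
        return {path: self.objects[d] for path, d in self.snapshots[name].items()}


store = Store()
build1 = {"a.o": b"alpha", "b.o": b"beta", "c.o": b"gamma"}
build2 = {"a.o": b"alpha", "b.o": b"beta v2", "c.o": b"gamma"}  # one file changed
store.snapshot("build1", build1)
store.snapshot("build2", build2)
print(len(store.objects), "objects stored for 6 files across 2 snapshots")
```

Deduplication at this level only removes exact duplicates; the near-duplicates (such as `b.o` above) are what the subsequent compression pass has to exploit.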

goodpoint 4 years ago |

I'm surprised nobody mentioned git-annex. It does the same using git for metadata. It's extremely efficient.

kristjansson 4 years ago | |

AFAIK, git-annex doesn't address sub-file deduplication/compression at all, it just stores a new copy for each new hash it sees? I suppose that content-addressed storage, combined with the pre-link strategy discussed elsewhere for the related manyclangs project would produce similar, if less spectacular, results?

billconan 4 years ago |

so how do I do diff and merge (resolve conflicts)?

axismundi 4 years ago |

does it work on intel macs?

svilen_dobrev 4 years ago |

Will some of these work for (compressed) variants of audio? They're never the same.

peterwaller-arm 4 years ago | |

Author here. Compressed data is unlikely to work well in general, unless it never changes.