Using `bup` (deduplicating backup tool using git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.
I'm currently evaluating `bupstash` (also written in Rust) as a replacment. It's faster and uses a lot less memory, but is younger and thus lacks some features.
Here is somebody's benchmark of bupstas (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/
The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.
I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or also other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that work with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplcating snapshot based saving.
I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.
On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)
> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.
> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.
I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and worst case, the engineers can't move forward.
In that light, I'd love to switch to another option. Preferably something decentralized just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."
All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?
I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770
If that's reasonably fast, perhaps an approach like that could work: server stores the entire pack, but upon user request extracts a delta between user's version and target binary.
Still, the devil is in the details of building all revisions of all software a single distribution has.
I wonder if this concept could be extended to other binary types that git has problems with, were you able to know/control more about the underlying binary format.
Honestly, I sort of looked at it for conventional backup strategy...as in, i wonder if it could work as a replacement for tar-zipping up a directory, etc. But, not sure if the use cases is appropriate.
Unfortunately it won't be uploaded until later but it will show up on the llvm YouTube channel:
git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.
If you have a lot of binary assets like artwork or huge excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and github will get angry at you for having too large a repo.
But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once w/ the initial clone, and also of course just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)
So you could imagine where something like having elfshaker embedded in the repo and integrated as a checkout filter could potentially be a useful alternative. Basically an efficient way to store binaries directly in the repo.
(Maybe it would be too small a band of use cases to be practicle though? Obviously if you have lots of distinct art assets, that's just going to be big, no matter what...)
Please see our new applicability section which explains the result in a bit more detail:
https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!
In this comment, you say "20% compression is pretty good". AFAIK, usually "X% compression" means the measure of the reduction in size, not the measure of the remaining. Thus, 0.01% compression sounds almost useless, very different from the 10,000x written next to it.
The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.
Is this the reason why manyclang (using llvms cmake based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?
> Most of them don't change very often so there are a lot of duplicate files,
> When they do change, the deltas of the [binaries] are not huge.
We need this but for node_modules
Node_modulea is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like elf sections/symbols etc)
It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.
[1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
It might make sense to check how they do it.
I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.
Did you know of their existence when making elfshaker?
Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...
* There are many files,
* Most of them don't change very often,
* When they do change, the deltas of the binaries are not huge.
So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using ZStandard.
But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.
Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.
manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs proves useful to people, I would love to see resource invested to make it more widely useful.
You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.
[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
Right now git lfs takes up so much space when storing files locally.
There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.
I'm not sure the linking step they provide is deterministic/hermetic, if it is that would prove a decent way to compress the final binaries while shaving most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries if so?
I'm not seeing any particular tricks done in elfshaker itself to enable this, the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
I'd like to see a comparison with Git and others listed here (same unlinked clang artifacts, compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...
These ROM-set archives — especially when using more modern compression algorithms, like LZMA/7zip — end up about 1.1x the size of a single one of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
I appreciate your response, and thanks very much for the clarification of use case; very helpful! Thanks also of course for building this!
manyclangs is optimized to provide you with a binary quickly. The binary is not necessarily itself optimized to be fast, because it's expected that a developer might want to access any version of it for the purposes of testing whether some input manifests a bug or has a particular codegen output. In that scenario, it's likely that the developer is able to reduce the size of the input such that the speed of the compiler itself is not terribly significant in the overall runtime. Therefore, I don't see LTO for manyclangs as such a significant win. But it is still hoped that the overall end-to-end runtime is good, and the binaries are optimized, just not with LTO.
As an idea of how elfshaker performs, we see ~300ms time to create a snapshot for clang, and ~seconds-to-minute to create a binary pack containing thousands of revisions. Extraction takes less than a second. One difference of elfshaker compared with some other software I tested is that we do the compression and decompression in parallel, which can make a very big difference on today's many-core machines.
`bupstash` lacked good compression. I settled with `borg` because I could use `zstd` compression with it. Currently at 60 snapshots of the directory and the `borg` repo directory is at ~1.52GB out of 5GB quota. The source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.
I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression compared to what it can do now. It's a shame, I really wanted to use it.
I do recognize that for many other scenarios `bupstash` is very solid though.
We've tweaked the readme, I hope it's clearer.
It would be great to provide this for gcc too. The project is new and we've just started out. I know less about gcc's build system and how hard it will be to apply these techniques there. It seems as though it should be possible though and I'd love to see it happen.
To infer the object->executable dependencies we currently read the compilation database and produce a stand-alone link.sh shell script, which gets packaged into each manyclangs snapshot.
A copy of Super Mario Advance 2 for Game Boy Advance, which is also a re-release of Super Mario World, almost surely uses its own engine and would not be part of the same rom set. Likewise, other Mario games (like Mario 64, Super Mario Bros, etc.) would not be part of the same rom set. So it's nothing about the series using the same engine code or assets.
We're talking bugfixes and different regions for the same game on the same console. But this still has the effect of dropping the size for complete console collections by 50% or more, because most consoles have 2-3 regions per game for most games.
Sometimes, ROM-image-based game titles were based on the same "engine" (i.e. the same core set of assembler source-files with fixed address-space target locations, and so fixed locations in a generated ROM image), but with a few engine modifications, and entirely different assets.
In a sense, this makes these different games effectively into mutual "full conversion ROMhacks" of one-another.
You'll usually find these different game titles compressed together into the same ROMset (with one game title — usually the one with the oldest official release — being considered the prototype for the others, and so naming the ROMset), because they do compress together very well — not near-totally, the way bugfix patches do, but adding only the total amount to the archive size that you'd expect for the additional new assets.
Well-known examples of this are Doki Doki Panic vs. Super Mario Bros 2; Panel de Pon vs. Tetris Attack; Gradius III vs. Parodius; and any game with editions, e.g. Pokemon or Megaman Battle Network.
But there are more "complete" examples as well, where you'd never even suspect the two titles are related, with the games perhaps existing in entirely-different genres. (I don't have a ROMset library on-hand to dig out examples, but if you dig through one, you'll find some amazing examples of engine reuse.)
If you knew where in the ROM image the level data was contained, you could modify it. As long as you didn't violate any constraints, the game would run fine.
You could also potentially influence game behavior as well.
The Game Genie and Gameshark were kind based on this concept. Except, being further along the chain, it could write values coming into and out of memory, so other effects were possible.
So, in the case of Super Mario Bros. ROMHacks, they all use Super Mario Bros. as a base ROM. Then from there, all you need to do is store the diff from the base.
Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree — sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.
On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.
Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])
I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
I remember reading about this technique years ago, thinking “cool when this catches on, software updates in all my software will be tiny”. But no, for some reason macos updates are still gigabytes in size. I have no idea why?
https://github.com/elfshaker/elfshaker#applicability
In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit identical object files, even though the build system often wants to rebuild them because some input has changed.
elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression with zstandard.
[0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
Perforce is still used by game developers and other creatives, because it handles large binaries, quite well.
In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.
The major distinction (besides this being generalized to pluggable binary container formats) is that what Courgette does happens in the context of delta comparison between two known binaries, where the goal is to create a minimal delta patch that can move you from one known binary to another known binary (given a sufficiently-smart updater, which actually "contains" a lot of the information in a Kolmogorov-complexity sense.)
There are efficiencies you can gain in encoding, that only work in a one-to-one, known-to-known comparison context. (Think "interstitial frames" in video; or RLE'd 1-bit delta-sigma audio encoding.) In these contexts, you need known1 on-hand already, plus a patch specific to the pair of {known1, known2}, to be able to recover known2. (This is why, in streaming video where non-delta-compressed "keyframes" are rare, you can't seek to arbitrary locations in the stream without the decoded video turning into a garbled mess for a while: you're receiving patches, but you don't have the [correct, clean] known1s to apply them to.)
But these efficiencies don't apply in the context of a system like git, where you might be checking out any arbitrary version of the data at any time, with any [or no!] other arbitrary version(s) already cached or checked out. To enable people to efficiently grab the delta they need to do the particular, arbitrary version-to-version jumps they need to do, you'd either need to generate the power-set of all updates; or at least enough entries to populate a skip-list (as Microsoft apparently does for Windows updates! [1]).
The powerset of updates is impractical to generate+store, but gets clients the ability to perform arbitrary jumps in O(1) time. With a skip-list of updates, on the other hand, either the server or the client needs to compute its way through all the skips, to combine them into one final update — so each jump would take O(log N) "patch application" computation steps to resolve. (This is why it takes the Microsoft servers a while to figure out which Windows updates your computer needs!) Neither approach really makes sense for a system like Git, where Git repos are entirely mirrored to every client (so the clients would need to store all those update patches), and where git checkouts are something you're supposed to be able to do often and quickly, in the middle of your daily workflow.
Meanwhile, the sort of approach I outlined would be optimized instead for storing currently known binaries, in a way most amenable to inter-compression against future, unknown versions of the binary, without the need to store any explicit known-to-known patches, or to compute the composition of such patches.
Rather than seeking to compress together existing available things as well as possible, a known-to-unknown compression approach seeks to segment and normalize the data in such a way that future revisions of the same data, put through the same process, would generate lots of content-identical duplicate component objects, which would end up being dedup'ed away when stored in a Content-Addressable Store.
It's the difference between "meeting in the middle" and canonicalization. Levenshtein distance vs. document fingerprinting. RAID5 vs. a fountain-codec-encoded dataset. Or, to be very literal, it's streaming-window Huffman coding, vs. merkle-trie encoding.
This "normalize, to later deduplicate" known-to-unknown compression approach, is the approach that Git itself is built on — but it only works well when the files Git works with are decomposed into mostly-independent leaf-node chunks. My proposed thought here, is just that it wouldn't be impossible to translate other types of files, into "mostly-independent leaf-node chunks" at commit time, such that Git's approach would then be applicable to them.
[1] https://devblogs.microsoft.com/oldnewthing/20200212-00/?p=10...