Dissecting the gzip format (2011)

Dissecting the gzip format (2011)(infinitepartitions.com)

113 points by solannou 1 year ago | 55 comments

Filligree 1 year ago |

One of my favorite gzip party tricks is that (ungzip (cat (gzip a) (gzip b))) == (cat a b). That is to say, the concatenation of two gzip streams is still a valid gzip file.

This hasn't ever been practically useful, but it means you can trivially create a 19-layer gzip file containing more prayer strips than there are atoms in the universe, providing a theological superweapon. All you need to do is write it to a USB-stick, then drop the USB-stick in a river, and you will instantly cause a heavenly crisis of hyperinflation.

mbreese 1 year ago | |

In bioinformatics we use a modified gzip format called bgzip that exploits this fact heavily. The entire file is made of concatenated gzip chunks. Each chunk then contains the size of the chunk (stored in the gzip header). This lets you do random access inside the compressed blocks more efficiently.

Sadly, the authors hard coded the expected headers so it’s not fully gzip compatible (you can’t add your own arbitrary headers). For example, I wanted to add a chunk hash and optional encryption by adding my own header elements. But as the original tooling all expects a fixed header, it can’t be done in the existing format.

But overall it is easily indexed and makes reading compressed data pretty easy.

So, there you go - a practical use for a gzip party trick!

egorpv 1 year ago | | |

In such case, imo, it would be more appropriate to use ZIP container format which supports multiple independent entries (files) and index table (including sizes) for them. Compression algorithm is essentially the same (Deflate) so it would not be bloated in any way. As for practical implementations of serializing complex "objects" into ZIP, numpy [0] can be mentioned as an example.

[0] https://numpy.org/doc/stable/reference/generated/numpy.savez...

Arkanosis 1 year ago | | |

On top of enabling indexing, it reduces the amount of data lost in the event of data corruption — something you get for free with block-based compression algorithms like BWT-based bzip2 but is most of the time missing from dictionary-based algorithms like LZ-based gzip.

I don't think many people use that last property or are even aware of it, which is a shame. I wrote a tool (bamrescue) to easily recover data from uncorrupted blocks of corrupted BAM files while dropping the corrupted blocks and it works great, but I'd be surprised if such tools were frequently used.

ynik 1 year ago | |

Even cooler is that it's possible to create an infinite-layer gzip file: https://honno.dev/gzip-quine/

noirscape 1 year ago | |

Is that by any chance related to how the TAR format was developed?

Considering the big thing with TAR is that you can also concatenate it together (the format is quite literally just file header + content ad infinitum; it was designed for tape storage - it's also the best concatenation format if you need to send an absolute truckloads of files to a different computer/drive since the tar utility doesn't need to index anything beforehand), making gzip also capable of doing the same logic but with compression seems like a logical followthrough.

Suppafly 1 year ago | | |

I assume tar came first just for grouping things together and then compression came out and they were combined together. That's why the unix tarballs were always tar.gz prior to gz having the ability to do both things.

fwip 1 year ago | |

We use it at $dayjob to concatenate multi-GB gzipped files. We could decompress, cat, and compress again, but why spend the cycles?

Joker_vD 1 year ago | |

> This hasn't ever been practically useful,

I used it a couple times to merge chunks of gzipped CSV together, you know, like "cat 2024-Jan.csv.gz 2024-Feb.csv.gz 2024-Mar.csv.gz > 2024-Q1.csv.gz". Of course, it only works when there is no column headers.

wwader 1 year ago | |

I think the alpine package format do use this in combination with tar being similar https://wiki.alpinelinux.org/wiki/Apk_spec

Hakkin 1 year ago | |

This is true for a few different compression formats, it works for bzip2 too. I've processed a few TBs of text via `curl | tar -xOf - | bzip2 -dc | grep` for tar files with lots of individually compressed bz2 files inside.

hansvm 1 year ago | |

It's kind of nice whenever you find yourself in an environment where, for whatever reason, you need to split a payload before sending it. You just `ungzip cat *` the gzipped files you've collected on the other end.

blibble 1 year ago | |

I've used this trick to build docker images from a collection of layers on the fly without de/recompression

kajaktum 1 year ago | |

I think ZSTD is also like this.

userbinator 1 year ago |

Besides a persistent off-by-one error, and the use of actual trees instead of table lookup for canonical Huffman, this is a pretty good summary of the LZ+Huffman process in general; and in the 80s through the mid 90s, this combination was widely used for data compression, before the much more resource-intensive adaptive arithmetic schemes started becoming popular. It's worth noting that the specifics of DEFLATE were designed by Phil Katz, of PKZIP fame. Meanwhile, a competing format, the Japanese LZH, was chosen by several large BIOS companies for compressing logos and runtime code.

Note that real-world GZIP decoders (such as the GNU GZIP program) skip this step and opt to create a much more efficient lookup table structure. However, representing the Huffman tree literally as shown in listing 10 makes the subsequent decoding code much easier to understand.

Is it? I found the classic tree-based approach to become much clearer and simpler when expressed as a table lookup --- along with the realisation that the canonical Huffman codes are nothing more than binary numbers.

082349872349872 1 year ago | |

I'd agree with TFA that canonical Huffman, although interesting, would be yet another thing to explain, and better left out of scope, but it does raise a question:

In what other areas (there must be many) do we use trees in principle but sequences in practice?

(eg code: we think of it as a tree, yet we store source as a string and run executables which —at least when statically linked— are also stored as strings)

mxmlnkn 1 year ago | | |

> In what other areas (there must be many) do we use trees in principle but sequences in practice?

Heapsort comes to mind first.

commandlinefan 1 year ago | |

Author here! I actually did a follow-up where I looked at the table-based decoding: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...

yyyk 1 year ago | |

>before the much more resource-intensive adaptive arithmetic schemes started becoming popular

The biggest problem was software-patent stuff nobody wanted to risk before they expired.

lifthrasiir 1 year ago | |

The lookup table here would be indexed by a fixed number of lookahead bits, so it for example would duplicate shorter codes and put longer codes into side tables. So a tree structure, either represented as an array or a pointer-chasing structure would be much simpler.

jmillikin 1 year ago |

(2011)

Formatted version: https://infinitepartitions.com/cgi-bin/showarticle.cgi?artic...

Laiho 1 year ago |

If you prefer reading Python, I implemented the decompressor not too long ago: https://github.com/LaihoE/tiralabra

kuharich 1 year ago |

Past comments: https://news.ycombinator.com/item?id=6920822

solannou 1 year ago | |

Thanks a lot, how do you find this previous article so fast?

Lammy 1 year ago | | |

Click the domain name in parentheses to the right of the submission title.

hk1337 1 year ago |

Why do people use gzip more often than bzip? There must be some benefit but I don’t really see it, you can split and join two bzipped files (presumably CSV so you can see the extra rows). Bzip seems to compress better than gzip too.

082349872349872 1 year ago |

Has anyone taken the coding as compression (when you create repeated behaviour, stuff it in the dictionary via creating a function; switching frameworks is changing initial dicts; etc.) metaphor seriously?

eapriv 1 year ago | |

Yes. https://caseymuratori.com/blog_0015

commandlinefan 1 year ago | |

Sounds like LZW compression to me - is what you're thinking of different than that?

082349872349872 1 year ago | | |

Different in terms of applying the same principle to produce "semantically compressed" code — instead of having each new dictionary entry be a reference to an older one along with the new symbol, each new function will refer to older ones along with some new literal data.

(If that still doesn't make sense, see the sibling comment to yours.)