How gzip uses Huffman coding(jvns.ca) |
How gzip uses Huffman coding(jvns.ca) |
Half the challenge was just understanding what the RFCs were talking about, teaching ourselves huffman coding, and then figuring out how to build this low level stuff in Java in a reasonable way. This was definitely one of the hardest projects I had up until that point.
We had a class discussion afterwards about what was so difficult about implementing GZIP and a few of us mentioned the RFCs were tough to figure out. Professor laughed and said something along the lines of "Besides just learning computer science, we're hoping to teach you the value of good documentation!"
On the other hand, it is probably the simplest compression format in widespread use that uses variable-length encoding, and there's a few implementation around for you to look at. Things like bzip2 and LZMA2 really don't have independent specifications and multiple implementations, and are more complicated to boot.
Byte-oriented Lempel-Ziv formats like LZ4, LZO and LZJB would be a good way to dip your toes into a production compression format. The LZW encoding used by GIF files is also "simpler" in principle but I personally find it harder to wrap my head around.
If you want to experiment with binary formats, it might also be interesting to write a decoder for a simple image format, like PCX files, or even for old game file formats like Doom's [http://doom.wikia.com/wiki/WAD].