Bzip3 – A better and stronger spiritual successor to bzip2

Bzip3 – A better and stronger spiritual successor to bzip2 (github.com)

158 points by palaiologos 4 years ago | 104 comments

hannob 4 years ago |

It seems somewhat suspicious that the benchmarks don't compare to zstd.

It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing sales pitch given bzip2 is mostly of historic interest these days.

Right now the modern compression field is basically covered by xz (if you mostly care about best compression ratio) and zstd (if you want decent compression and very good speed), so when someone wants to pitch a new compression they should tell me where it stands compared to those.

nousermane 4 years ago | |

> benchmarks don't compare to zstd.

  wget http://corpus.canterbury.ac.nz/resources/calgary.tar.gz
  zcat calgary.tar.gz|time zstd -19|wc -c

  902963 (=902.9KB, vs. 807.9KB for bzip3)

> "Better than bzip2" isn't exactly a convincing sales pitch

Sure, but nobody is pitching that. TFA does compare to lzma (~xz), and claims bzip3 outperforms it quite handsomely in speed, while being competitive in compression ratio.

grumpyprole 4 years ago | |

There's more to a compression standard than benchmarks and ratios. Xz does not seem to score well in other, perhaps more important areas: https://www.nongnu.org/lzip/xz_inadequate.html

sounds 4 years ago | |

Can you test it out and post results back here?

linsomniac 4 years ago | | |

For linux-5.17.6.tar:

Original file: 129MB xz, 1.2G uncompressed.

"zstd -T0": 1.34 seconds, 189M

"xz -T0": 63 seconds, 131M

"xz -T0 -9": 183 seconds, 125M

"bzip3 -e -j 6": 21 seconds, 129M (edited, was SIGSEGV)

"bzip3 -e": 84 seconds, 129M

I used linux source because the source website uses linux and recommends bzip3 for compressing source and text. Results were on Ubuntu 22.04, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz

Matumio 4 years ago | | |

No, but it would be nice to see visually where it is on the size vs (de-)compression speed pareto-front. Like this graphic (from the zstd homepage): https://raw.githubusercontent.com/facebook/zstd/master/doc/i...

klauspost 4 years ago |

Looks interesting, but my main objection to general adoption is the same as for bzip2, lzma and context-modelling-based codecs: decompression speed.

For compressed logs, for instance, a decompression speed of 23MB/s per core is simply too slow when you need to grep through gigabytes of data. The same goes for data analysis: you don't want your input speed to be this limited when analysing gigabytes of data.

I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.

bayindirh 4 years ago | |

> I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manners to take over a name like this.

I think it boils down to the feelings of the author (of the previous format).

I don't think PKWARE feels bad because ZSTD is a homage to ZIP. Similarly if someone created a follow-up file format to something I've designed, I'd just want credit or a link to my version as a homage and pointer for history continuity, nothing else.

Open source software is designed to be mangled, modified, shared and leapfrogged. If a completely different implementation advertises itself as a newer iteration of a format because it's built on the same theory, I think it's ethical as long as the developer is not intending to capitalize on the name. In either case, if the original developer returns to the game, they can create a BZIP4 and point to the diversion with, "hey, somebody liked BZIP2 too much and created this, give him/her a kudos", and continue.

altairprime 4 years ago | | |

Pkware might not have been so forgiving if someone had released ZIP2. Incrementing the version number like that is only an acceptable thing in relatively unusual circumstances, but does happen sometimes; and still, I would really hesitate to say that it’s a good idea for a third party to call itself bzip3.

The original author replaced their own bzip release with bzip2 to avoid a patent issue with arithmetic coding, so this is the first time a third party has done so: https://web.archive.org/web/19980704181204/http://www.muraro...

So if the release of bzip3 is approved by the current maintainer, then I guess it’s fine, but otherwise it makes me uncomfortable to consider using under this name.

eru 4 years ago | | |

> Open source software is designed to be mangled, modified, shared and leapfrogged.

I agree in spirit, but I can also see why someone might want their source to be free and mangleable, but still care about trademarks.

(Just imagine Linus Torvalds getting lots of emails with support requests for a hypothetical Linux2 operating system that I wrote, and that he has no relation with. That could become pretty annoying; even if he doesn't mind me taking his source code.)

adrianmonk 4 years ago | | |

> I don't think PKWARE feels bad because ZSTD is a homage to ZIP.

Is zstd actually an homage to zip?

I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.

The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format. As do two very old, obsolete Unix compression programs: "pack"[3], which uses a ".z" suffix for compressed files, and "compress"[4], which uses a ".Z" suffix for compressed files.

In those algorithm names, the L and Z stand for Lempel and Ziv, respectively. But interestingly, the Unix "pack" program uses a ".z" suffix even though its algorithm is just Huffman (not one of the Lempel-Ziv family of algorithms), so the letter Z somehow came to signify data compression more generally.

Rough timeline of letter Z in data compression:

1977: LZ77

1978: LZ78

1982 or earlier: Unix "pack" (.z)

1984: LZW

1985: Unix "compress" (.Z)

1989: PKZIP

---

[1] https://en.wikipedia.org/wiki/LZ77_and_LZ78

[2] https://en.wikipedia.org/wiki/Lempel-Ziv-Welch

[3] https://en.wikipedia.org/wiki/Pack_(compression)

[4] https://en.wikipedia.org/wiki/Compress

perihelions 4 years ago | |

That's funny, 23 MiB/s is exactly what I get for reading systemd logs (on an NVME SSD). Is it supposed to be otherwise?

    $ sudo journalctl -r | pv -a > /dev/null
    [22.8MiB/s]

yakubin 4 years ago | | |

That appears to be systemd being slow.

  $ dd if=/dev/urandom of=test bs=1G count=1 iflag=fullblock
  $ gzip -k test
  $ zcat test.gz | pv -a >/dev/null
  [ 228MiB/s]

  $ sudo journalctl -r | pv -a >/dev/null
  [13.1MiB/s]

UPDATE: Gzip with more real-world data[1]:

  $ gzip -k adventures-of-huckleberry-finn.txt
  $ zcat adventures-of-huckleberry-finn.txt.gz | pv -a >/dev/null
  [ 151MiB/s]

[1]: <https://gutenberg.org/files/76/76-0.txt>

erk__ 4 years ago | | |

I think the lack of speed here is more that it has to serialize the data from disk into a readable format. I assume using the `--grep=` option is faster than piping it through grep because of this

vintermann 4 years ago | |

With the other parts of the codec, I doubt it's possible for files compressed with this, but one of the strengths of BWT-based compression is that there has been a lot of research on search operations directly on compressed data.

usefulcat 4 years ago | |

If you’re using xz, pixz can do multithreaded decompression. It’s still xz/lzma, so still expensive to decompress, but at least that allows you to throw as many cores as you want at it.

lynguist 4 years ago |

If anyone just cares for speed instead of compression I’d recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.

[1] https://github.com/lz4/lz4

klodolph 4 years ago | |

Zstandard achieves similar speeds at higher ratios. LZ4 only comes out ahead if you use LZ4 at, like, level 1.

klauspost 4 years ago | | |

Yes, zstd has forced bitwise match coding, whereas lz4 is byte-aligned and with inline literals.

So lz4 has some base advantages in terms of speed, which zstd is unlikely to match. But as you point out it is only relevant for very high speed operations.

Beltalowda 4 years ago | | |

It's still quite a bit faster than zstd -1, at least according to their GitHub page. The trade-off is it's a worse compression ratio (2.8 vs. 2.1), but in some cases that's a good trade-off.

adgjlsfhk1 4 years ago | | |

zstd has similar compression speeds, but lz4 crushes everything else for decompression.

loeg 4 years ago | | |

My impression was that lz4 ratios were still marginally better than zstd for the same compression speed, and decompression is much, much faster.

pcwalton 4 years ago |

The Burrows-Wheeler transform, which was the main innovation of bzip2 over gzip, and which this bzip3 retains, is one of the most fascinating algorithms to study: https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

It hasn't been used lately because of the computational overhead, but it's interesting and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
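For anyone who wants to see the transform in action, here is a deliberately naive sketch. It is not how bzip2 or bzip3 implement it (real compressors build the BWT from a suffix array rather than sorting full rotations); the function names and the `\0` sentinel are assumptions chosen for illustration.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    assert eos not in text, "sentinel must not occur in the input"
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(repr(transformed))          # equal characters end up next to each other
    print(inverse_bwt(transformed))   # recovers the input
```

The point of the transform is visible in the output: characters that share a following context get grouped together, which makes the later move-to-front and entropy coding stages much more effective. The quadratic-or-worse cost of this naive version is exactly the computational overhead mentioned above.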

Klasiaster 4 years ago |

Here are some other BWT compressors in the large text compression benchmark (look for "BWT" in the "Alg" column): http://mattmahoney.net/dc/text.html

And here is a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks

denzquix 4 years ago |

From their own benchmarks it seems more like bzip3 is geared towards a different compression/speed trade-off than bzip2, rather than an unambiguous all-around improvement. Am I misreading it?

once_inc 4 years ago | |

That's what I took out of it too. Sacrificed a bit of speed and a lot of memory for a smaller output size.

edit: ah, bzip3 is parallelizable, while bzip2 isn't. That alone is enough for me to be able to claim 'faster'.

thilog 4 years ago | | |

bzip2 can exploit concurrency through pbzip2, can't it?

joelthelion 4 years ago |

In the Era of zstandard, do we really need this?

wongarsu 4 years ago | |

I find it somewhat telling that they don't benchmark themselves against zstd.

Right now I'm almost exclusively using zstd (general stuff) or lzma2/xz (high compression where read speed doesn't matter). And of course gz and zip for data interchange where compatibility is key. From the information presented bzip3 won't replace any of those use cases for me, but that's fine. Maybe it fits somebody else's use case, or maybe it's the foundation for the next great algorithm that we all end up using.

palaiologos 4 years ago | | |

  zstd -19 linux.tar  462.58s user 0.76s system 100% cpu 217M memory 7:42.56 total

  % wc -c linux.tar.zst linux.bz3
  134980904 linux.tar.zst
  129255792 linux.bz3

cout 4 years ago | | |

Have you ever tried lzma/lzma2 with the hc3 (hash chain) match finder instead of the default (bt3 or bt4) match finder? I've found this to be a really good middle ground between gz/deflate and lzma2 with default settings.
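A minimal sketch of what switching the match finder looks like, using Python's standard `lzma` module rather than the `xz` command line. The preset value and the sample data are assumptions picked for illustration; actual speed and ratio differences depend heavily on the input, so no numbers are claimed here.

```python
import lzma


def xz_compress(data: bytes, match_finder: int) -> bytes:
    """Compress to .xz with a custom LZMA2 filter chain and the given match finder."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "mf": match_finder}]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


if __name__ == "__main__":
    # Hypothetical sample input; substitute your own data to compare.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
    for name, mf in (("hc3", lzma.MF_HC3), ("bt4", lzma.MF_BT4)):
        packed = xz_compress(sample, mf)
        assert lzma.decompress(packed) == sample
        print(name, len(packed))
```

Both outputs are ordinary .xz streams, so any decompressor handles them the same way; only the encoder's search strategy differs.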

proofrock 4 years ago | |

Yes, because someone said the same when zstandard came out. This may not have the same strong points, but maybe the next will… compression is not a completed task.

trasz 4 years ago | |

Not to mention the restrictive license which effectively prohibits its use in any Open Source project licensed under anything other than GPLv3.

palaiologos 4 years ago | | |

Frankly, same holds for gzip. I've been planning to relicense bzip3 with the more permissive LGPLv3.

baybal2 4 years ago | |

1. zStandard is not a standard

2. Bzip2 is somewhat of a standard

3. zStandard is not a substitute for Bzip2

Beltalowda 4 years ago | | |

In what way is bzip2 more of a "standard" than zstd? bzip2 doesn't even seem to have any official reference description of its file format; just an "unofficial" one[1], whereas zstd is RFC 8478[2].

When I evaluated various compression algorithms a few years ago zstd came ahead of bzip2 in every metric.

[1]: https://github.com/dsnet/compress/blob/master/doc/bzip2-form...

[2]: https://datatracker.ietf.org/doc/html/rfc8478

yakubin 4 years ago |

From the "disclaimers" section:

> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.

> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.

That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?

[1]: C being the compression function, D being the decompression function, and o being function composition.
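Short of a proof, the usual empirical stand-in is a randomized round-trip test. The sketch below checks D(C(x)) == x for the codecs in Python's standard library; the input generator and function names are assumptions made for illustration, and passing such a test is evidence, not a proof.

```python
import bz2
import lzma
import random
import zlib

# (compress, decompress) pairs from the standard library.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def random_input(rng: random.Random, max_len: int = 4096) -> bytes:
    """Return either incompressible noise or one byte repeated many times."""
    n = rng.randrange(max_len)
    if rng.random() < 0.5:
        return bytes(rng.getrandbits(8) for _ in range(n))
    return bytes([rng.getrandbits(8)]) * n


def check_round_trips(trials: int = 200, seed: int = 0) -> list:
    """Return a list of (codec name, input) pairs for which D(C(x)) != x."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        data = random_input(rng)
        for name, (compress, decompress) in CODECS.items():
            if decompress(compress(data)) != data:
                failures.append((name, data))
    return failures


if __name__ == "__main__":
    print(check_round_trips() or "no round-trip failures found")
```

Random testing like this rarely reaches the "very low but non-zero probability" special cases the disclaimer talks about, which is why fuzzing with coverage feedback tends to be used on top of it.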

forgotpwd16 4 years ago | |

Cannot answer your question but, since you mentioned it, there's a mathematical specification of deflate (see: https://arxiv.org/abs/1609.01220).

asicsp 4 years ago |

Good work!

I was also confused with faster speed claims than bzip2, and then saw the discussion in the issue: https://github.com/kspalaiologos/bzip3/issues/2

roelschroeven 4 years ago | |

That discussion doesn't really clear up my confusion though.

I don't understand how bzip3 gets to claim "A better, faster and stronger spiritual successor to BZip2." when even all its own benchmarks show it's slower than bzip2?

palaiologos 4 years ago | | |

bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. additionally, bzip3 supports parallel compression/decompression out of the box. for fairness, the benchmarks have been performed using single thread mode, but they aren't quite as fair towards bzip3 itself, as it uses a way bigger block size.

what bzip3 aims to be is a replacement for bzip2 on modern hardware. what used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU Frequencies don't tend to change, while cache and RAM keep getting bigger and faster.

it should be noted that using block sizes 16 times larger than bzip2's and providing compression ratios up to 10%-50% better, at a cost of, as empirically shown, 17 seconds per 1.3GB of data, is a pretty good trade-off. if bzip2 wanted to get anywhere close to that (e.g. using the C API to tweak the block size), it'd have to sacrifice a lot of its performance.
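The effect of block size is easy to reproduce with bzip2 itself, since its compression level selects a block size between 100 kB and 900 kB. The sketch below uses Python's `bz2` module on synthetic data whose only redundancy sits 150 kB apart; the sizes, seed and function name are assumptions chosen so the repeats are invisible to a 100 kB block but visible to a 900 kB one.

```python
import bz2
import random


def block_size_demo(unit: int = 150_000, repeats: int = 4, seed: int = 0):
    """Compress repeated random data with small and large bzip2 blocks."""
    rng = random.Random(seed)
    # One incompressible unit, repeated: redundancy only exists across units.
    unit_bytes = rng.getrandbits(8 * unit).to_bytes(unit, "little")
    data = unit_bytes * repeats
    small_blocks = len(bz2.compress(data, compresslevel=1))  # 100 kB blocks
    large_blocks = len(bz2.compress(data, compresslevel=9))  # 900 kB blocks
    return len(data), small_blocks, large_blocks


if __name__ == "__main__":
    original, small_blocks, large_blocks = block_size_demo()
    print(f"input: {original} bytes")
    print(f"100 kB blocks: {small_blocks} bytes")
    print(f"900 kB blocks: {large_blocks} bytes")
```

This is a contrived best case for large blocks, not a benchmark; on real text the gain from bigger blocks is smaller and comes with the extra time and memory the thread is arguing about.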

easytiger 4 years ago | | |

And what does stronger mean? It's not cryptography.

Unless it is

williamkuszmaul 4 years ago |

One of the things that's cool about Bzip is that it makes use of algorithmic techniques developed by theoretical computer scientists in order to perform the Burrows-Wheeler Transform efficiently. It's a great example of theory and practice working symbiotically.

forgotpwd16 4 years ago |

>better, faster

If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus cannot call it better.

>spiritual successor to BZip2

What does that mean? If it isn't related to bzip2, why choose this name?

alerque 4 years ago | |

It is related to bzip2 in the sense of using the Burrows-Wheeler algorithm.

fefe23 4 years ago |

Hmm, I see LZ77, PPM and entropy coding in the description, and obviously Burrows-Wheeler.

Has anyone tried doing zstd at the end instead of LZ77 and entropy coding?

Does the idea even make sense? (I'm a layman)

iruoy 4 years ago |

So bzip2 and bzip3 focus on compressed size, lz4 on compression speed and zstd on decompression speed?

jeffbee 4 years ago | |

I don't know if that's really accurate. LZ4 is often faster on both sides, while usually having a larger compressed size. On most of the inputs listed at this benchmark, LZ4 is twice as fast for compression, 50% faster for decompression, while having a compressed size about 125% as large as Zstd. My rule of thumb is that zstd is good if you're going to store or transmit the result, while lz4 is the better choice if you're planning to compress and decompress exactly once without storing (i.e. as a transfer encoding between two network peers).

jkbonfield 4 years ago |

It doesn't compare itself against bsc, which feels a bit poor IMO given it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

On my own benchmarks, it's basically comparable size (about 0.1% smaller than bsc), comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.

palaiologos 4 years ago | |

You've literally tested it on a single file, enwik8. That's not enough to extrapolate valuable results. One of the benchmarks:

  time ./bsc e ../linux.tar linux.bsc -e2 -b16 -T
  68.69s user 1.14s system 99% cpu 117M memory 1:09.84 total

While bzip3 uses 98M, takes 1min 17s to produce a 129023171 byte file, compared to 127747834B from BSC. They're very similar except bzip3 tends to use less memory and decompresses a little slower. BSC is much more mature than bzip3 though, and the benchmarks might be subject to change some time in the future. Surprisingly, BSC code isn't really that robust (I reported a UB bug to libsais and had to pretty much rework the LZP code because it couldn't stand fuzzing).

jkbonfield 4 years ago | | |

Well yes it was one file, but it was stated as being good on text and enwik8 is a pretty standard test corpus for text compressors.

I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.

Have you tried making PRs back to libbsc github to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the leg work.

Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.

kstenerud 4 years ago |

There comes a point where the complexity itself becomes too much of a liability. It's important to be able to trust these algorithms as well as all popular implementations with your data.

mschuster91 4 years ago | |

One should verify the integrity of stuff like backups or archives anyway, by supplying the end user with a sha1 or better hash of both the compressed/encrypted archive as well as all of the files it contains, and by regularly verifying if both still match.

bell-cot 4 years ago | | |

Yes...though I'd say to rule out sha1, or any other "no longer considered secure" hashes. The space & time savings (vs, say, sha-512) are really not worth baking into your backup format & procedures. Keep in mind that you might need to really verify the integrity of your backup during a ransomware incident, or as part of a high-stakes legal situation, or ...
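As a rough sketch of that kind of check, the snippet below records a SHA-512 digest for a file and re-verifies it later, using only Python's standard library. The file name and helper names are hypothetical, and a real backup procedure would also store the digests somewhere an attacker or a failing disk cannot touch.

```python
import hashlib
import os
import tempfile


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large archives need little memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Return True if the file still matches the recorded digest."""
    return sha512_of(path) == expected_hex


def demo():
    """Record a digest, then check it before and after simulated corruption."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backup.tar.bz3")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"archive contents")
        recorded = sha512_of(path)
        ok_before = verify(path, recorded)
        with open(path, "ab") as f:
            f.write(b"!")  # simulate corruption or tampering
        ok_after = verify(path, recorded)
    return ok_before, ok_after


if __name__ == "__main__":
    print(demo())
```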

cestith 4 years ago | | |

Ideally, yes. Yet logrotate uses zlib to compress log files and deletes the originals. That's how trusted it is.

AceJohnny2 4 years ago |

Will bzip3 be added to the Squash benchmarks?

https://quixdb.github.io/squash-benchmark/

I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:

https://en.wikipedia.org/wiki/Calgary_corpus

the-alchemist 4 years ago |

I'm really interested in GPU-based compression / decompression.

Anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?

Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.

alerque 4 years ago | |

GPUs are good at massively parallel tasks. Compression is, almost by definition, not a parallel problem. If you want speed you can break it up into chunks, and if you are optimizing for throughput there are gains to be made there. But if you are optimizing for compression, the more chunks you break the task up into, the less opportunity you have to find ways to compress it. For example, a fast compression tool creating an archive of files might hand each file to a different thread, which gets the job done fast, but it will lose out on huge gains in compression if there are common parts to files that could have been compressed if they were processed as a single blob. GPUs are designed to do lots of small chunks of work in parallel; CPUs are better at doing bigger jobs faster.
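The cost of chunking can be shown with a small experiment. This sketch uses zlib from Python's standard library and synthetic data (one random 16 KiB block repeated eight times); the sizes, seed and function name are assumptions chosen so that all redundancy lies between chunks rather than inside them.

```python
import random
import zlib


def whole_vs_chunked(block_size: int = 16 * 1024, copies: int = 8, seed: int = 0):
    """Compare compressing one blob against compressing each chunk separately."""
    rng = random.Random(seed)
    block = rng.getrandbits(8 * block_size).to_bytes(block_size, "little")
    data = block * copies
    # One stream: later copies can refer back to the earlier ones.
    whole = len(zlib.compress(data, 9))
    # Independent chunks, as a parallel compressor might produce: each chunk
    # is compressed with no knowledge of the others.
    chunked = sum(
        len(zlib.compress(data[i:i + block_size], 9))
        for i in range(0, len(data), block_size)
    )
    return whole, chunked


if __name__ == "__main__":
    whole, chunked = whole_vs_chunked()
    print(f"single stream: {whole} bytes")
    print(f"independent chunks: {chunked} bytes")
```

The chunked variant could run on eight threads at once, but each chunk looks like pure noise in isolation, so nearly all of the achievable compression is lost. Real data is less extreme, which is why block-parallel tools pick fairly large chunks.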

oefrha 4 years ago |

Interesting, this seems to be a good replacement for xz if the benchmarks are representative.

joppy 4 years ago |

Why is there such a big disclaimer/warning on the front? Shouldn’t the program just check that decompress(compress(x)) = x as it goes, and then it can be sure that compress(x) has not lost any data?

palaiologos 4 years ago | |

no compressor tests the output while compressing as it hurts the performance. you can do it after compressing, though, using `bzip3 -t`.

joppy 4 years ago | | |

Right, but I would probably test the output while compressing instead of putting a big USE AT YOUR OWN PERIL sign across the front…

72deluxe 4 years ago |

I use pbzip2 with gusto because the original bzip2 is single-threaded. I heartily recommend it to all I meet, even those in the street!

rurban 4 years ago |

Can be easily improved by using the HW crc32, it's just SW crc32.

themusicgod1 4 years ago |

> Github link

...so long as this lives in NSA/Microsoft Github, it's not a 'spiritual successor' to anything.