It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing sales pitch given bzip2 is mostly of historic interest these days.
Right now the modern compression field is basically covered by xz (if you mostly care about best compression ratio) and zstd (if you want decent compression and very good speed), so when someone wants to pitch a new compression they should tell me where it stands compared to those.
wget http://corpus.canterbury.ac.nz/resources/calgary.tar.gz
zcat calgary.tar.gz|time zstd -19|wc -c
902963 (=902.9KB, vs. 807.9KB for bzip3)
> "Better than bzip2" isn't exactly a convincing sales pitchSure, but nobody is pitching that. TFA does comapre to lzma (~xz), and claims bzip3 outperforms it quite handsomely in speed, while being competitive in compression ratio.
Original file: 129MB xz, 1.2G uncompressed.
"zstd -T0": 1.34 seconds, 189M
"xz -T0": 63 seconds, 131M
"xz -T0 -9": 183 seconds, 125M
"bzip3 -e -j 6": 21 seconds, 129M (edited, was SIGSEGV)
"bzip3 -e": 84 seconds, 129M
I used linux source because the source website uses linux and recommends bzip3 for compressing source and text. Results were on Ubuntu 22.04, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
Compressing logs for instance, decompression speed of 23MB/s per core, is simply too slow when you need to grep through gigabytes of data. Same for data analysis, you don't want your input speed to be this limited when analysing gigabytes of data.
I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manner to take over a name like this.
I think it boils down to the feelings of the author (of the previous format).
I don't think PKWARE feels bad because ZSTD is a homage to ZIP. Similarly if someone created a follow-up file format to something I've designed, I'd just want credit or a link to my version as a homage and pointer for history continuity, nothing else.
Open source software is designed to be mangled, modified, shared and leapfrogged. If a completely different implementation advertises itself as a newer iteration of a format because it's built on the same theory, I think it's ethical if the developer is not intending to capitalize name. Either case, if the original developer returns to the game, it can create a BZIP4 and points to the diversion as, "hey, somebody liked BZIP2 too much and created this, give him/her a kudos", and continue.
The original author replaced their own bzip release with bzip2 to avoid a patent issue with arithmetic coding, so this is the first time a third party has done so: https://web.archive.org/web/19980704181204/http://www.muraro...
So if the release of bzip3 is approved by the current maintainer, then I guess it’s fine, but otherwise it makes me uncomfortable to consider using under this name.
I agree in spirit, but I can also see why someone might want their source to be free and mangleable, but still care about trademarks.
(Just imagine Linus Torvalds getting lots of emails with support requests for a hypothetical Linux2 operating system that I wrote, and that he has no relation with. That could become pretty annoying; even if he doesn't mind me taking his source code.)
Is zstd actually an homage to zip?
I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.
The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format. As do two very old, obsolete Unix compression programs: "pack"[3], which is uses a ".z" suffix for compressed files, and "compress"[4], which uses a ".Z" suffix for compressed files.
In those algorithm names, the L and Z stand for Lempel and Ziv, respectively. But interestingly, the Unix "pack" program uses a ".z" suffix even though its algorithm is just Huffman (not one of the Lempel-Ziv family of algorithms), so the letter Z somehow came to signify data compression more generally.
Rough timeline of letter Z in data compression:
1977: LZ77
1978: LZ78
1982 or earlier: Unix "pack" (.z)
1984: LZW
1985: Unix "compress" (.Z)
1989: PKZIP
---
[1] https://en.wikipedia.org/wiki/LZ77_and_LZ78
[2] https://en.wikipedia.org/wiki/Lempel-Ziv-Welch
$ sudo journalctl -r | pv -a > /dev/null
[22.8MiB/s] $ dd if=/dev/urandom of=test bs=1G count=1 iflag=fullblock
$ gzip -k test
$ zcat test.gz | pv -a >/dev/null
[ 228MiB/s]
$ sudo journalctl -r | pv -a >/dev/null
[13.1MiB/s]
UPDATE: Gzip with more real-world data[1]: $ gzip -k adventures-of-huckleberry-finn.txt
$ zcat adventures-of-huckleberry-finn.txt.gz | pv -a >/dev/null
[ 151MiB/s]
[1]: <https://gutenberg.org/files/76/76-0.txt>So lz4 has some base advantages in terms of speed, which zstd is unlikely to match. But as you point out it is only relevant for very high speed operations.
It hasn't been used lately because of the computational overhead, but it's interesting and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
And here a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks
edit: ah, bzip3 is parallelizable, while bzip2 isn't. That alone is enough for me to be able to claim 'faster'.
Right now I'm almost exclusively using zstd (general stuff) or lzma2/xz (high compression where read speed doesn't matter). And of course gz and zip for data interchange where compatibility is key. From the information presented bzip3 won't replace any of those use cases for me, but that's fine. Maybe it fits somebody else's use case, or maybe it's the foundation for the next great algorithm that we all end up using.
% wc -c linux.tar.zst linux.bz3 134980904 linux.tar.zst 129255792 linux.bz3
2. Bzip2 is somewhat is a standard
3. zStandard is not a substitute for Bzip2
When I evaluated various compression algorithms a few years ago zstd came ahead of bzip2 in every metric.
[1]: https://github.com/dsnet/compress/blob/master/doc/bzip2-form...
> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.
> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.
That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?
[1]: C being the compression function, D being the decompression function, and o being function composition.
I was also confused with faster speed claims than bzip2, and then saw the discussion in the issue: https://github.com/kspalaiologos/bzip3/issues/2
I don't understand how bzip3 gets to claim "A better, faster and stronger spiritual successor to BZip2." when even all its own benchmarks show it's slower than bzip2?
what bzip3 aims to be is a replacement for bzip2 on modern hardware. what used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU Frequencies don't tend to change, while cache and RAM keep getting bigger and faster.
it should be noted that while using 16 times larger block sizes than bzip2 while providing compression ratios up to 10%-50% better at a cost of, as empirically shown, 17 seconds per 1.3GB of data, is a pretty good trade-off and if bzip2 wanted to get anywhere close to that (e.g. using the C API to tweak the block size), it'd have to sacrifice a lot of its performance.
Unless it is
If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus cannot call it better.
>spiritual successor to BZip2
What does that mean? If it isn't related to bzip2, why choose this name?
Has anyone tried doing zstd at the end instead of LZ77 and entropy coding?
Does the idea even make sense? (I'm a layman)
On my own benchmarks, it's basically comparable size (about 0.1% smaller than bsc), comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.
Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.
time ./bsc e ../linux.tar linux.bsc -e2 -b16 -T
68.69s user 1.14s system 99% cpu 117M memory 1:09.84 total
While bzip3 uses 98M, takes 1min 17s to produce a 129023171 byte file, compared to 127747834B from BSC. They're very similar except bzip3 tends to use less memory and decompresses a little slower. BSC is much more mature than bzip3 though, and the benchmarks might be a subject to change some time in the future. Surprisingly, BSC code isn't really that robust (I reported a UB bug to libsais and had to pretty much rework the LZP code because it couldn't stand fuzzing).I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.
Have you tried making PRs back to libbsc github to fit the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the leg work.
Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.
https://quixdb.github.io/squash-benchmark/
I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:
Anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?
Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting take over.
...so long as this lives in NSA/Microsoft Github, it's not a 'spiritual successor' to anything.
--grep=
That's exactly what I needed to know! I'm glad I asked the stupid question. Thank you!There is no shortage of compression tools that are better in this or that. At the very least this is a fun engineering exercise, but there is always a massive inertia propblem regarding install base with compression, esp. for data-at-rest. gzip is still used by default in so many contexts, and that's not such a bad thing IMHO.
xz -C sha256 will use sha256 on the uncompressed data.
Personally, I use mtree inside the backup file (from pkgsrc bootstrap on Linux), although it has trouble with a few unicode file names. That kind of tool is great and I'm not sure why there doesn't seem to be an equivalent in the Linux world.
I can see how bzip3 is better able to exploit the characteristics of modern hardware, but that's not enough to call it faster. The proof of the pudding is in the eating, and if the benchmarks show bzip3 is slower than bzip2 than bzip3 is clearly not faster (but perhaps "aiming to be faster", with some work to be done before it reaches that goal).
Better compression at a slightly slower pace can indeed be a good trade off. I'm not saying bzip3 doesn't work well. What I'm saying is that the "faster" in its description "A better, faster and stronger spiritual successor to BZip2" is not supported by the evidence.
When I quickly looked, bzip2 didn't have a trademark, and I assumed the developer don't care.
It's same for me. If I care, I'd trademark it, and prevent people from using it. If I'm giving the name away, I'd not trademark it. It's plain and simple.
I was perhaps a bit fuzzy: I used the word trademark, but I meant the moral equivalent that only confers moral rights and obligations. Ie don't be a jerk and name your stuff in a confusing way, even if there's no legal obligation.
Even this fuzzier 'moral' trademark only applies, if the original owner wants it to apply. I have no special insight into the bzip2 situation. So I have no clue whether the authors of bzip or bzip2 cared. I gladly defer to your research on this question.
My point was meant purely abstract about the possibility of wanting your source to be free and open and fondled by others, but still have a (morally) protected name.
In practice it works, but it isn't pretty ;)
(context: I have had a situation where files created by pbzip2 on linux were not able to be decompressed with some library on .NET, but using lbzip2, they were. I never looked into the details.)
With real data, deflate maxes out somewhere around there either way, but that is a bit coincidental.
With modern CPUs getting increasingly smaller IPC improvements this will likely be pretty much the max decompression speed we can expect from gzip going forward.
I was getting the same numbers with text data I have scattered on my disk, but those were small, so I decided to generate a bigger file. But, yes, I agree a more robust benchmark would use a Mark Twain novel e.g.
I think so. Or more generally, DEFLATE family of algorithms used by zlib and ZIP.
> I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.
> The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format.
I'm well aware of the Ziv, Lempel & Welch's work, since I've developed a compression algorithm [*] and did extensive research on the family before, however the usage of extension is a rather new information for me.
On the other side, official repository of zstd [0] lists zlib and other libraries which use DEFLATE in one way other, pointing back to Ziv's ZIP at the same time.
At the end of the day, it might be tipping its hat directly to "ZIP" per se, but to general direction of DEFLATE which is used by ZIP format too.
[*]: The algorithm I developed was working on syllables rather than bytes. It had a deterministic and fast hyphenation engine for the language, and used embedded dictionary to minimize bit-flip damage during transit. I have paper published from it, and still planning to re-implement and open it, since the PoC was beyond bad from a code quality perspective.
I could've been clearer, but when I said the only connection I saw between them was the Z, I meant the only connection between their names. In other words, while zstd surely has zip among its main influences, I'm not sure that the Z in "zstd" is there to convey "zip-like". It could be, or maybe it's just there because Z is generally associated with data compression.
The author of lzip has harsh criticism of xz, and admiration of bzip2 for error detection/correction and "rightsizing" the container format.
I use lzip in preference to xz unless I need portability.
https://indico.fnal.gov/event/16264/contributions/36466/atta...
You can see the classic Pareto frontier, with LZ4 filling the niche at the very bottom right edge of the graph.
P.7 of your slides doesn't seem to cover Zstd --fast or decompression. It would be interesting to see how Zstd performs in those modes.
(An interesting anecdote about the differing notions of compressibility - when I recently wrote something to do a clever dance to avoid burning a great deal of CPU on incompressible data with higher zstd levels, I ended up using LZ4 -> zstd-1 as a two-tier filter to catch incompressible data, because what they each thought was incompressible was different enough that only using LZ4 lost a significant amount of compression sometimes, but only using zstd-1 was comparatively expensive and also lost a significant fraction.)
# compression
bzip3 -j 4 -e linux-5.18-rc6.tar linux-5.18-rc6.tar.bz3
user: 345.48s system: 0.59s cpu: 373% total: 1:32.75
zstd -19 --long -T4 -f linux-5.18-rc6.tar
user: 1270.48s system: 0.89s cpu: 376% total: 5:37.9
> du -b linux-5.18-rc6.tar.* | sort -rn | reln
1.000000 130907738 linux-5.18-rc6.tar.zst
0.994715 130215881 linux-5.18-rc6.tar.bz3
With additional ‘--ultra -22’ tar.zst is smaller, but the compression time sky rockets. # decompression
bzip3 -j 4 -d linux-5.18-rc6.tar.bz3 linux-5.18-rc6.tar
user: 222.57s system: 0.92s cpu: 362% total: 1:01.69
bzip3 -d linux-5.18-rc6.tar.bz3 linux-5.18-rc6.tar
user: 141.29s system: 0.89s cpu: 99% total: 2:22.19
zstd -d -T4 -f linux-5.18-rc6.tar.zst
user: 2.26s system: 0.84s cpu: 99% total: 3.102
zstd doesn’t seem to support parallel decoding, but still 20x faster"bzip3 -e -j 6 -b 50": 25 seconds, 125MB
So nearly as good as the best of xz, but in a 20th the time.
However: Do note that any unexpected use is met with a SIGSEGV: using as a filter, using "-j6" instead of "-j 6", not specifying "-e"...
% bzip3 -e -j 6 -b 50 corpus/calgary.tar
% bzip3 -j 6 -b 50 corpus/calgary.tar
bzip3 - A better and stronger spiritual successor to bzip2.
Copyright (C) by Kamila Szewczyk, 2022. Licensed under the terms of GPLv3.
Usage: bzip3 [-e/-d/-t/-c] [-b block_size] input output
Operations:
-e: encode
-d: decode
-t: test
Extra flags:
-c: force reading/writing from standard streams
-b N: set block size in MiB
-j N: set the amount of parallel threads
you can use bzip3 as a filter: % cat corpus/calgary.tar | bzip3 -b 10 -e -c | wc -c
807959
and using "-j6" is simply being unable to read the help page. } else if (argv[i][1] == 'j') {
workers = atoi(argv[i + 1]);
i++;
If the last argument is "-j6" then this will read past the end of the allocated argv strings and try to do atoi(NULL): % ./bzip3 -j3 < README.md
Segmentation fault
"-j6" is standard getopt() behavior, and the default expected behavior from Unix/POSIX systems. sec KB
gzip 0.19 1070
zstd 0.02 1063
zstd -19 1.47 897
bzip2 0.35 891
brotli 8.12 862
xz 1.35 853
bzip3 0.52 808"xz -d": 7.92 seconds
"bzip3 -d" as filter: SEGSEGV
"bzip3 -d linux-5.17.6.tar.bz3": 81 seconds
"bzip -d -j 6": 21.35 seconds (edit)
Lest I be called a liar again:
$ ./bzip3-1.0.1/bzip3 -d <linux-5.17.6.tar.bz3 >/dev/null
fish: Job 1, 'time ./bzip3-1.0.1/bzip3 -d <li…' terminated by signal SIGSEGV (Address boundary error)