Xz format inadequate for long-term archiving (2016)

The recovery instructions for gzip are simply to run "zcat corrupt_file.gz > corrupt_file", while for bzip2 they are to use the bzip2recover command, which just dumps the blocks out individually (corrupt ones and all).

Uncompressing the corrupt gzip JPEG file via zcat resulted, every time, in an image file the same size as the original, which could be opened with any image viewer although the colors were clearly off.

I never could recover the image compressed with bzip2. Trying to extract all the recovered blocks made by bzip2recover via bzcat would just choke on the single corrupted block. And the smallest you can make a block is 100K (vs 32K for gzip?). Obviously pulling 100K out of a jpeg will not work.

Though I'm still confused as to how the corrupted gzip file extracted to a file of the same size as the original. I guess gzip writes out the corrupted data as well instead of choking on it? I guess gzip is the winner here. Having a file with a corrupted byte is much better than having a file with 100K of data missing...
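One plausible explanation, sketched here under assumptions rather than taken from the gzip sources: the decoder emits output as it goes, and the CRC32 stored in the gzip trailer can only be checked at the very end. The snippet below uses Python's `zlib` to imitate that behaviour. The helper name `salvage_gzip` is made up for illustration, and `compresslevel=0` is an assumption chosen to force stored (verbatim) deflate blocks, roughly standing in for already-compressed JPEG data that deflate cannot shrink much.

```python
import gzip
import zlib


def salvage_gzip(blob):
    """Decompress as far as possible; return (data, clean).

    `clean` is False when zlib reported an error, e.g. a CRC mismatch.
    """
    inflater = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
    out = bytearray()
    try:
        # Feed one byte at a time so output produced before an error is kept.
        for i in range(len(blob)):
            out += inflater.decompress(blob[i:i + 1])
        out += inflater.flush()
    except zlib.error:
        return bytes(out), False
    return bytes(out), inflater.eof


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    # Level 0 is an assumption: it makes deflate store the data verbatim.
    blob = bytearray(gzip.compress(payload, compresslevel=0))
    pos = bytes(blob).index(payload[:32]) + 500  # a byte inside the payload
    blob[pos] ^= 0xFF
    data, clean = salvage_gzip(bytes(blob))
    changed = sum(a != b for a, b in zip(data, payload))
    print(len(data) == len(payload), clean, changed)
```

With stored blocks, a damaged byte in the compressed file maps to exactly one damaged byte in the output, and the error only surfaces when the trailer is checked. With real Huffman-coded blocks the damage can spread further, so this is the friendly case, not a guarantee.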

gmueckl 7 years ago | |

Your method is clearly flawed. Altering a single byte once is insufficient as a test unless you analyzed the structure of the compressed file first to see where the really important information is stored. It may well be that you just modified a verbatim string from the source data in the gzip case, but corrupted a bit of metadata about how the compressed data is structured in the bzip2 case. If you tried a different random byte, the results might be reversed.

The proper test would be to iterate over every bit in the compressed file, flip it and try to recover. Then compute the number of successful recoveries against the number of bits tested. Compression algorithms that perform similarly should have similar likelihoods that a single bit flip corrupts the entirety of the data.
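A minimal sketch of that exhaustive test, using Python's standard `gzip`, `bz2` and `lzma` modules in place of the command-line tools. Two things here are assumptions and not part of the proposal above: the function name `bitflip_survey`, and the three-way classification, where "detected" means the strict decoder raised an error, "intact" means the original data came back unchanged, and "silent" means different data came back without any error. Strict decoders give up on the first error, so this measures detection, not what a recovery tool could still salvage.

```python
import bz2
import gzip
import lzma
from collections import Counter


def bitflip_survey(compress, decompress, payload):
    """Flip every bit of the compressed payload once and tally the outcomes."""
    blob = compress(payload)
    outcomes = Counter()
    for bit in range(len(blob) * 8):
        damaged = bytearray(blob)
        damaged[bit // 8] ^= 1 << (bit % 8)  # flip exactly one bit
        try:
            result = decompress(bytes(damaged))
        except Exception:
            outcomes["detected"] += 1  # decoder refused the damaged stream
        else:
            outcomes["intact" if result == payload else "silent"] += 1
    return outcomes


if __name__ == "__main__":
    sample = b"The quick brown fox jumps over the lazy dog. " * 20
    for name, mod in (("gzip", gzip), ("bzip2", bz2), ("xz", lzma)):
        print(name, dict(bitflip_survey(mod.compress, mod.decompress, sample)))
```

Keep the sample small: the loop decompresses once per bit, so the cost grows with the size of the compressed file times the cost of one decompression.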

esaym 7 years ago | | |

I thought about that as well. I tried it three different times all with the same results.

wereHamster 7 years ago | |

Whether recovery leads to (almost) useable data depends on what byte you modify. It's entirely possible that a single corrupt byte in the compressed data leads to a single corrupt byte when uncompressed. When you are dealing with images you may not even notice that a single pixel is wrong. But it's also possible that you completely destroy the data such that the decompression algorithm can't even deal with it and has to give up.

chasil 7 years ago | | |

A decade and a half ago, I wrote an Oracle archived log that I had compressed with bzip2 to a DLT40 tape.

I recovered and uncompressed (without error) the log, then tried to apply it to a database recovery which rejected it as corrupt.

After several attempts to read the tape (amounting to dozens of hours), I finally put it in the original drive that wrote it and pulled the file to the remote recovery system - this worked.

I immediately began including PAR2 files on the tapes, so the restored contents could be verified and corrected.

I have my doubts that bzip2 is as sensitive to corruption as the author asserts, but perhaps there have been improvements to the code since my misfortune.

xoa 7 years ago |

Not that many of the complaints aren't reasonable, but I thought that in general compression/format was orthogonal to parity, which is what I assume is actually wanted for long-term archiving? I always figured that the goal should normally be to get back out a bit-perfect copy of whatever went in, using something like Parchive at the file level or ZFS for online storage at the fs level. I guess on the principle of layers and graceful failure modes it's better if even sub-archives can handle some level of corruption without total failure, and from a long-term perspective of implementation independence simpler/better specified is preferable, but that still doesn't seem to substitute for just having enough parity built in to both notice corruption and fully recover from it to fairly extreme levels.
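To make the "notice and recover" idea concrete, here is a toy sketch, not how Parchive or ZFS actually work. Parchive uses Reed-Solomon codes that can survive several damaged blocks; the single XOR parity block below is a deliberately simpler stand-in that can rebuild only one. Per-block CRC32 values locate the damage, and the parity block repairs it. The names `protect` and `repair` and the 16-byte block size are assumptions for illustration.

```python
import zlib

BLOCK = 16  # hypothetical block size, tiny so the example stays readable


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def protect(data, block=BLOCK):
    """Split data into zero-padded blocks; return (blocks, crcs, parity)."""
    padded = data + bytes(-len(data) % block)
    blocks = [padded[i:i + block] for i in range(0, len(padded), block)]
    crcs = [zlib.crc32(b) for b in blocks]
    parity = bytes(block)
    for b in blocks:
        parity = xor_bytes(parity, b)
    return blocks, crcs, parity


def repair(blocks, crcs, parity):
    """Return a repaired copy of blocks; at most one block may be damaged."""
    bad = [i for i, b in enumerate(blocks) if zlib.crc32(b) != crcs[i]]
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("more damage than one parity block can fix")
    rebuilt = parity
    for i, b in enumerate(blocks):
        if i != bad[0]:
            rebuilt = xor_bytes(rebuilt, b)  # XOR out every healthy block
    fixed = list(blocks)
    fixed[bad[0]] = rebuilt
    return fixed
```

The point of the layering argument is visible here: none of this cares whether the protected bytes are xz, gzip or raw data.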

Adamantcheese 7 years ago |

How about something like ZPAQ instead for archiving? Especially if you're doing backups and not a lot of the information is changing.

ltbarcly3 7 years ago |

No file format is perfect. I've been using xz for years and I can't think of a single issue I have had. The compression rate is dramatically better than gzip or bzip2 for many types of archives (especially when there is a lot of redundancy; for example, when compressing spidered web pages from the same site you can get well over 99% size reduction compared to 70% reduction for gzip, which means using less than one 30th of the disk space).
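The "one 30th" figure follows from comparing what is left over, not the reductions themselves. A small worked check, using only the two percentages quoted above (the function name `remaining_percent` is made up for illustration):

```python
def remaining_percent(reduction_percent):
    """Size left after compression, as a percentage of the original."""
    return 100 - reduction_percent


xz_left = remaining_percent(99)    # 1% of the original size remains
gzip_left = remaining_percent(70)  # 30% of the original size remains
ratio = gzip_left / xz_left        # how many times larger the gzip output is

if __name__ == "__main__":
    print(f"gzip output is {ratio:.0f}x the size of the xz output")
```

Since the xz reduction is described as "well over" 99%, the real ratio would be larger than this.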

Lately I have been using zstd for some things since it gives good compression and is much faster than xz.

This criticism of xz just seems nitpicky and impractical, especially if you are compressing tar archives and/or storing the archives on some kind of RAID which can correct some read errors (such as RAID 5).

asveikau 7 years ago |

I remember seeing this article before. This time the reaction that surges for me is: if you want long-term archiving but don't assume redundant storage, it's not going to go well. Put your long-term archives on ZFS.

StavrosK 7 years ago | |

Why are you assuming they aren't assuming redundant storage? Redundant storage isn't a cure-all, there's still a chance two blocks on two disks will fail in the exact same spot.

asveikau 7 years ago | | |

Seems easier to increase the number of disks and address it at a low layer than to re-engineer all layers, all file formats, for corruption.

kipari 7 years ago | | |

I reckon that the chance of the same two blocks on two different disks failing between ZFS scrubs would be incredibly small.

mkj 7 years ago |

A bit of speculation here, but perhaps xz won over lzip because it has a real manpage?

lzip has the usual infuriating short summary of options with a "run info lzip for the complete manual". Also the source code repository doesn't even seem linked directly from the lzip homepage - technical considerations aren't the only thing that determines if software is "better", it also has to be well presented.

shmerl 7 years ago |

xz-utils should implement parallel decompression already. pixz is doing it, but stock xz is not. Most end users benefit from faster decompression.

SEJeff 7 years ago |

This should have (2016) in the title.

Thoreandan 7 years ago | |

and "from the author of lzip, a competing lzma library that never went viral".

Welcome to the Better Technology that Shoulda Made It bench. Your seat's over there next to OS/2, BeOS, and OpenGenera.

nkoren 7 years ago | | |

Amiga forever!!!!!!

LinuxBender 7 years ago |

If you first use tar to preserve xattrs etc., then you can use anything to compress: xz, bz2, 7z, even arj if you are feeling nostalgic.

    tar cvfJ ./files.tar.xz /some/dir

zamalek 7 years ago | |

You've missed the point of the article entirely. A single bit-flip (which is almost guaranteed over long-term) can easily render the entire xz file corrupt.

This has nothing to do with xattrs/etc.

LinuxBender 7 years ago | | |

Yes, I am totally on auto-pilot today. I'm used to a different article that gets re-posted often about xz and my browser blocks non-https sites so I assumed it was that other article.

That said, I use xz in automation that compresses files on one end and decompresses on the other. I've not had any file corruption thus far. Checksums always match. Hopefully the author has submitted bug reports and ways to reproduce.
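A rough sketch of that kind of round trip with an end-to-end checksum, using Python's `lzma` and `hashlib` modules instead of the xz command line. The function names `pack` and `unpack` and the choice of SHA-256 are assumptions; the automation described above may do this differently.

```python
import hashlib
import lzma


def pack(data):
    """Sender side: return (xz_blob, SHA-256 hex digest of the original data)."""
    return lzma.compress(data), hashlib.sha256(data).hexdigest()


def unpack(blob, expected_digest):
    """Receiver side: decompress, then verify against the sender's digest."""
    data = lzma.decompress(blob)  # raises lzma.LZMAError on a damaged stream
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise ValueError("checksum mismatch after decompression")
    return data


if __name__ == "__main__":
    blob, digest = pack(b"example payload " * 100)
    print(unpack(blob, digest) == b"example payload " * 100)
```

A matching checksum shows the transfer was clean. It says nothing about how much could be salvaged after damage, which is the article's actual complaint.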

imiric 7 years ago | | |

I've been using tar+pixz+par2 for backups for a while now, but this article still worries me.

What can I replace pixz with that compresses as well and keeps the indexing functionality? I'd like to avoid zstd because Facebook.

microcolonel 7 years ago |

> "3 Then, why some free software projects use xz?"

Because the files are usually smaller than gzip, with faster decompression than bzip2, and the library is available on most systems.

h1d 7 years ago | |

Archiving for distribution and archiving for backups are very different things. You don't care if a compressed file for app distribution gets corrupted; you just compress it again. But for compressed backup files you usually don't have another copy of the source to fall back on.

I wouldn't use any unreliable format for backups. I picked bzip2 for stability and compression rate.

microcolonel 7 years ago | | |

In my opinion, the compressor is not the right place to add data integrity mechanisms, especially since data integrity mechanisms only really apply to particular media. Data on hard drives doesn't get corrupted in the same way as data on TLC SSDs, and generally on the latter you're better off with redundancy and diversification than with inline error correcting codes.

Honestly, I don't see why xz should have any of its own data integrity mechanisms whatsoever, except maybe a whole-archive CRC32 or similar.