Xz format inadequate for long-term archiving (2016) (lzip.nongnu.org)
https://gcc.gnu.org/ml/gcc/2017-06/msg00044.html
https://lists.debian.org/debian-devel/2017/06/msg00433.html
It was a bit bizarre when he hit the Octave mailing list.
Eventually, people just wanted xz back:
http://octave.1599824.n4.nabble.com/opinion-bring-back-Octav...
I took a copy of a jpeg image, compressed it several times with either gzip or bzip2, then modified one byte with a hex editor.
The recovery instructions for gzip are to simply do "zcat corrupt_file.gz > corrupt_file", while for bzip2 they are to use the bzip2recover command, which just dumps the blocks out individually (corrupt ones and all).
Uncompressing the corrupt gzip jpeg file via zcat always resulted in an image file the same size as the original, and it could be opened with any image viewer, although the colors were clearly off.
I never could recover the image compressed with bzip2. Trying to extract all the recovered blocks made by bzip2recover via bzcat would just choke on the single corrupted block. And the smallest you can make a block is 100K (vs 32K for gzip?). Obviously pulling 100K out of a jpeg will not work.
Though I'm still confused as to how the corrupted gzip file extracted to a file of the same size as the original. I guess gzip writes out the corrupted data as well instead of choking on it? I guess gzip is the winner here. Having a file with a corrupted byte is much better than having a file with 100K of data missing...
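On the question of whether gzip writes out corrupted data instead of choking on it: a minimal Python sketch, assuming the standard `zlib` module behaves like the zcat described above in this respect. The gzip CRC32 sits in the trailer, so a streaming decoder has already emitted the data by the time the check can fail. `salvage_gzip` is a hypothetical helper name, not part of any library.

```python
import gzip
import zlib


def salvage_gzip(packed, chunk_size=1):
    """Decode a gzip stream piece by piece, keeping whatever came out before an error."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31 tells zlib to expect gzip framing
    recovered = bytearray()
    error = None
    for start in range(0, len(packed), chunk_size):
        try:
            recovered += decoder.decompress(packed[start:start + chunk_size])
        except zlib.error as exc:
            # Output from earlier chunks is already in `recovered`;
            # only the chunk that triggered the error is lost.
            error = exc
            break
    return bytes(recovered), error


if __name__ == "__main__":
    payload = bytes(i % 251 for i in range(1000))
    # compresslevel=0 keeps the data in stored (uncompressed) deflate blocks,
    # an assumption chosen here so that one damaged byte stays one damaged byte.
    packed = bytearray(gzip.compress(payload, compresslevel=0))
    packed[len(packed) // 2] ^= 0xFF
    recovered, error = salvage_gzip(bytes(packed))
    print(len(recovered), len(payload), error)
```

With real compressed (non-stored) blocks the damage usually spreads further than one byte, so this only illustrates why the output can end up the same size as the original.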
The proper test would be to iterate over every bit in the compressed file, flip it and try to recover, then compare the number of successful recoveries against the number of bits tested. Compression algorithms that perform similarly should have similar likelihoods that a single bit flip corrupts the entirety of the data.
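The test described above can be sketched in Python using the standard library codecs as stand-ins for the command-line tools. This is only a sketch: `bit_flip_survival` and its `stride` parameter are hypothetical names, and it measures whether plain decompression succeeds, not what a dedicated recovery tool could salvage.

```python
import bz2
import gzip
import lzma


def bit_flip_survival(compress, decompress, payload, stride=1):
    """Flip one bit at a time in the compressed payload and tally what happens."""
    packed = compress(payload)
    counts = {"intact": 0, "silently_wrong": 0, "rejected": 0}
    # stride > 1 samples every stride-th bit to keep the run short.
    for bit in range(0, len(packed) * 8, stride):
        damaged = bytearray(packed)
        damaged[bit // 8] ^= 1 << (bit % 8)
        try:
            result = decompress(bytes(damaged))
        except Exception:
            # The decoder noticed the damage (bad header, bad checksum, ...).
            counts["rejected"] += 1
            continue
        if result == payload:
            counts["intact"] += 1  # the flipped bit did not affect the data
        else:
            counts["silently_wrong"] += 1  # wrong output and no error raised
    return counts


if __name__ == "__main__":
    sample = bytes(range(256)) * 8
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    for name, (pack, unpack) in codecs.items():
        print(name, bit_flip_survival(pack, unpack, sample, stride=7))
```

Dividing each tally by the number of bits tested gives the ratio the comment asks for; a larger and more realistic payload would be needed before drawing conclusions about any format.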
I recovered and uncompressed (without error) the log, then tried to apply it to a database recovery which rejected it as corrupt.
After several attempts to read the tape (amounting to dozens of hours), I finally put it in the original drive that wrote it and pulled the file to the remote recovery system - this worked.
I immediately began including PAR2 files on the tapes, so the restored contents could be verified and corrected.
I have my doubts that bzip2 is as sensitive to corruption as the author of the article asserts, but perhaps there have been improvements to the code since my misfortune.
Lately I have been using zstd for some things since it gives good compression and is much faster than xz.
This criticism of xz just seems nitpicky and impractical, especially if you are compressing tar archives and/or storing the archives on some kind of RAID which can correct some read errors (such as RAID 5).
lzip has the usual infuriating short summary of options with a "run info lzip for the complete manual". Also the source code repository doesn't even seem linked directly from the lzip homepage - technical considerations aren't the only thing that determines if software is "better", it also has to be well presented.
Welcome to the Better Technology that Shoulda Made It bench. Your seat's over there next to OS/2, BeOS, and OpenGenera.
tar cvfJ ./files.tar.xz /some/dir
This has nothing to do with xattrs/etc.
That said, I use xz in automation that compresses files on one end and decompresses on the other. I've not had any file corruption thus far. Checksums always match. Hopefully the author has submitted bug reports and ways to reproduce.
What can I replace pixz with that compresses as well and keeps the indexing functionality? I'd like to avoid zstd because Facebook.
Because the files are usually smaller than gzip, with faster decompression than bzip2, and the library is available on most systems.
I wouldn't use any unreliable format for backups. I picked bzip2 for stability and compression rate.
Honestly, I don't see why xz should have any of its own data integrity mechanisms whatsoever, except maybe a whole-archive CRC32 or similar.
The src/ tree of xz is 335k (compressed with gzip). If you are worried future digital historians won't be able to figure out the xz format, throw a copy of the gzip'd source onto every drive you store archives on, it would basically be free and would almost guarantee they would have a complete copy of exactly what they would need to decompress the files.
Again, this is shooting the breeze a bit; the article is discussing a case where there should be the freedom to choose better formats. But for a lot of important archive material, including software itself, are we getting to the point where many long-term archives should simply include everything necessary to deal with them in the present day as a container or VM image, which is then stored with a solid amount of parity and replication?
Unfortunately, any such image would presume you have access to the hardware, or it has low-level instruction sets/processor design baked in. Think how many PDP-11's are around today. And in terms of an archive it's only been 50 years since the PDP-11 was invented. That's a blink of an eye in terms of archival standards.
Why does it matter how many physical machines are alive? There are tons of emulators around. There is even one in Javascript, with an ability to load disk images as well.
The hard part is hardware - the drives go bad, the computers fail. But disks grow, and it is getting simpler and cheaper to store lots of data. As long as you keep copying the files to modern media every 10 years or so, you should no longer have any data loss.
(The only exception is proprietary data formats which cannot be opened except by the original program, which cannot easily be run in a VM. Those should be avoided at all costs.)
There are a number of cases where failures might not be independent, though.
What if, say, you're using multiple drives of the same model, which have a firmware bug causing them to sometimes mangle data on the Nth sector?
What if you're using multiple drives from the same manufacturing batch which have a flaw leading to certain regions being more likely to fail than others?
What if you're using some battery-backed write cache under ZFS (from a HW RAID card or something more exotic), and it helpfully writes out garbage to the same sector on two disks?
What if you have a certain manufacturer's hard drives that lie about flushing their write cache successfully to disk if you issue a SMART request to them between when they put data in cache and when it actually gets to disk, so polling those two disks when they both just got a write results in data loss?
(The last of these is a real firmware bug I ran into - I was running a testbed of a bunch of raidz3 vdevs, and spent some time isolating when zpool scrub kept making the error counters increase even though it had corrected them all...thanks, Samsung HD204UI drives.)
This comes up on the linux raid list with some frequency whenever there are drive failures with raid56, and subsequently the raid trips over a single bad sector.
But it's true that lack of scrubbing contributes to this scenario, as well as the terrible combination of consumer drives with very high bad sector recovery times and the Linux SCSI command timer default of 30 seconds. That combination ends up masking bad sectors, which then never get repaired, and as a user you may not realize that the link resets are not normal and suggest a bad sector as the cause.
Which RAID software does this?
In the case where the drive's error recovery timeout is longer than the SCSI block layer's command timer, it just results in a link reset. The actual problem with the drive, including the bad sector, is obscured by the reset, so it never gets repaired.
Btrfs, mdadm, lvm are affected and I'm pretty sure ZFS on Linux as well assuming they haven't totally reimplemented their own block layer outside of the SCSI subsystem.
It's a super irritating problem. The kernel developers know all about it, but thus far it's considered something distributions should change for the use cases that need it. What that means so far is that distros don't change it, and users with consumer drives that have high error recovery times get bitten.
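The mismatch described above can be checked with a small Python sketch. It assumes the per-device command timer is exposed in seconds at /sys/block/<device>/device/timeout, as on typical Linux systems; the drive's own recovery time is not read from the drive here but passed in as a number you supply. Both function names are hypothetical.

```python
from pathlib import Path


def scsi_command_timeouts(sys_block="/sys/block"):
    """Return {device: command timer in seconds} for devices that expose one."""
    timeouts = {}
    for device_dir in sorted(Path(sys_block).iterdir()):
        timeout_file = device_dir / "device" / "timeout"
        try:
            timeouts[device_dir.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # No timeout attribute (e.g. loop devices) or unreadable value: skip.
            continue
    return timeouts


def masked_by_reset(timeouts, drive_recovery_seconds):
    """List devices whose command timer expires before the drive gives up on a sector."""
    return [
        device
        for device, timer in timeouts.items()
        if timer < drive_recovery_seconds
    ]


if __name__ == "__main__":
    found = scsi_command_timeouts()
    print(found)
    # 120 is an illustrative guess for a consumer drive's recovery time.
    print(masked_by_reset(found, drive_recovery_seconds=120))
```

Any device the second function lists is one where a slow sector read would be cut off by a link reset before the drive can report the error.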
Although honestly in a thousand year timeframe I very much doubt humanity will preserve ZFS, gzip, tar, jpeg, PNG, ASCII, today's spoken and written languages in current form, etc. Just as written material from 1000 years ago is not very accessible to most people; with the original material you need intense study before you even know what you're looking at.
Point being that computer languages come and go.
[1] https://en.wikipedia.org/wiki/IBM_RPG
https://en.wikipedia.org/wiki/Egyptian_hieroglyphs
https://en.wikipedia.org/wiki/Judaeo-Aragonese
https://en.wikipedia.org/wiki/Latin
Any argument you can make about historians being able to recover dead languages applies equally to their ability to recover dead computer languages, and there is no better or more accurate specification than the actual code.
So let me add to my recommendation, in addition to a copy of the xz source code, include a plain text copy of any 'how to program in C' book, or just the wikipedia page for the C language. That is more than enough for them to construct a program that can decompress xz files, once they relearn how to read whatever long dead language the book is written in (Ancient Pre-Cataclysm Earth English for example).
Sure, but are they going to remember things like weird precedence rules (see: &), undefined behaviour, etc.? Just because they want to reimplement a specific, small program does not mean they want to relearn several languages. What you're saying could easily blow up from 'how to code C' to 'reading the GCC / Clang compiler source code to figure out how a specific UB was implemented, which the program in this specific case falls into', which I'm sure nobody wants to spend their weekend doing. Implementing something like `xz` could simply be a midpoint on the way to their destination; they don't want to spend weeks digging up COBOL. Have at least some consideration for the human element, jeez.
Documentation, specifically _mathematical_ documentation, is more fault tolerant than either pseudocode or actual code.
At any other time, I would agree with you, but where archivism is concerned, I do not.
RPG is still around, and IBM still sells it on their cloud. But the language is highly proprietary, so don't expect cheap access to it.
ALGOL-58 is one of the languages which died; but ALGOL-68 is in the current debian repos, and would take under 30 seconds to install.
FLOW-MATIC has died, but COBOL is around and again, easily installable.
I think you are underestimating how much legacy software there is. For example, Fortran 77 is still actively used, and there are programs written in it every day. There is an immense number of programs written in C89. The support for those languages is likely to stay forever.
In general, I think this topic is very interesting. Imagine 1000 years have passed, and all the computers are running YEAR3000 architecture which is incompatible with all the software we have today. Archeologists discover a treasure trove of texts and binary files from the 21st century internet. They know ASCII and English, but nothing else. What can they do?
The answer is surprisingly simple:
(1) Write an emulator for a simple CPU, like an ARMv5. Here is a good one: https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bi...
You'd need to manually port this code to whatever language you are using now. But this should be doable -- the software has 6000 lines of very straightforward C89 code. It does not use any OS services, nor does it rely on UB or complex language features.
(2) Use it to boot Linux (the image is included in that webpage). This allows you to run Ubuntu from 2009 on your YEAR3000 architecture.
(3) If your archive contains a repository snapshot from 2009, copy it to your machine. You can now install and run all the 21st century software on your YEAR3000 computers. Congrats!
(4) The only thing missing is graphics support. Just run x11vnc (included in the Jaunty repo) over serial port (included in dmitry.gr's emulator). VNC protocol is simple and well specified.
... and that's how I'd bootstrap 21st century computing on 30th century infrastructure. Sure, it will take some effort, but this only needs to be done once, and running programs will be easy from there on.
The post I was responding to implied a raid array could be degraded and you wouldn’t know till it completely failed.
Interesting nevertheless
Just like in the end-to-end principle when applied to networking: you have a single strong integrity check at the very furthest endpoint possible, and then you don't build in integrity & ECC at every level of the stack, you devote those resources to higher performance, and just do retransmission from the other endpoint when a file occasionally gets corrupted and the integrity check catches it.
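The end-to-end approach described above can be sketched in a few lines of Python: a single strong check at the receiving end, and a retry from the source when it fails. `fetch_verified` and the `fetch` callable are hypothetical names for illustration, and SHA-256 is just one reasonable choice of check.

```python
import hashlib


def fetch_verified(fetch, expected_sha256, max_attempts=3):
    """Call fetch() until the result matches the expected digest, or give up.

    Returns (data, attempts_used). No intermediate layer is asked to do
    error correction; a failed check simply triggers another transfer.
    """
    for attempt in range(1, max_attempts + 1):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data, attempt
    raise IOError(f"integrity check failed after {max_attempts} attempts")


if __name__ == "__main__":
    original = b"contents of files.tar.xz"
    digest = hashlib.sha256(original).hexdigest()
    # Simulate one corrupted transfer followed by a clean one.
    transfers = [b"contents of files.tar.xq", original]
    data, attempts = fetch_verified(lambda: transfers.pop(0), digest)
    print(attempts, data == original)
```

This only works while the other endpoint still has a good copy, which is the usual objection when the same idea is applied to archives rather than networks.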
There will be many, many people who will gladly dig into the minutiae and technical details of arcane hardware, especially when it means making progress towards filling in the historical record. This is already the case today: there is a working https://en.wikipedia.org/wiki/Colossus_computer reconstructed just because it was historically significant.
> which I'm sure nobody wants to spend their weekend doing [if their original goal was to simply reconstruct xz].