ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot (nctritech.wordpress.com)
A few years ago, when I was on a game console team, a hardware engineer came to my desk and said, "Can you find out what's wrong with this disk drive?" It had come from a customer whose complaint was that games sometimes failed to download and game saves became unreadable.
I spent a fun afternoon tracking down what turned out to be a stuck-at-zero bit in that drive's cache. Just above the drive's ECC-it-to-death block storage was this flaky bit of RAM that was going totally unchecked. The console had a Merkle-tree-based file system and easily detected the failure, but without that additional checking the corruption would have been very subtle, most of the time.
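To make the failure mode concrete, here is a minimal Python sketch of a stuck-at-zero bit sitting between storage and the host. Everything in it is an assumption for illustration: the offset and mask of the faulty cell, the use of SHA-256, and the function names. It is not the console's actual file system. It shows why the corruption is subtle: data passes through untouched whenever the affected bit was already zero.

```python
import hashlib

# Hypothetical fault: bit 3 of one cache byte always reads back as zero.
STUCK_BIT_MASK = 0b0000_1000
STUCK_OFFSET = 5


def through_flaky_cache(data: bytes) -> bytes:
    """Pass data through a cache with a single stuck-at-zero bit."""
    buf = bytearray(data)
    if len(buf) > STUCK_OFFSET:
        buf[STUCK_OFFSET] &= ~STUCK_BIT_MASK & 0xFF
    return bytes(buf)


def digest(data: bytes) -> bytes:
    """End-to-end checksum kept by the file system, not by the drive."""
    return hashlib.sha256(data).digest()


def read_verified(stored: bytes, expected: bytes) -> bytes:
    """Read through the flaky cache and verify against the stored checksum."""
    data = through_flaky_cache(stored)
    if digest(data) != expected:
        raise IOError("checksum mismatch: corrupted after the drive's own ECC")
    return data
```

A block whose byte at the faulty offset already has that bit clear reads back fine; one with the bit set raises, which is the only reason the fault gets noticed at all.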
Okay, so that's just one system out of millions, right? What are the chances? Well, at the scale of millions, pretty much any hole in data integrity is going to be found out and affect real, live customers at some not insignificant rate. You really shouldn't be amazed at the number of single-bit memory errors happening on consumer hardware (from consoles to PCs -- and I assume phones). You should expect these failures and determine in advance if they are important to you and your customers.
Just asserting "CRCs are useless" is putting a lot of trust in stuff that has real-world failure modes.
Yes, and he does this over and over again throughout the article. I have personally experienced at least 3 scenarios that he has determined won't happen.
If this guy wrote a filesystem (something that he pretends to have enough experience to critique), it would be an unreliable unusable piece of crap.
A bunch of this article reads as if this scenario, which I in fact hit, won't happen, drives do it better, etc. But it happens. It happened to me. The drive did not "magically fix itself", and instead got worse over time. With ZFS, if it happens again, I can be told where it happened, exactly what files are affected, etc., and that's already better than what I got with that other disk which didn't have ZFS.
Plus the ZFS tools like snapshotting, send/receive, scrub being able to check integrity while the system is running... Those are great features.
(On a side note, ZFS -- at least OpenZFS -- doesn't support any CRC algorithms for use as its checksum.)
ZFS doesn't have to guess which copy is wrong. It knows, and it will automatically replace it.
More, ZFS will even do this on a ZFS mirror when reading half the data blocks from one disk and half from the other, because it reads the cryptographically-strong checksums in with each data block and checks them before delivering the data to the application. If the checksum doesn't match, it rewrites that block from the redundant copy on the other disk(s).
RAID can't do that. If one of a mirror's data blocks is corrupted on disk but with a correct ECC, so that the two blocks don't match but both read cleanly, RAID can't tell which one is correct, so it'll typically just force the system administrator to choose one disk to overwrite the other with. That exchanges astronomical odds against incorrect data for coin flip odds against.
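A rough Python sketch of that read path, under stated assumptions: SHA-256 stands in for whatever checksum the pool is configured with, each disk's copy is modelled as a `bytearray`, and `mirror_read` is a made-up name, not a ZFS function.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def mirror_read(copies: list, expected: bytes) -> bytes:
    """Return a copy that matches `expected`, repairing bad copies in place.

    `copies` holds one bytearray per mirror side; `expected` is the checksum
    the file system recorded when the block was written.
    """
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise IOError("every copy fails its checksum: unrecoverable block")
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # rewrite the damaged side from the verified copy
    return good
```

The point of the design is that the stored checksum acts as a tie-breaker: the code never has to compare the two sides against each other and guess.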
From the idea that SMART reliably detects hard drive failures, to dismissing data protection for no reason other than that it sounds unlikely to the author (which in several cases I know personally to be false, because I've experienced those failures).
ZFS is a very well designed filesystem. Things weren't added haphazardly or because they sounded cool. The author would do well to try to understand why those protections were added.
Sun/Oracle, and a lot of popular third-party documentation, have said as much very openly, and commands like zfs send/recv exist to easily automate ZFS cloning (to back up from one ZFS filesystem to another, for example, if you choose to do it that way).
I suspect whoever wrote this missed the boat on why zfs works.
The same "I've never seen it so it's not real" fallacy appears again in the discussion of RAID 5. He says that losing a second drive during a rebuild is "statistically very unlikely" but that's not so. Not only have I seen it many times, but the simple math of disk capacities and interface speeds shows that it's not really all that unlikely. I've seen RAID 6 fail because of overlapping rebuild times, leading people to push for more powerful erasure-coding schemes. Over the lifetime of even a medium-sized system, concurrent failures on RAID 5 are likely enough to justify using something stronger.
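One common version of that back-of-the-envelope math, sketched in Python. All the figures are assumptions, not the commenter's: 3 TB drives, 150 MB/s sustained throughput, a 4-drive RAID 5, and the often-quoted spec-sheet rate of one unrecoverable read error per 10^14 bits. Real drives may do better or worse than the spec figure, and this only counts read errors, not whole-drive failures.

```python
import math


def min_rebuild_hours(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Best-case rebuild time: one full sequential pass at interface speed."""
    return capacity_bytes / throughput_bytes_per_s / 3600


def p_read_error_during_rebuild(surviving_drives: int,
                                capacity_bytes: float,
                                ure_per_bit: float) -> float:
    """Chance of at least one unrecoverable read error while reading every
    surviving drive end to end, assuming independent errors per bit."""
    bits_read = surviving_drives * capacity_bytes * 8
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


CAPACITY = 3e12        # assumed: 3 TB per drive
THROUGHPUT = 150e6     # assumed: 150 MB/s sustained
URE_PER_BIT = 1e-14    # assumed: spec-sheet figure for consumer drives

hours = min_rebuild_hours(CAPACITY, THROUGHPUT)
risk = p_read_error_during_rebuild(3, CAPACITY, URE_PER_BIT)
print(f"best-case rebuild: {hours:.1f} h, chance of a read error: {risk:.0%}")
```

With these assumed inputs the rebuild takes hours even in the best case, and the odds of hitting a read error on the way are around even, which is hard to square with "statistically very unlikely".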
I was one of the earliest and most outspoken critics of ZFS hype and FUD when it came out. It was and is no panacea, but that doesn't justify more FUD in the other direction to sell backup products or services.
ZFS certainly isn't a magic wand you should wave at anything and everything, and it doesn't replace backups, but it does make the chances of something going wrong undetected much smaller. Even though the chances are small to begin with, there are times when you just can't accept them at all.
[0]: https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kele...
[1]: https://www.usenix.org/legacy/events/fast08/tech/full_papers...
[2]: http://storageconference.us/2006/Presentations/39rWFlagg.pdf
The author seems to misunderstand the purpose of snapshots. As is frequently pointed out [1], snapshots are not in fact backups and should not be used for longer-term storage.
Also the same argument can be used on Backups: "Backups may help, but they depend on the damage being caught before the backup of the good data is removed. If you save something and come back six months later and find it’s damaged, your backups might just contain a few months with the damaged file and the good copy was lost a long time ago."
[1] http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-not-...
I don’t know much about btrfs, so I’ll stick to ZFS-related comments. ZFS does not use CRC; by default it uses the fletcher4 checksum. Fletcher’s checksum is made to approach CRC properties without the computational overhead usually associated with CRC.
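For readers who haven't seen it, a Fletcher-4 style checksum is just four running sums over 32-bit words. The Python below is a minimal sketch, assuming little-endian words and 64-bit wrapping accumulators; it is meant to show the idea, not to be byte-for-byte compatible with any particular ZFS build.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data: bytes) -> tuple:
    """Four cascaded 64-bit running sums over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # plain sum of the words
        b = (b + a) & MASK64     # sum of sums: sensitive to word position
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d
```

Only additions are involved, which is where the speed advantage over a table-driven or bitwise CRC comes from, while the cascaded sums still make the result depend on word order.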
Without a checksum, there is no way to tell if the data you read back is different from what you wrote down. As you said, corruption can happen for a variety of reasons: bugs or hardware failure anywhere in the storage stack. Just as with other filesystems, not all types of corruption will be caught even by ZFS, especially on the write-to-disk side. However, ZFS will catch bit rot and a host of other corruptions, while non-checksumming filesystems will just pass the corrupted data back to the application. Hard drives don’t do it better: they have no idea if they’ve bit-rotted over time, and there are many other components that may and do corrupt data. It’s not as rare as you think. The longer you hold data and the more data you have, the higher the chance you will see corruption at some point.
I want to do my best to avoid corrupting data and then giving it back to my users, so I would like to know if my data has been corrupted (not to mention I’d like it to self-heal as well, which is what ZFS will do if there is a good copy available). If you care about your data, use a checksumming filesystem, period. Ideally, a checksumming filesystem that doesn’t keep the checksum next to the data. A typical checksum is less than 0.14 KB, while the block that it’s protecting is 128 KB by default. I’ll take that 0.1% “waste of space” to detect corruption all day, any day. Now let’s remember ZFS can also do in-line compression, which will easily save you 3-50% of storage space (depending on the data you’re storing), and calling a checksum a “waste of space” is even more laughable.
I do want to say that I wholeheartedly agree with “Nothing replaces backups” no matter what filesystem you’re using. Backing up between two OpenZFS pools on machines in different physical locations is super easy using ZFS snapshots and the send/receive functionality.
ZFS was created to solve actual business problems.
Here's a quote:
- “ZFS has CRCs for data integrity
A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.
This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."
Meanwhile in reality...
ZFS does not use CRCs for checksums.
It's very hard to take someone's view seriously when they are making mistakes at this level.
ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.
- "Hard drives already do it better"
No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.
It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone?' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.
ZFS allows optional duplication of metadata or data blocks automatically; as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.
Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just check the flag. Regular hard drives don't offer that.
- "What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with."
Well, that and entire disk failures.
And power failures leading to inconsistency on the drive.
And cable faults leading to the wrong data being sent to the drive to be written.
And drive firmware bugs.
And faulty cache memory or faulty controllers on the hard drive.
And poorly connected drives with intermittent glitches / timeouts in communication.
You get the idea.
I could also point out that ZFS allows you to back up quickly and precisely (via snapshots, and incremental snapshot diffs).
It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.
It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...
ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).
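A toy Python model of the copy-on-write idea, assuming a single "root pointer" that is switched only after the new block is fully written. The `CowStore` class and its `crash_before_commit` flag are invented for illustration and say nothing about ZFS's actual on-disk layout.

```python
class CowStore:
    """Toy copy-on-write store: updates never overwrite live data."""

    def __init__(self, initial: bytes):
        self.blocks = {0: initial}  # block id -> contents
        self.root = 0               # the only thing ever updated in place
        self._next_id = 1

    def read(self) -> bytes:
        return self.blocks[self.root]

    def update(self, new_data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self._next_id
        self._next_id += 1
        if crash_before_commit:
            # Simulated power cut: only half the new block reaches "disk".
            self.blocks[new_id] = new_data[: len(new_data) // 2]
            raise RuntimeError("simulated power cut")
        self.blocks[new_id] = new_data  # step 1: write to a fresh location
        self.root = new_id              # step 2: switch the pointer
```

If the simulated power cut hits during step 1, the torn block is simply never referenced and `read()` keeps returning the old contents.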
It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.
Where is the author losing his data (that he knows of, and in his very limited experience...): All of my data loss tends to come from poorly typed ‘rm’ commands. ... so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds? [either by rolling back the filesystem, or by conveniently 'dipping into' past snapshots as though they were present-day directories as needed]
Anyway I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in future, maybe even try them once or twice, before they write their critique.
What next?
"Why even use C? Everything you can do in C, you can do in PHP anyway!"
I have no idea what the problem is with this server. There are no SMART failures or kernel messages indicating hardware failure, and the system doesn't hard-crash. The thing is, I don't actually have to care, because ZFS is actively taking care of the problem. Until one of the disks goes so bad that SMART or the kernel's SATA layer or ZFS can point me at it, I can just passively let ZFS continue protecting me.
If this were a RAID, the first risk is that the RAID system wouldn't have a scrub command at all. Some do, but not all. Without such a command, those on-disk ECCs the author heaps so much praise on won't help him. I've got the same ECCs backing my ZFS, and clearly the data is getting corrupted anyway, somehow.
Let's keep the author's context in mind, which is apparently that we're going to use motherboard or software RAID, since he's budgeted $0 for a hardware RAID card, so the chances are higher that there is no scrub or verify command.
If our RAID implementation does happen to have a scrub or verify command, it might be forced to just kick one of the disks out or mark the whole array as degraded, depending on where in the chain the corruption happened. If it does that, it'll take a whole lot longer to rewrite one of the author's cheap 3 TB disks than it took ZFS on my file server to fix the few megs of corrupted blocks.
And that's not all. I have a second anecdote, the plural of which is "data," right? :)
Another ZFS-based system I manage had a disk die outright in it. SMART errors, I/O timeouts, the whole bit. Very easy to diagnose. So, I attached a third disk in an external hard disk enclosure to the pained ZFS mirror, which caused ZFS to start resilvering it.
Before I go on, I want to point out that this shows another case where ZFS has a clear advantage. In a typical hardware RAID setup, a 2-disk mirror is more likely to be done with a 2-port RAID card, because they're cheaper than 4-port and 8-port cards. That means there is a very real chance that you couldn't set up a 3-disk mirror at all, which means you're temporarily reduced to no redundancy during the resilver operation. Even if you've got a spare RAID port on the RAID card or motherboard, you might not have another internal disk slot to put the disk in. With ZFS, I don't need either: ZFS doesn't care if two of a pool's disks are in a high-end RAID enclosure configured for JBOD and the third is in a cheap USB enclosure.
The point of having a temporary 3-disk mirror is that the dying disk wasn't quite dead yet. That means it was still useful for maintaining redundancy during the resilvering operation. With the RAID setup, you might be forced to replace the dying disk with the new disk, which means you lose all your redundancy during the resilver.
Now as it happens, sometime during the resilver operation, `zpool status` began showing corruptions. ZFS was actively fixing them like a trooper, but this was still very bad. It turned out that the cheap USB external disk enclosure I was using for the third disk was flaky, so that when resilvering the new disk, it wasn't always able to write reliably. I unmounted the ZFS pool, moved the new disk to a different external USB disk enclosure, re-mounted the pool, and watched it pick the resilvering process right back up. Once that was done, I detached the dying disk from the mirror and did a scrub pass to clear the data errors, and I was back in business having lost no data, despite the hardware actively trying to kill my data twice over.
There are still cases where I'll use RAID over ZFS, but I'm under no illusions that ZFS has no real advantages over RAID. I've seen plenty of evidence to the contrary.
Tiny nitpick: though Oracle now owns and develops ZFS, Sun Microsystems was the company that initially designed and implemented it. They worked on it for 5 years after they released it, before Oracle acquired them.
Unlike most journaled filesystems, ZFS:
- allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being used to ensure integrity (e.g. try running a write performance test on a database like postgresql using ext4 with and without journalling, you'll see a difference).
- the filesystem log can be mirrored physically, to protect against the risk of log device failure [which would endanger writes in flight].
Other similarities/differences:
In a journalling FS, you need to take the filesystem offline and check the journal. In ZFS, there is continual passive checking of file data and metadata at time of access, as well as the option for an online 'scrub' that is similar to the fsck of a journalled filesystem without requiring dismounting of the filesystem.
While copy-on-write by itself may not be necessarily strictly superior to journalling, ZFS is strictly superior to either.
The far more risky situations involve reading back data.
A properly-optimized RAID or RAID-like system will read half the blocks from one disk and half from the other when dealing with a 2-disk mirror.
With RAID-1, if the data blocks read cleanly from one disk — that is, the hard disk's ECC does its thing, as the author expects — but the data bytes are then corrupted in RAM during the DMA transfer, RAID won't detect the problem. Your application will simply have errors in those blocks, and it'll be oblivious to the problem unless there is some corruption detection ability in the data format.
With a ZFS mirror, things are different. If the blocks are cleanly read from the disk (again according to those in-drive ECC checks) but the bytes are corrupted during the DMA transfer to RAM, ZFS will detect it, because it always double-checks the hashes (cryptographically-strong hashes, mind, not CRCs, as the author misstates) after reading the data in from disk. This will cause ZFS to attempt a second read from the corresponding block in the other side of the mirror. Assuming you don't get a second RAM corruption, the checksum will match this time, so ZFS will re-write the clean block to the first disk. ZFS is incorrectly assuming it was the drive that corrupted the block, but it doesn't matter because all that happens is a correct block is overwritten with the same correct block.
Now let's take a trickier case. What if your RAM is so flaky that it re-corrupts the clean block on its way back out to the first disk during this unnecessary re-write? ZFS will write the correct checksum along with that block's data, so that when it comes time to re-read that block, the checksum won't match the data. It doesn't matter whether the RAM corrupts the checksummed data or the checksum itself, because the odds are astronomically against both being corrupted in a way that causes the two to match. When ZFS is told to re-read that corrupted block, either by the application or by a background scrub, it will again decide it needs to overwrite the first disk's copy of the block, which this time really is corrupted on-disk, with the copy from the second disk. Unless your RAM corrupts the data a third time, this time it will write the correct data to disk.
RAID can't do any of that. All RAID can do is say, "These two blocks don't match each other, but both have good on-disk ECC, so PANIC." Different RAID implementations do different things here. Some will just mark the array as degraded and force the operator to choose one disk to mirror back onto the other. If the operator guesses wrong, you've got two copies of the bad data now.
ZFS doesn't have to guess: it knows which copy is wrong with astronomical odds in favor of being correct.
In a proper setup (mount options data=writeback,noatime,relatime, WAL configured reasonably wrt max_wal_size/checkpoint_segments) the overhead due to ext4 journaling shouldn't be a major factor. You'll see some overhead initially when the WAL segments are allocated as you go, but after that they'll be recycled.
For OLTP write heavy databases I'd say the intent log is more a liability than an advantage, it's easy to screw over performance and/or storage lifetime with it.
Not that online scanning makes too much sense anyway. The good filesystems verify sanity of the structure they traverse, so might as well put in a full FS read in cron. Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.
ZFS scrub is not the same thing.
If you do a full-filesystem read in a RAID system at the OS level, the redundant blocks won't be read: the RAID system will simply choose one of the copies to read based on which disk(s) is least heavily loaded at the moment. This is why reading on a 2-disk mirror is twice as fast as reading from a single one of the disks comprising the mirror.
During a ZFS scrub, all copies of every block are checked, and because the data is heavily checksummed, ZFS knows which copy is right if one of the 2+ redundant copies doesn't match its checksum.
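The difference can be sketched in a few lines of Python. This is a toy model under assumptions: each block is a pair of a recorded SHA-256 checksum and a list of `bytearray` copies, and `raid_style_read` alternates disks per block as a stand-in for load balancing. Neither function name comes from ZFS or any RAID implementation.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def raid_style_read(blocks):
    """Full read through a mirror: one copy per block, alternating disks."""
    return [bytes(copies[i % len(copies)]) for i, (_, copies) in enumerate(blocks)]


def scrub(blocks):
    """Check every copy of every block; return (repaired, unrecoverable)."""
    repaired = unrecoverable = 0
    for expected, copies in blocks:
        good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
        if good is None:
            unrecoverable += 1
            continue
        for c in copies:
            if bytes(c) != good:
                c[:] = good  # heal the bad copy from the verified one
                repaired += 1
    return repaired, unrecoverable
```

If the damaged copy happens to sit on the disk that the full read skipped for that block, the read comes back clean and the damage stays latent; the scrub touches both sides and finds it.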
Additionally, ZFS is structured as a Merkle tree (https://en.wikipedia.org/wiki/Merkle_tree) which avoids whole classes of ways traditional filesystems can become deranged at a structural level. ZFS always stores 3+ copies of certain types of filesystem metadata, even on a 1-disk ZFS pool, so that if one gets corrupted, it has 2+ others to choose from. When this same type of corruption happens on a traditional filesystem, well, let's just say that's why `/lost+found` exists.
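As a minimal illustration of the Merkle-tree idea, here is a two-level Python sketch where the parent block is nothing but the checksums of its children, and a single root checksum vouches for the parent. SHA-256 and the helper names are assumptions for the example, and real ZFS block pointers carry much more than a checksum.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_parent(children: list) -> bytes:
    """A parent block here is just the concatenated checksums of its children."""
    return b"".join(h(c) for c in children)


def verify_tree(root_checksum: bytes, parent: bytes, children: list) -> list:
    """Return indexes of children that fail verification.

    Raises IOError if the parent itself does not match the root checksum,
    since its child checksums could not be trusted in that case.
    """
    if h(parent) != root_checksum:
        raise IOError("parent block is corrupt")
    return [
        i for i, child in enumerate(children)
        if h(child) != parent[i * DIGEST_SIZE:(i + 1) * DIGEST_SIZE]
    ]
```

Because each level is checked against a checksum stored one level up, a block can't vouch for itself, and damage to either data or metadata is pinned to a specific block.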
> Most kinds of damage cannot be repaired on a live filesystem anyway.
See my post above, giving two anecdotes of ZFS actively repairing data on live filesystems. Both systems were in continuous use while these repairs proceeded, and no data were lost in either.
You're totally wrong.
The easiest way to demonstrate why is for you to set up a script to randomly write zeros/junk in any amount, at any time, anywhere over one of the block devices being used by ZFS, all day every day.
[Assuming you're using one of the available forms of redundancy, i.e. multiple copies, RAID-Z1/2, or mirroring etc.]
Sit back and watch ZFS giving no fucks at all as it repairs all the damage passively.
You can even introduce such damage in moderate quantities across all of the block devices used by ZFS. Again, you'll see a goddamn incredible amount of self-healing going on and accurate reporting about where it's unable to recover files due to the damage across multiple volumes being too extensive.
It's unlikely that even in this extreme instance of willful massive harm to the disks you'll see the filesystem being damaged, because a) filesystem metadata is checksummed too, b) the metadata blocks are automatically stored twice in different places, and c) you also have the redundancy of multiple devices, e.g. mirroring/RAID-Z.
Try it, prove me wrong.
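For anyone who wants to take up the challenge, a minimal Python sketch of such a scribbler follows. It assumes the target is a plain file used as a vdev in a throwaway test pool (it sizes the target with `os.path.getsize`, which does not work on real block devices), and `scribble` is a made-up helper. Only point it at something you are prepared to lose.

```python
import os
import random


def scribble(path: str, rng: random.Random, max_len: int = 4096) -> tuple:
    """Overwrite one random region of `path` with junk; return (offset, length).

    Assumes `path` is a regular file of at least 16 bytes, e.g. a file-backed
    vdev in a disposable test pool.
    """
    size = os.path.getsize(path)
    length = rng.randint(16, min(max_len, size))
    offset = rng.randrange(0, size - length + 1)
    junk = bytes(rng.getrandbits(8) for _ in range(length))
    with open(path, "r+b") as f:  # r+b: write in place, never truncate
        f.seek(offset)
        f.write(junk)
    return offset, length
```

Calling it in a loop with a sleep in between, then running a scrub and checking the pool's status, is one way to reproduce the experiment described above.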
The first anecdote is about a TrueOS box — which previews what will become FreeBSD 12 — and the second is about a macOS Sierra box running OpenZFS on OS X.
Since TrueOS, O3X and ZoL are all based on OpenZFS, I expect that you will have the opportunity to replicate my experiences should you have disks that die. Now I don't know whether to wish you good luck or not. :)