ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot

ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot (nctritech.wordpress.com)

19 points by gphreak 8 years ago | 37 comments

kabdib 8 years ago |

> While it is true that keeping a hash of a chunk of data will tell you if that data is damaged or not, the filesystem CRCs are an unnecessary and redundant waste of space ...

A few years ago, when I was on a game console team, a hardware engineer came to my desk and said, "Can you find out what's wrong with this disk drive?" It had come from a customer whose complaint was that games sometimes failed to download and game saves became unreadable.

I spent a fun afternoon tracking down what turned out to be a stuck-at-zero bit in that drive's cache. Just above the drive's ECC-it-to-death block storage was this flaky bit of RAM that was going totally unchecked. The console had a Merkle-tree-based file system and easily detected the failure, but without that additional checking the corruption would have been very subtle, most of the time.

Okay, so that's just one system out of millions, right? What are the chances? Well, at the scale of millions, pretty much any hole in data integrity is going to be found out and affect real, live customers at some not insignificant rate. You really shouldn't be amazed at the number of single-bit memory errors happening on consumer hardware (from consoles to PCs -- and I assume phones). You should expect these failures and determine in advance if they are important to you and your customers.

Just asserting "CRCs are useless" is putting a lot of trust on stuff that has real-world failure modes.
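To make the stuck-at-zero story concrete, here is a minimal Python sketch of how a Merkle-tree check can catch that kind of fault. It is not the console's actual file system: the block size, the `through_flaky_cache` helper and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size for this sketch


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash each block, then hash pairs of hashes upward until one root remains."""
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def through_flaky_cache(block: bytes, byte_index: int = 100, bit: int = 3) -> bytes:
    """Model a cache with one bit stuck at zero: that bit is cleared on every pass."""
    damaged = bytearray(block)
    damaged[byte_index] &= ~(1 << bit) & 0xFF
    return bytes(damaged)


# Four sample blocks; the root is computed while the data is known to be good.
blocks = [bytes([fill]) * BLOCK_SIZE for fill in (0xFF, 0x0F, 0xF0, 0xAA)]
trusted_root = merkle_root(blocks)

# Every block passes through the faulty cache on its way back.
read_back = [through_flaky_cache(b) for b in blocks]
corrupted = [i for i, (a, b) in enumerate(zip(blocks, read_back)) if a != b]
root_matches = merkle_root(read_back) == trusted_root
```

Only blocks that happened to have that bit set are damaged (the `0xF0` block passes through untouched), which is why the symptom would look intermittent without the tree. With the tree, the root comparison fails as soon as any block is altered.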

rgbrenner 8 years ago | |

> Just asserting "CRCs are useless" is putting a lot of trust on stuff that has real-world failure modes.

Yes, and he does this over and over again throughout the article. I have personally experienced at least 3 scenarios that he has determined won't happen.

If this guy wrote a filesystem (something that he pretends to have enough experience to critique), it would be an unreliable unusable piece of crap.

AstralStorm 8 years ago | | |

You have worse problems that a filesystem won't catch if RAM gets randomly corrupted, including said CRC check itself getting corrupted, or the code that writes data structures to disk being wrong. Neither of those is caught by a CRC any better than by a dirty bit. It so happens that journaling file systems already have a degree of redundancy for writes built into them unless you defeat it.

Buge 8 years ago | |

But did the console's software checking help in that case? Either way you're going to have a customer complaining about problems.

X86BSD 8 years ago | |

Consumer hardware is notoriously busted. Even most of the enterprise hardware isn't flawless. Firmware bugs etc. "Your hardware is actively trying to kill your data and ZFS's job is to prevent it." To paraphrase Allan Jude.

asveikau 8 years ago |

A few years ago I had a drive at home that was flipping bits, randomly corrupting my files. It inspired me to build a ZFS disk server and introduce redundancy in my home setup.

A bunch of this article reads as if this scenario, which I in fact hit, won't happen, drives do it better, etc. But it happens. It happened to me. The drive did not "magically fix itself", and instead got worse over time. With ZFS, if it happens again, I can be told where it happened, exactly what files are affected, etc., and that's already better than what I got with that other disk which didn't have ZFS.

Plus the ZFS tools like snapshotting, send/receive, scrub being able to check integrity while the system is running... Those are great features.

Mindless2112 8 years ago |

As someone who has lost some files to a silently malfunctioning hard disk in the past, I think I'll stick with ZFS. Checksumming, RAID-Z, and periodic scrubbing would have saved my files. Even having backups did not -- after all, what good is a bit-for-bit copy of a corrupted file?

(On a side note, ZFS -- at least OpenZFS -- doesn't support any CRC algorithms for use as its checksum.)

AstralStorm 8 years ago | |

Mostly periodic scrubbing and patrol reads, I reckon. Those are just as necessary with plain RAID, without ZFS.

wyoung2 8 years ago | | |

Scrub/verify/patrols, whatever you want to call it, with RAID all it can do is say, "Well shit, these two copies don't match. What do you want me to do about it, boss?"

ZFS doesn't have to guess which copy is wrong. It knows, and it will automatically replace it.

More, ZFS will even do this on a ZFS mirror when reading half the data blocks from one disk and half from the other, because it reads the cryptographically-strong checksums in with each data block and checks them before delivering the data to the application. If the checksum doesn't match, it rewrites that block from the redundant copy on the other disk(s).

RAID can't do that. If one of a mirror's data blocks is corrupted on disk but with a correct ECC, so that the two blocks don't match but both read cleanly, RAID can't tell which one is correct, so it'll typically just force the system administrator to choose one disk to overwrite the other with. That exchanges astronomical odds against incorrect data for coin flip odds against.
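The difference between "these two copies don't match" and "this copy is the bad one" can be sketched in a few lines of Python. This is a toy model, not ZFS code: the `Mirror` class and its method names are hypothetical, and SHA-256 simply stands in for whichever checksum the pool is configured to use.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy two-way mirror; the expected checksum is stored apart from the data."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def plain_raid_compare(self) -> str:
        # Without a checksum, a mismatch shows that something is wrong, not which side.
        if self.copies[0] == self.copies[1]:
            return "match"
        return "mismatch, cannot tell which copy is bad"

    def checksummed_read(self) -> bytes:
        # Return the first copy that verifies, and rewrite any copy that differs from it.
        good = next(
            (bytes(c) for c in self.copies if checksum(bytes(c)) == self.expected),
            None,
        )
        if good is None:
            raise IOError("all copies fail their checksum")
        for c in self.copies:
            if bytes(c) != good:
                c[:] = good
        return good


original = b"game save v1" * 100
m = Mirror(original)
m.copies[0][7] ^= 0x01  # silent single-bit flip on "disk 0"; both disks still read cleanly

raid_verdict = m.plain_raid_compare()
data = m.checksummed_read()
healed = m.copies[0] == m.copies[1]
```

The plain comparison can only report a mismatch, while the checksummed read returns the intact data and repairs the damaged side.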

rgbrenner 8 years ago |

For an article with that tone, you would think the author would have more experience. It's literally filled with flawed and uninformed or inexperienced thinking.

From the idea that SMART reliably detects hard drive failures.. to dismissing data protection for no reason other than it sounds unlikely to the author (which in several cases I know personally to be false... because I've experienced those failures).

ZFS is a very well designed filesystem. Things weren't added haphazardly or because they sounded cool. The author would do well to try to understand why those protections were added.

AstralStorm 8 years ago | |

Almost all of the protections are also afforded by plain old RAID without ZFS. Why waste space on a CRC when you still get to run a redundancy check? If FS structure is corrupted CRC won't save you anyway. An FSCK might instead.

DiabloD3 8 years ago |

This entire article can be summarized as the following: RAID is not a replacement for backups.

Sun/Oracle, and a lot of popular third-party documentation, have said as much very openly, and commands like zfs send/recv exist to easily automate ZFS cloning (to back up from one ZFS filesystem to another, for example, if you choose to do it that way).

I suspect whoever wrote this missed the boat on why zfs works.

notacoward 8 years ago |

Totally off base, on several points. Any kind of checksum on the disk only protects what gets to the disk. Filesystem-level CRCs can protect the entire data path. If you have a defect in your RAID card or HBA, or anywhere in the software stack below the filesystem, on-disk CRCs will happily "validate" the already-corrupted data while filesystem-level CRCs are likely to detect the corruption. The author dismisses it as a "remotely likely scenario" but I've seen it happen for real many times. Maybe that's because I have about 3.5x as many years of experience as the author, across what's probably thousands of times as many machines or drives (I've worked on some big systems).
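A small Python sketch of the data-path argument, under loud assumptions: `buggy_hba` and `Disk` are hypothetical stand-ins, and CRC-32 is used only because it is in the standard library, not because any particular drive or filesystem uses it.

```python
import zlib


def buggy_hba(data: bytes) -> bytes:
    """Hypothetical faulty controller: flips one bit before the data reaches the disk."""
    damaged = bytearray(data)
    damaged[0] ^= 0x80
    return bytes(damaged)


class Disk:
    """Toy disk that protects a sector with a code computed on whatever it receives."""

    def write(self, data: bytes) -> None:
        self.sector = data
        self.sector_code = zlib.crc32(data)  # stand-in for the drive's internal ECC

    def read(self) -> bytes:
        if zlib.crc32(self.sector) != self.sector_code:
            raise IOError("drive reports an unreadable sector")
        return self.sector


payload = b"customer record 0001"
fs_checksum = zlib.crc32(payload)  # computed by the filesystem, above the HBA

disk = Disk()
disk.write(buggy_hba(payload))  # corruption happens below the filesystem

returned = disk.read()  # does not raise: the drive validates the already-corrupted data
fs_detects = zlib.crc32(returned) != fs_checksum
```

The drive's own check passes because it was computed after the damage was done. Only the checksum taken at the top of the stack sees the difference.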

The same "I've never seen it so it's not real" fallacy appears again in the discussion of RAID 5. He says that losing a second drive during a rebuild is "statistically very unlikely" but that's not so. Not only have I seen it many times, but the simple math of disk capacities and interface speeds shows that it's not really all that unlikely. I've seen RAID 6 fail because of overlapping rebuild times, leading people to push for more powerful erasure-coding schemes. Over the lifetime of even a medium-sized system, concurrent failures on RAID 5 are likely enough to justify using something stronger.
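One common back-of-envelope version of that math, as a Python sketch. Every figure here is an illustrative assumption rather than a measurement: the 1-in-1e14-bits unrecoverable read error rate is a frequently quoted consumer-drive spec-sheet bound, and the drive size, array width and throughput are made up. It also models a read error during the rebuild rather than a second whole-drive failure, so treat it as a rough indication only.

```python
import math

# Illustrative assumptions, not measurements.
URE_RATE_PER_BIT = 1e-14        # spec-sheet style bound: 1 unrecoverable error per 1e14 bits read
DRIVE_BYTES = 4e12              # 4 TB drives
SURVIVING_DRIVES = 3            # a 4-drive RAID 5 after one drive has failed
THROUGHPUT_BYTES_PER_S = 150e6  # sustained sequential rate


def p_read_error_during_rebuild(drive_bytes, surviving, ure_rate):
    """Probability of at least one unrecoverable read error while reading every surviving drive."""
    bits_read = drive_bytes * 8 * surviving
    # 1 - (1 - rate)^bits, computed with log1p/expm1 to avoid floating-point cancellation.
    return -math.expm1(bits_read * math.log1p(-ure_rate))


def rebuild_hours(drive_bytes, throughput):
    """Lower bound on rebuild time: one full pass over a drive at full speed."""
    return drive_bytes / throughput / 3600


p = p_read_error_during_rebuild(DRIVE_BYTES, SURVIVING_DRIVES, URE_RATE_PER_BIT)
hours = rebuild_hours(DRIVE_BYTES, THROUGHPUT_BYTES_PER_S)
print(f"P(read error during rebuild) ~ {p:.2f}, rebuild takes at least {hours:.1f} h")
```

Under these assumptions the probability comes out above one half and the rebuild window is several hours long. Real drives often beat their spec-sheet error rate, so the true odds may be lower, but the result is far from "statistically very unlikely".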

I was one of the earliest and most outspoken critics of ZFS hype and FUD when it came out. It was and is no panacea, but that doesn't justify more FUD in the other direction to sell backup products or services.

Veratyr 8 years ago |

While he's right that it's not as big an issue as ZFS fanatics make it out to be, it _is_ a real issue and they're not just pulling it out their asses. There are a number of studies that actually measured the error rate, some of the bigger ones being done by CERN [0], NetApp [1] and IA (I think there's meant to be a talk or something to go with this one) [2].

ZFS certainly isn't a magic wand you should wave at anything and everything, and it doesn't replace backups, but it does make the chances of something going wrong undetected much smaller. Even though the chances are small to begin with, there are times when you just can't accept any risk at all.

[0]: https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kele...

[1]: https://www.usenix.org/legacy/events/fast08/tech/full_papers...

[2]: http://storageconference.us/2006/Presentations/39rWFlagg.pdf

X86BSD 8 years ago | |

Actually it does replace backups with replication and/or cloning.

bbatha 8 years ago | | |

It's not backed up until it's at least on an external system, ideally in triplicate: off-box, off-site, and in cold storage. Cloning and replication make it easier to back up but are no substitute.

Veratyr 8 years ago | | |

That's not a replacement for backups, that's an implementation of backups and only if you send it to an offline disk or remote system.

ATsch 8 years ago |

>Snapshots may help, but they depend on the damage being caught before the snapshot of the good data is removed. If you save something and come back six months later and find it’s damaged, your snapshots might just contain a few months with the damaged file and the good copy was lost a long time ago.

The author seems to misunderstand the purpose of snapshots. As frequently [1] pointed out, snapshots are not in fact backups and should not be used for longer term storage.

Also the same argument can be used on Backups: "Backups may help, but they depend on the damage being caught before the backup of the good data is removed. If you save something and come back six months later and find it’s damaged, your backups might just contain a few months with the damaged file and the good copy was lost a long time ago."

[1] http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-not-...

OpenZFSonLinux 8 years ago |

This blog post was deleted hours after I posted the following comment rebuking most of what was said:

I don’t know much about btrfs so I’ll stick to ZFS related comments. ZFS does not use CRC, by default it uses fletcher4 checksum. Fletcher’s checksum is made to approach CRC properties without the computational overhead usually associated with CRC.
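For readers who have not seen it, a Fletcher-4 style checksum is just four running sums. The Python sketch below follows the usual description (four 64-bit accumulators fed with 32-bit words). Reading the words as little-endian and requiring a length that is a multiple of 4 are assumptions of this sketch, which is not taken from the OpenZFS source.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data: bytes) -> tuple[int, int, int, int]:
    """Fletcher-4 style checksum: four 64-bit running sums over 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):  # little-endian words: an assumption
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


block = bytes(range(256)) * 512  # 128 KiB of sample data
flipped = bytearray(block)
flipped[70_000] ^= 0x04          # flip a single bit somewhere in the block
changed = fletcher4(block) != fletcher4(bytes(flipped))
```

There are no table lookups or polynomial division, only additions, which is where the low computational overhead comes from. A single flipped bit always changes at least the first sum.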

Without a checksum, there is no way to tell if the data you read back is different from what you wrote down. As you said, corruption can happen for a variety of reasons, due to bugs or HW failure anywhere in the storage stack. Just like with other filesystems, not all types of corruption will be caught even by ZFS, especially on the write-to-disk side. However, ZFS will catch bit rot and a host of other corruptions, while non-checksumming filesystems will just pass the corrupted data back to the application. Hard drives don’t do it better: they have no idea if they’ve bit rotted over time, and there are many other components that may and do corrupt data. It’s not as rare as you think. The longer you hold data and the more data you have, the higher the chance you will see corruption at some point.

I want to do my best to avoid corrupting data and then giving it back to my users, so I would like to know if my data has been corrupted (not to mention I’d like it to self-heal as well, which is what ZFS will do if there is a good copy available). If you care about your data, use a checksumming filesystem, period. Ideally, a checksumming filesystem that doesn’t keep the checksum next to the data. A typical checksum is less than 0.14 KB while a block that it’s protecting is 128 KB by default. I’ll take that 0.1% “waste of space” to detect corruption all day, any day. Now let’s remember ZFS can also do in-line compression, which will easily save you 3-50% of storage space (depending on the data you’re storing), and calling a checksum a “waste of space” is even more laughable.
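The overhead figure is easy to check. A short Python sketch: the first number uses the checksum and block sizes quoted above, and the second assumes a bare 256-bit (32-byte) checksum field, which is an assumption of this sketch rather than something stated in the comment.

```python
RECORD_BYTES = 128 * 1024             # 128 KiB default block size quoted above
QUOTED_CHECKSUM_BYTES = 0.14 * 1024   # the "less than 0.14" upper bound quoted above
RAW_CHECKSUM_BYTES = 256 // 8         # assumption: a 256-bit checksum field, i.e. 32 bytes


def overhead_percent(checksum_bytes: float, record_bytes: float) -> float:
    """Space spent on the checksum, as a percentage of the data it protects."""
    return 100 * checksum_bytes / record_bytes


quoted_overhead = overhead_percent(QUOTED_CHECKSUM_BYTES, RECORD_BYTES)  # about 0.11 %
raw_overhead = overhead_percent(RAW_CHECKSUM_BYTES, RECORD_BYTES)        # about 0.024 %
```

Either way the cost is on the order of a tenth of a percent or less, far below what compression typically gives back.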

I do want to say that I wholeheartedly agree with “Nothing replaces backups” no matter what filesystem you’re using. Backing up between two OpenZFS pools on machines in different physical locations is super easy using ZFS snapshots and send/receive functionality.

zlynx 8 years ago |

He missed all the history of ZFS too. Sun had actual customers with bit rot. Even though they were running systems with the highest types of server hardware Sun provided, they had invisible data errors which were only noticed when the files were used and analysis showed ECC passing bit errors.

ZFS was created to solve actual business problems.

random_comment 8 years ago |

This entire article can be summarised as 'guy who has never used ZFS and has no idea whatsoever about how it works writes a critique that exposes their ignorance publicly'.

Here's a quote:

- “ZFS has CRCs for data integrity

A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.

This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."

Meanwhile in reality...

ZFS does not use CRCs for checksums.

It's very hard to take someone's view seriously when they are making mistakes at this level.

ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.

- "Hard drives already do it better"

No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.

It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone?' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.

ZFS allows optional duplication of metadata or data blocks automatically; as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.

Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just check the flag. Regular hard drives don't offer that.

- "What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with."

Well, that and entire disk failures.

And power failures leading to inconsistency on the drive.

And cable faults leading to the wrong data being sent to the drive to be written.

And drive firmware bugs.

And faulty cache memory or faulty controllers on the hard drive.

And poorly connected drives with intermittent glitches / timeouts in communication.

You get the idea.

I could also point out that ZFS allows you to backup quickly and precisely (via snapshots, and incremental snapshot diffs).

It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.

It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...

ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).

It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.

Where is the author losing his data (that he knows of, and in his very limited experience...)? "All of my data loss tends to come from poorly typed ‘rm’ commands." ... so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds? [either by rolling back the filesystem, or by conveniently 'dipping into' past snapshots as though they were present-day directories as needed]
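A toy Python model of why a snapshot makes a bad `rm` recoverable. It is a sketch of the copy-on-write idea only, not of ZFS's on-disk design, and the `Dataset` class, its methods and the file and snapshot names are all hypothetical.

```python
class Dataset:
    """Toy copy-on-write dataset: file contents are immutable, snapshots share them."""

    def __init__(self):
        self.live = {}       # file name -> immutable bytes
        self.snapshots = {}  # snapshot name -> frozen copy of the name table

    def write(self, name: str, data: bytes) -> None:
        # A write installs new content; content referenced by a snapshot is never overwritten.
        self.live[name] = data

    def remove(self, name: str) -> None:
        del self.live[name]

    def snapshot(self, snap: str) -> None:
        # Copies only the name table, not the data, so taking a snapshot is cheap.
        self.snapshots[snap] = dict(self.live)

    def read_from_snapshot(self, snap: str, name: str) -> bytes:
        return self.snapshots[snap][name]

    def rollback(self, snap: str) -> None:
        self.live = dict(self.snapshots[snap])


ds = Dataset()
ds.write("thesis.tex", b"chapter 1")
ds.snapshot("hourly-01")
ds.remove("thesis.tex")  # the poorly typed rm

recovered = ds.read_from_snapshot("hourly-01", "thesis.tex")  # dip into the past
ds.rollback("hourly-01")                                      # or wind back the clock
```

Both recovery paths mentioned above appear here: reading one file out of an old snapshot, and rolling the whole dataset back.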

Anyway I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in future, maybe even try them once or twice, before they write their critique.

What next?

"Why even use C? Everything you can do in C, you can do in PHP anyway!"

Jaepa 8 years ago |

I think one of the universal truths in tech is that those for it and those annoyed by it both kind of miss the point.

X86BSD 8 years ago |

I think what bothers me most is this person owns a computer related business. He is actively endangering people's data out of willful ignorance. It's highly unethical.