A data corruption bug in OpenZFS?

A data corruption bug in OpenZFS? (despairlabs.com)

220 points by moviuro 2 years ago | 111 comments

cesarb 2 years ago |

IMO, part of the issue is that something which used to be just a low-level optimization (don't store large sequences of zeros) became visible to userspace (SEEK_HOLE and friends). Quoting from this article:

"This is allowed; its always safe to say there’s data where there’s a hole, because reading a hole area will always find “zeroes”, which is valid data."

But I recall reading elsewhere a discussion about some userspace program which did depend on holes being present in the filesystem as actual holes (visible to SEEK_HOLE and so on) and not as runs of zeros.

Combined with the holes being restricted to specific alignments and sizes, this means that the underlying "sequence of fixed-size blocks" implementation is leaking too much over the abstract "stream of bytes" representation we're more used to. Perhaps it might be time to rethink our filesystem abstractions?
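
For readers who have not used SEEK_HOLE and SEEK_DATA, here is a minimal Python sketch of how a userspace program can ask the filesystem which byte ranges it reports as data. The helper name `data_segments` is made up for illustration, and the sketch assumes a platform where `os.SEEK_DATA` and `os.SEEK_HOLE` exist (Linux, for example). A filesystem without hole tracking may simply report the whole file as one data segment, which is the "always safe" answer described in the quoted passage.

```python
import errno
import os


def data_segments(path):
    """Return (start, end) byte ranges that the filesystem reports as data.

    Hypothetical helper for illustration; everything outside the returned
    ranges is reported as a hole and reads back as zeros.
    """
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next offset >= `offset` that contains data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole between `offset` and EOF
                raise
            # Jump to the next hole; end of file counts as an implicit hole.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            segments.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return segments
```

The segment boundaries that come back follow the filesystem's block layout rather than the exact bytes the writer produced, which is the leak between the two abstractions mentioned above.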

codys 2 years ago | |

> But I recall reading elsewhere a discussion about some userspace program which did depend on holes being present in the filesystem as actual holes (visible to SEEK_HOLE and so on) and not as runs of zeros.

"treatment of on-disk segments as "what was written by programs" can cause areas of 0 to not be written by bmaptool copy":

https://github.com/intel/bmap-tools/issues/75

IMO, the issue here isn't filesystem or ZFS behavior; it's that bmap-tool wants an extra "don't care" bit per block, which filesystems (traditionally) don't track and which programs interacting with filesystems don't expect to exist.

Some of the comments I've made in this issue describe options to make things better.

(FWIW: the original HN link discusses a different issue around SEEK_HOLE/SEEK_DATA, and the bmap-tool issue is the reverse of what the parent posits: bmap-tool relies on explicitly written runs of zeros not being holes, and on particular behavior from the programs writing the data.)

ajross 2 years ago | |

Indeed, sparse files are simply a mistake to have included in Unix in the first place (I think we blame this on early SunOS? Not sure, though almost certain that 3BSD and v7 didn't have them). Yes, they have been used productively for various tricks, but they create a bunch of complexity that every filesystem needs to carry along with it. It's a bad trade.

retrac 2 years ago | | |

Sparse files make more sense if you see the file system and paging as unified. If you have allocated an array of 1 billion items, accessing the last item doesn't make the OS zero out everything from the 0th to the billionth item, allocating millions of pages along the way. Virtual memory is sparse, so just one page of virtual memory is allocated. Mmap'd sparse files behave the same way.
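
A rough Python sketch of the mmap case, under stated assumptions: the function name `touch_last_byte` is invented for this example, and the allocated-size figure relies on `st_blocks` being counted in 512-byte units, as the Python documentation describes it. How much space is actually allocated depends on the filesystem, so the sketch only reports the number and does not promise a value.

```python
import mmap
import os


def touch_last_byte(path, logical_size):
    """Map a file of `logical_size` bytes and write only its final byte.

    Returns (logical size, bytes the filesystem reports as allocated).
    Hypothetical helper for illustration.
    """
    with open(path, "w+b") as f:
        # Extending with truncate() creates a hole where the filesystem
        # supports sparse files; no zeros are written explicitly.
        f.truncate(logical_size)
        with mmap.mmap(f.fileno(), logical_size) as m:
            m[logical_size - 1] = 1  # dirties only the last page
            m.flush()
    st = os.stat(path)
    # st_blocks is documented as a count of 512-byte blocks.
    return st.st_size, st.st_blocks * 512
```

On a filesystem with sparse-file support, the second number is typically far smaller than the first, mirroring the single page of virtual memory in the array example.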

cogman10 2 years ago | | |

This is a feature I was completely unaware of. Why would you choose to use a sparse file instead of multiple files?

mgerdts 2 years ago |

When I think of a fs corruption bug, I think of something that causes fsck/scrub to have some work to do, sometimes resulting in a restore from backups. From the early reports of this, I was having a hard time understanding how it was a corruption bug. This excellent write-up clears that up:

> Incidentally, that’s why this isn’t “corruption” in the traditional sense (and why a scrub doesn’t find it): no data was lost. cp didn’t read data that was there, and it wrote some zeroes which OpenZFS safely stored.

dannyw 2 years ago |

Fascinating write up. As someone with a ZFS system, how can I check if I’m affected?

moviuro 2 years ago | |

It's a very rare race condition; odds are very low that you were impacted. If you were, you would have noticed (heavy builds with files being moved around, where files are suddenly all zeros).

[0] https://bugs.gentoo.org/917224

[1] https://github.com/openzfs/zfs/issues/15526 (referenced in the article)
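
As a rough illustration of the symptom described above (files that suddenly contain only zeros), here is a small Python sketch that tests whether a file is non-empty and consists entirely of zero bytes. The name `is_all_zeros` is made up for this example. It is not a reliable detector for this bug: some files are legitimately all zeros, and a file that is only partly zeroed would not be flagged.

```python
def is_all_zeros(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte in it is zero.

    Hypothetical helper for illustration; reads the file in chunks so
    large files do not need to fit in memory.
    """
    seen_any = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            seen_any = True
            # bytes.count(0) counts zero-valued bytes in the chunk.
            if chunk.count(0) != len(chunk):
                return False
    return seen_any
```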

dist-epoch 2 years ago | |

https://github.com/openzfs/zfs/issues/15526#issuecomment-181...

> zpool get all tank | grep bclone

> kc3000 bcloneused 442M

> kc3000 bclonesaved 1.42G

> kc3000 bcloneratio 4.30x

> My understanding is this: If the result is 0 for both bcloneused and bclonesaved then it's safe to say that you don't have silent corruption.
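
For anyone who wants to script the quoted check, a minimal Python sketch follows. It only parses text shaped like the output quoted above (pool name, property, value, optionally more columns) and does not run `zpool` itself. The function name `parse_bclone_properties` is hypothetical, and the parsing assumes whitespace-separated columns as shown in the quote.

```python
def parse_bclone_properties(zpool_output):
    """Extract bclone* properties from `zpool get` style text.

    Hypothetical helper for illustration. Expects lines such as
    "kc3000 bcloneused 442M" and tolerates a leading "> " quote marker
    and extra trailing columns.
    """
    props = {}
    for line in zpool_output.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) >= 3 and fields[1].startswith("bclone"):
            props[fields[1]] = fields[2]  # property name -> value string
    return props
```

The values stay as the strings printed by the tool (for example `442M`); interpreting them is left to the reader.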

keep_reading 2 years ago | | |

bclones were only one way to trigger the corruption. This is not a good way to check.

It's also not worth checking for because this bug has existed for many years. Your data probably wasn't affected. None of the massive ZFS storage companies out there ran into it by now either.

Your data is fine. Sleep easy.

LanzVonL 2 years ago |

It's important to note that the recent showstopper bugs have all been in OpenZFS, with the Oracle nee Sun ZFS being unaffected by either.

frankjr 2 years ago |

I wonder if any large storage provider has been affected by this. I know Hetzner Storage Box and rsync.net both use ZFS under the hood.

mappu 2 years ago | |

Wasabi Cloud Storage have a Sponsored-By tag on the git commit fixing the issue, so I assume they're highly involved somehow.

joshxyz 2 years ago |

Does anyone know which diagram tool he used? Thanks.

egberts1 2 years ago | |

Plantuml, doable in.

guiambros 2 years ago | | |

Any idea which diagram in PlantUML more specifically? I looked at a handful of the PlantUML categories (each one with dozens of examples) and haven't seen anything like the diagrams in OP's post.

commandersaki 2 years ago |

Excellent writeup robn!

lupusreal 2 years ago |

Is anybody using bcachefs yet?

frankjr 2 years ago | |

I'm keeping an eye on it, but it's not there yet, e.g. https://github.com/koverstreet/bcachefs/issues/619#issuecomm...

ktm5j 2 years ago | | |

Well, to be fair they tried and failed to reproduce the corruption that was reported. While I agree that I'm not ready to dive into bcachefs, I'm not exactly swayed by this bug report.

MenhirMike 2 years ago |

Periodic reminder to check that your backups are working and that you can actually restore them. It doesn't matter which file system or operating system you use; make sure to back up your stuff, and do it in a way that's immune to ransomware as well: not just RAID-1/5/Z or another form of hot/warm storage (RAID is not a backup, it's an uptime/availability mechanism), but cold storage. (I snapshot and tar that snapshot every night, then back it up both on tape and in the cloud.)

hulitu 2 years ago |

> This whole madness started because someone posted an attempt at a test case for a different issue, and then that test case started failing on versions of OpenZFS that didn’t even have the feature in question.

One would expect more seriousness from filesystem maintainers, and serious regression testing before a release.

amelius 2 years ago | |

Shouldn't we expect formal verification methods, even? Or is that too much to ask for?