OpenZFS 2.0

364 points by ascom 5 years ago | 148 comments

Has there been any progress on the zfs on linux Linus disagreement front since this article?

https://arstechnica.com/gadgets/2020/01/linus-torvalds-zfs-s...

ed25519FUUU 5 years ago | |

ZFS on Linux has been available as a root partition since 20.04. Working quite well, I might add!

3np 5 years ago | | |

That’s Ubuntu-specific where they provide their own kernel bundled with ZFS. It was working fine before 20.04 as well in the same way it does for other distros. Has nothing to do with the comment you’re replying to.

yjftsjthsd-h 5 years ago | | |

It always worked, the question is how much work they have to do to work around a kernel that dislikes them.

gilrain 5 years ago | | |

It's still a big pain if you like to keep your kernel relatively up to date. I switched to btrfs; it just working is worth the few extra warts over ZFS.

tpetry 5 years ago |

Zstd compression with configurable levels is really interesting: you could first write every block at a level comparable to lz4 for very fast performance. Then, if a block has not been rewritten for some time, you recompress it at a higher level that gives more compression with comparable decompression performance.

So cold data (cold write, cold/hot read) will take less and less space over time while still having the same read performance.
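As a rough illustration of that two-pass policy, here is a minimal Python sketch. It uses the standard library's zlib as a stand-in for lz4/zstd, and the level constants, the 30-day threshold, and the function names are all assumptions made up for the example. It only models the policy and says nothing about whether a given filesystem can rewrite blocks in place.

```python
import zlib

FAST_LEVEL = 1  # assumption: stands in for an lz4-like fast level
SLOW_LEVEL = 9  # assumption: stands in for a high zstd level


def write_block(data: bytes) -> bytes:
    """Compress a freshly written block with the fast level."""
    return zlib.compress(data, FAST_LEVEL)


def recompress_if_cold(stored: bytes, age_days: float,
                       cold_after_days: float = 30.0) -> bytes:
    """Recompress a block harder once it has gone unmodified long enough.

    The existing encoding is kept if the block is still warm, or if the
    stronger level does not actually save space.
    """
    if age_days < cold_after_days:
        return stored
    candidate = zlib.compress(zlib.decompress(stored), SLOW_LEVEL)
    return candidate if len(candidate) < len(stored) else stored


if __name__ == "__main__":
    block = b"cold data that rarely changes. " * 4096
    hot = write_block(block)
    cold = recompress_if_cold(hot, age_days=90)
    print(len(block), len(hot), len(cold))
```

Readers see the same bytes either way; only the stored size changes once a block is considered cold.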

KMag 5 years ago | |

That would be an even more interesting feature for NILFS2. As I understand it, its ring buffer structure requires moving the oldest unmodified blocks as the ring buffer write frontier approaches. Any blocks that are forced to be copied are by definition old and unmodified, and need to be moved anyway, so why not recompress? AFAIK, there are no plans for compression in NILFS, but I think it's an interesting idea.

rcthompson 5 years ago | |

My understanding is that for ZFS, things like this would require a mythical feature called "block pointer rewrite", the same feature required to implement out-of-band deduplication.

rincebrain 5 years ago | | |

You are correct - ZFS very deeply hardcodes the assumption that data's location on disk will never change once written, and offline dedup or data migration of any sort would require changing that.

(It would also be a performance nightmare - you'd have a permanent indirection table you'd need to use for _everything_, and if you've ever seen how ZFS dedup performs with its indirection table not on dedicated SSDs, you can understand why this is terrible.)

tpetry 5 years ago | | |

The block could still be rewritten from the view of ZFS as long as it does not update the last-written timestamp (does ZFS have this?). I was just describing what it would look like from a bird's-eye view.

throw0101a 5 years ago |

Sadly dRAID (parity Declustered RAIDz) just missed the cut-off for 2.0, but it looks like it will be in 2.1:

* https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAI...

* https://www.youtube.com/watch?v=jdXOtEF6Fh0

Nican 5 years ago | |

dRAID looks really fascinating, but the presentation is pretty abstract. Would it allow adding/removing drives from a pool, and let ZFS rebalance itself?

Would be great for home use, where I have a lot of drives that I collected over the years that are not the same size.

EDIT: The more I read into this, it still seems to assume that all drives must be of the same size.

diegocg 5 years ago | | |

I don't think so. The essence of dRAID is that, instead of keeping a spare drive unused in case one of the working drives fails, it incorporates the spare drive into the array and uses it, but one drive's worth of free space is reserved randomly across the entire array.

That way, if one disk fails, the reserved space is used to write the data necessary to keep the array consistent. Because the free space is distributed randomly across the array, the write performance of a single drive doesn't become a bottleneck.

This is unrelated to the ability to remove drives from a pool (which is difficult to support in ZFS due to design constraints)
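A toy model of the distributed-spare idea described above, sketched in Python. This is not dRAID's actual layout algorithm: the per-row random placement, the function names, and the drive counts are assumptions chosen only to show why rebuild writes stop funnelling into a single drive.

```python
import random
from collections import Counter


def rebuild_writes_dedicated(num_rows: int, spare_drive: int) -> Counter:
    """Dedicated hot spare: every rebuilt row is written to the one spare."""
    return Counter({spare_drive: num_rows})


def rebuild_writes_distributed(num_rows: int, num_drives: int,
                               failed: int, seed: int = 0) -> Counter:
    """Distributed spare: each row reserves its spare slot on a random drive.

    When `failed` dies, a row's lost chunk is rebuilt into that row's
    spare slot, so the rebuild writes land on many different drives.
    """
    rng = random.Random(seed)
    writes = Counter()
    for _ in range(num_rows):
        spare = rng.randrange(num_drives)
        if spare == failed:
            continue  # the failed drive only held this row's empty spare slot
        writes[spare] += 1
    return writes


if __name__ == "__main__":
    print(rebuild_writes_dedicated(10_000, spare_drive=8))
    print(rebuild_writes_distributed(10_000, num_drives=8, failed=3))
```

In the dedicated case one drive absorbs all 10,000 writes; in the distributed case each surviving drive takes only a fraction of them.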

hardwaresofton 5 years ago | | |

Maybe this presentation by Mark will help?

dRAID, Finally![0]

[0]: https://www.youtube.com/watch?v=jdXOtEF6Fh0

ecnahc515 5 years ago | | |

This sounds like synology hybrid raid, which uses lvm and mdadm together for something similar if I recall.

pantalaimon 5 years ago | | |

You can already do that with btrfs.

codetrotter 5 years ago |

This is huge! And very exciting :D

One thing I am wondering about is this:

> Redacted zfs send/receive - Redacted streams allow users to send subsets of their data to a target system. This allows users to save space by not replicating unimportant data within a given dataset or to selectively exclude sensitive information. #7958

Let’s say I have a dataset tank/music-video-project-2020-12 or something and it is like 40 GB and I want to send a snapshot of it to a remote machine on an unreliable connection. Can I use the redacted send/recv functionality to send the dataset in chunks at a time and then at the end have a perfect copy of it that I can then send incremental snapshots to?

kogir 5 years ago | |

zfs send supports a resume token (-t) to resume interrupted streams received with (-s). Just use normal send/receive until you have the full stream sent.
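A minimal Python sketch of the commands involved, for illustration. It only builds argument lists and runs nothing; the dataset name `backup/music` and the token value are placeholders, and how the stream is transported (ssh, netcat, etc.) is left out. The `receive_resume_token` property is where the receiving side exposes the token after an interrupted `zfs receive -s`.

```python
import shlex
from typing import List


def resumable_receive_cmd(target: str) -> List[str]:
    """Receive side: -s keeps partial state if the stream is interrupted."""
    return ["zfs", "receive", "-s", target]


def resume_token_cmd(target: str) -> List[str]:
    """Receive side: read the token left behind by an interrupted receive."""
    return ["zfs", "get", "-H", "-o", "value", "receive_resume_token", target]


def resume_send_cmd(token: str) -> List[str]:
    """Send side: restart the stream from where the token says it stopped."""
    return ["zfs", "send", "-t", token]


if __name__ == "__main__":
    target = "backup/music"          # hypothetical dataset on the receiver
    token = "<token-from-receiver>"  # placeholder, not a real token
    for cmd in (resumable_receive_cmd(target),
                resume_token_cmd(target),
                resume_send_cmd(token)):
        print(shlex.join(cmd))
```

On each interruption, fetch the token on the receiver and feed it to a new send, repeating until the full stream has arrived.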

0xCMP 5 years ago | |

I think it's more for when you don't want to send scratch or cached files: you can have it automatically remove them from the snapshot being sent.

> Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots).

Think of it like a selective sync in Dropbox or SyncThing at the FS level.
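The three stages quoted above could be laid out roughly like this. It is a sketch that only builds command lists in Python; the dataset, clone, and bookmark names are hypothetical, and the exact `zfs redact` and `zfs send --redact` argument order is written from memory of the man pages, so check them before relying on it.

```python
from typing import List


def redacted_send_plan(snapshot: str, clone: str, bookmark: str) -> List[List[str]]:
    """Return the commands for a redacted send, in order.

    Between the clone and snapshot steps you would delete or modify the
    unwanted files inside the mounted clone; that step is not modelled here.
    """
    redaction_snapshot = f"{clone}@redact"
    return [
        # Stage 1: clone the snapshot that is going to be sent.
        ["zfs", "clone", snapshot, clone],
        # Stage 2: after pruning the clone, snapshot it.
        ["zfs", "snapshot", redaction_snapshot],
        # Stage 3a: record which blocks to withhold as a redaction bookmark.
        ["zfs", "redact", snapshot, bookmark, redaction_snapshot],
        # Stage 3b: send the original snapshot minus the redacted blocks.
        ["zfs", "send", "--redact", bookmark, snapshot],
    ]


if __name__ == "__main__":
    for cmd in redacted_send_plan("tank/project@full",
                                  "tank/project-redacted",
                                  "book1"):
        print(" ".join(cmd))
```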

vorpalhex 5 years ago | |

That's a protocol problem, use a protocol such as rsync. You don't need to use redacted sends/recvs.

rleigh 5 years ago | | |

rsync doesn't scale like zfs send/recv. It requires scanning of every file at both the source and destination to compute the delta to send. zfs snapshots and send/recv don't need to do that. The delta is already fully described by the snapshots themselves. zfs is also working with immutable snapshots. It guarantees the source and destination copies are identical; rsync can't do much about the source and destination being modified while it is running since it's reliant upon other users of the system not touching the data being synced.

That's not to say rsync doesn't work. It does. But it doesn't scale well, and the data integrity guarantees aren't there.
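To illustrate why the snapshot itself can describe the delta, here is a heavily simplified Python model. It assumes each block records the transaction group (txg) in which it was written and that a parent is rewritten whenever a child changes, which is how copy-on-write trees behave in general; the `Node` class and the numbers are invented for the example and are not ZFS's real data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    birth_txg: int  # txg in which this block was last written
    children: List["Node"] = field(default_factory=list)


def changed_leaves(node: Node, from_txg: int, visited: List[str]) -> List[str]:
    """Collect leaf blocks written after `from_txg`, skipping old subtrees."""
    visited.append(node.name)
    if node.birth_txg <= from_txg:
        return []  # nothing below here changed since the older snapshot
    if not node.children:
        return [node.name]
    changed: List[str] = []
    for child in node.children:
        changed.extend(changed_leaves(child, from_txg, visited))
    return changed


def example_tree() -> Node:
    a = Node("A", 100, [Node("a1", 100), Node("a2", 90)])
    b = Node("B", 120, [Node("b1", 120), Node("b2", 80)])
    return Node("root", 120, [a, b])


if __name__ == "__main__":
    visited: List[str] = []
    print(changed_leaves(example_tree(), from_txg=100, visited=visited))
    print(visited)
```

With the older snapshot at txg 100, the walk never descends into subtree A at all, whereas a file-level tool has to examine every file on both sides.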

XorNot 5 years ago | | |

rsync has its own issues if the connection has high latency though - zfs send was originally developed by a Sun engineer who wanted to speed up large transfers to servers in China, if I recall correctly.

nix23 5 years ago | | |

+1 for rsync, but with check-summing turned on. I think that's acceptable for 40GB.

anderspitman 5 years ago |

I'd love to get rid of my FreeNAS VM and run ZFS directly on my Linux desktop, but having to mess with the kernel has kept me from attempting it so far. Maybe I'm worrying about nothing.

btrfs seems like the main alternative if you want native kernel support, but when I checked a couple years ago there seemed to be a lot of concerns about the stability. Is that still the case?

qalmakka 5 years ago |

Finally, this means we have a way to share "real" filesystems on both FreeBSD and Linux. The only other filesystems you could open without issues on both are FAT and NTFS (through NTFS-3G), both of which are less than ideal for data you care about.

justinclift 5 years ago |

Slightly off topic, but it seems like GitHub can't/won't display the user profile page for one of the OpenZFS developers:

https://github.com/behlendorf

For me, that gives a unicorn 100% of the time (tried across several minutes), instead of showing the developer profile.

Anyone else seeing that?

jclulow 5 years ago | |

It does, indeed, report that "This page is taking too long to load."!

justinclift 5 years ago | | |

Yeah, it's still unicorning for me, about a day later. :(

rincebrain 5 years ago | |

Loaded in under 5 seconds flat for me, perhaps it's something strange with whatever edge server you're hitting?

justinclift 5 years ago | | |

Could be, but if so it's persistent. It's about a day later now, and the page still won't load.

bromonkey 5 years ago | |

It loaded for me earlier today, I think GitHub is just having issues.

rodgerd 5 years ago |

Congratulations - it's great to see the code unification on the two key ZFS platforms, and continuing to add useful features, especially around at-rest encryption.

Many thanks to the various OpenZFS contributors.

KMag 5 years ago |

How's the memory consumption of ZFS without deduplication these days? I've got a couple of 4 TB drives connected to a single board ARM computer with 2 GB of RAM. I used to use btrfs, but switched to XFS after I accidentally filled up a drive and was unable to recover.

rincebrain 5 years ago | |

ZFS without dedup will just run slower with less RAM available for caching, up to a point. (I think the lowest ARC size I've seen someone run it with in recent memory is 128 MB? I believe 32 MB or so is the minimum below which OpenZFS will just ignore you if you try to tell it to use less...)

I've seen people use it as a rootfs on RPis, and have personally run it on Pis for brief occasions without encountering any RAM problems.

mholt 5 years ago |

I'm looking at setting up my first ZFS pool ('zpool'?) in a few weeks, on Linux. Will I be using OpenZFS or something else? Ubuntu 20.04.

(Sorry if noise; I'm just trying to get an idea of how relevant this 2.0 release is to me.)

iotku 5 years ago | |

> The ZFS on Linux project has been renamed OpenZFS! Both Linux and FreeBSD are now supported from the same repository making all of the OpenZFS features available on both platforms.

Previously it was called ZFS on Linux, but now ZFS development is unified on the "OpenZFS" codebase shared between Linux and FreeBSD, since much of the development effort for ZFS in general ended up there.

mholt 5 years ago | | |

Ah, I was wondering what happened since I stopped hearing about "ZFS on Linux" so now I know what to search for. Thanks!

mlex 5 years ago |

Just built a FreeNAS system over the past couple weeks and finished doing burn-in tests of my hard drives, wonder if I should wait and see how to install OpenZFS 2.0.0 before I create my storage config.

1over137 5 years ago | |

FreeNAS 12 (now named TrueNAS) is already using OpenZFS 2.0, or very nearly.

nraynaud 5 years ago | | |

Does it support NFS 4.2 (fallocate, sparse files, and server-side copy)?

ed25519FUUU 5 years ago | |

Aren't ZFS upgrades to existing vdevs really simple? I don't see any reason why you need to wait.

mlex 5 years ago | | |

That’s the idea I’ve gotten when looking around online. I figured I was in the uncommon situation of having a completely blank and ready system, so I could afford to just wait a few days.

1over137 5 years ago | | |

Yes, ZFS upgrades are really simple, but they are one-way, you can't downgrade after.

rodgerd 5 years ago | | |

They certainly seem to be within OpenZFS over the past few years.

voltagex_ 5 years ago |

Anyone know what version of Ubuntu Server this will land in?

cogman10 5 years ago | |

Likely 21.04. I doubt they'll pull it into 20.10 or 20.04.

GlitchMr 5 years ago | |

Probably 21.04. 22.04 if you want an LTS release.

jstrong 5 years ago |

hooray for zstd compression!

ed25519FUUU 5 years ago |

Side note, they really should have in big-bold letters "DO NOT ENABLE DEDUPLICATION UNLESS YOU HAVE A TON OF RAM!" on their readme. That was a huge mistake on my part. The ram requirements are VERY high for good performance.

I realized how bad the performance was when it took about 2 hours to delete 1000 files.

freddie_mercury 5 years ago | |

It does already say that. This is what it says:

Deduplication is the process for removing redundant data at the block level, reducing the total amount of data stored. If a file system has the dedup property enabled, duplicate data blocks are removed synchronously. The result is that only unique data is stored and common components are shared among files.

Deduplicating data is a very resource-intensive operation. It is generally recommended that you have at least 1.25 GiB of RAM per 1 TiB of storage when you enable deduplication. Calculating the exact requirement depends heavily on the type of data stored in the pool.

Enabling deduplication on an improperly-designed system can result in performance issues (slow IO and administrative operations). It can potentially lead to problems importing a pool due to memory exhaustion. Deduplication can consume significant processing power (CPU) and memory as well as generate additional disk IO.
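As a rough worked example of the 1.25 GiB per TiB guideline quoted above, in Python. The function name is made up for illustration, and the result is only the rule-of-thumb figure; as the quoted text says, the real requirement depends heavily on the data.

```python
def dedup_ram_gib(pool_tib: float, gib_per_tib: float = 1.25) -> float:
    """Rule-of-thumb RAM (GiB) suggested before enabling dedup on a pool."""
    return pool_tib * gib_per_tib


if __name__ == "__main__":
    for size_tib in (1, 8, 40):
        print(f"{size_tib} TiB pool -> about {dedup_ram_gib(size_tib):.2f} GiB RAM")
```

By this guideline, an 8 TiB pool would want about 10 GiB of RAM for dedup, on top of whatever the rest of the system needs.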

1over137 5 years ago | |

That's not new with 2.0 though. It's forever been the case with ZFS. Everything that discusses dedupe basically says: 'don't use it'.

Mashimo 5 years ago | |

Most guides I read tell you that you should not enable DEDUP unless you know what you are doing, and that it will use a lot of RAM.

zmix 5 years ago | |

To me this sounds more like you didn't RTFM ;-)

hlandau 5 years ago |

Will OpenZFS on Linux ever be integrated with the Linux page cache?

keeperofdakeys 5 years ago | |

Probably never. ZFS isn't just a filesystem, it was developed to be an entire storage system that's vertically integrated, so ARC is a fundamental part of the filesystem design.

ZFS also has a huge legacy. Right now the license (probably) prevents you from legally shipping a compiled ZFS module with the Linux kernel, and just solving that seems insurmountable. It's also supported on Illumos and FreeBSD; trying to refactor it to use the Linux page cache would risk introducing bugs on these platforms.

RantyDave 5 years ago | |

ZFS isn't really designed for local 'temporary' file systems (IMHO). You don't really need to nest checksums, create snapshots or volume manage when you're slugging pages between ram and nvme.

nix23 5 years ago | |

No, they have ARC and L2ARC. If you want the traditional thing, go to NILFS2 or BTRFS, or in the future XFS (when they have full check-summing).

curt15 5 years ago | | |

>in the future XFS (when they have full check-summing).

Is this actually planned?

kzrdude 5 years ago |

OpenZFS is in fact a more prestigious name and it already sounds better than ZFS on Linux.

solarengineer 5 years ago | |

If you get on the calls, you’ll find zero hostility across the operating system devs. The focus is on OpenZFS, with the Linux branch gradually becoming the baseline for the FreeBSD work as well. Illumos (where OpenZFS originated after Illumos was formed following the OpenSolaris shutdown) hasn’t moved to this baseline yet due to the significant OS-level differences; instead, code is pulled between the “branches” as needed. The collaboration happens via email and regular calls.