Kioxia Demonstrates Raid Offload Scheme for NVMe Drives (anandtech.com)

36 points by vanburen 1 year ago | 53 comments

sliken 1 year ago |

Seems like quite a bit of complexity for a dubious win. Today's CPUs are REALLY fast, even a single core out of the 64 or more that are common on servers today:

  [  478.047970] raid6: avx2x2   gen() 60473 MB/s
  [  478.115971] raid6: avx512x1 gen() 53469 MB/s
  [  478.149971] raid6: avx512x2 gen() 57067 MB/s

Especially since the data is coming from the CPU anyways, so the caches are likely warm. It also means that you now have to send a stripe of data to a single NVMe drive, which likely has much less than 60GB/sec of checksum speed, and then initiate transfers to every other drive in the stripe. Not to mention the NVMe drive likely doesn't have ECC memory, and any resulting memory errors are unlikely to be visible to the OS.

Just seems like hardware RAID with all the same problems, likely not as fast as software RAID, harder to manage, a unique set of tools per vendor, harder to have global spares, and doesn't work with filesystems that do their own redundancy like ZFS.

everfrustrated 1 year ago | |

The problem is not the CPU per se, but the PCI bandwidth congestion between CPU<->PCI lanes.

These NVME drives can talk directly to each other for raid which means a much larger total bandwidth is available, and potentially improved latency also.

RDMA means you might not be serving via CPU at all.

sliken 1 year ago | | |

The difference isn't that big though. Sure, with software RAID, if you write 1GB to an 8 disk RAID6 you actually write 8/6 x 1GB = 1.33GB. But with RAID offload you nearly double the NVMe bandwidth consumed, (n - 1) x 2.
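The 8/6 figure can be checked with a few lines of Python. This is only a sketch of the software RAID side of the comparison, assuming full-stripe writes; the function name and parameters are made up for illustration.

```python
from fractions import Fraction

def software_raid_write_bytes(payload_bytes, total_disks, parity_disks):
    """Bytes that reach the drives for a full-stripe write (hypothetical helper).

    Assumes every stripe holds (total_disks - parity_disks) data chunks plus
    parity_disks parity chunks, all the same size.
    """
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("need at least one data disk")
    return payload_bytes * Fraction(total_disks, data_disks)

# 1 GB written to an 8 disk RAID6 (2 parity disks), as in the comment above.
payload = 10**9
written = software_raid_write_bytes(payload, total_disks=8, parity_disks=2)
print(f"{float(written) / 1e9:.2f} GB written for 1 GB of payload")
```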

I also wonder: if you have 8 NVMe drives and write a stripe to one, it does the RAID calc and sends each disk its share of the stripe. What happens if the master NVMe dies? It's not really a RAID if a single disk can kill the RAID.

wmf 1 year ago | | |

PCIe P2P transfers go through the CPU so... no? I think what it's saving is main memory bandwidth.

zamadatix 1 year ago | |

Based on the numbers from the article it seems the problem is less how fast a CPU core can crunch numbers and more how much extra memory bandwidth it consumes to do so. Testing the AVX throughput of a single core in a storage only test skips that consideration because there is no memory bandwidth contention or usage consideration.

sliken 1 year ago | | |

Sure, but to write 1GB you stream 1GB from RAM -> CPU in either case. With software RAID you do the calcs (60GB/sec per core) and then write 1.33GB to the storage controller. That just doesn't seem like much of a difference: the CPU overhead is near zero (actual I/O / (64 x 60GB/sec)), and writing an extra 1/3rd for the redundancy data seems in the noise for normal server loads.
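A back-of-the-envelope version of that overhead estimate, as a sketch only. The 64 cores and 60GB/sec per core come from the comments above; the 50GB/sec I/O rate and the function name are assumptions picked for illustration, and the model ignores memory bandwidth entirely.

```python
def parity_cpu_fraction(io_gb_per_s, cores=64, per_core_gb_per_s=60.0):
    """Fraction of the machine's total parity throughput that a given I/O rate uses.

    Hypothetical helper: assumes parity generation scales linearly across cores
    and that nothing else (memory bandwidth, caches) is the limit.
    """
    return io_gb_per_s / (cores * per_core_gb_per_s)

# Assumed workload: 50 GB/s of writes on a 64 core server.
fraction = parity_cpu_fraction(50.0)
print(f"parity work uses about {fraction:.2%} of total CPU parity capacity")
```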

Not to mention I'd expect the parity calculations to be MUCH slower on the NVMe controllers.

topspin 1 year ago | |

> and doesn't work with filesystems that do their own redundancy like ZFS

Really? You somehow can't create a ZFS file system on a hardware RAID block device? Seems like that means the hardware RAID isn't the otherwise transparent block device it's supposed to be for the OS and whatever file systems it cares to employ.

Your concerns about management, tools and spares are correct for many use cases. Some use cases, like cloud operators that don't suffer the burdens of long term management at that level of detail (where entire racks and generations of hardware are cycled in/out as a working unit, with ample spares at hand, under contract), won't care about that. They'll care about the nice efficiency gain. When you operate like that you can accommodate sophisticated integration such as this for efficiency gains.

sliken 1 year ago | | |

> Really? You somehow can't create a ZFS file system on a hardware RAID block device?

Sure you can do it, have two layers of checksums and a volume manager on top of a volume manager. But ZFS is designed to talk directly to block devices and try to detect and complain about the numerous failure modes. Like say a parity calc that goes awry because of a memory error.
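The kind of detection described here can be sketched in Python. This is not ZFS code: the dict stands in for a block device, SHA-256 stands in for whichever checksum the pool is configured with, and all names are hypothetical. The point is only that a checksum kept apart from the data lets the filesystem notice when a lower layer hands back the wrong bytes.

```python
import hashlib

class ChecksumError(Exception):
    """Raised when a block read back does not match its recorded checksum."""

def write_block(store, key, data):
    """Store a block and return its checksum, which the caller keeps elsewhere."""
    store[key] = bytes(data)
    return hashlib.sha256(data).digest()

def read_block(store, key, expected_checksum):
    """Read a block and verify it against the checksum recorded at write time."""
    data = store[key]
    if hashlib.sha256(data).digest() != expected_checksum:
        raise ChecksumError(f"block {key!r} failed verification")
    return data

device = {}                      # stand-in for a block device
checksum = write_block(device, "blk0", b"important data")
device["blk0"] = b"importent data"   # simulate a bad parity calc or memory error below the filesystem
try:
    read_block(device, "blk0", checksum)
except ChecksumError as err:
    print("detected:", err)
```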

For this and other reasons, even with hardware RAID it's recommended to configure the controller in JBOD mode.

I've also seen numerous cases where software RAID on top of hardware RAID running in JBOD mode is faster than just using hardware RAID.

> When you operate like that you can accommodate sophisticated integration such as this for efficiency gains.

Sure, if there are efficiency gains. If bandwidth to the storage controller is your limiting factor, you might get a 33% increase in I/O. But for that to be true you need:

  * The bottleneck not to be elsewhere
  * The controller inside an NVMe device (often passively cooled) to be faster than the one on the CPU
  * The bandwidth between the PCI controller or PCIe switch and the NVMe controller to not care about a 2x increase in needed bandwidth

Seems unlikely to me.

mrktf 1 year ago |

I imagine this kind of scheme could be implemented as a sort of on-device eBPF filter (in layman's terms, CUDA but for storage). It would allow deeper integration with the system, for example a hardware accelerated/integrated LVM (obviously the speedup would depend on the use case: less of a win for thin volumes, more of an advantage for RAID, and so on). Or, from the other side, deeper integration with filesystems such as ZFS, btrfs, and bcachefs.

benlwalker 1 year ago | |

We tried to standardize exactly this - eBPF programs offloaded onto the device. The NVMe standard now has a lot of infrastructure for this standardized, including commands to discover device memory topology, transfer to/from that memory, and discover and upload programs. But one of the blockers is that eBPF isn't itself standardized. The other blockers are vendors ready and willing to build these devices and customers ready to buy them in volume. The extra compute ability will introduce some extra cost.

I'm still hopeful that we see it happen some day.

doctorpangloss 1 year ago | | |

> The NVMe standard now has a lot of infrastructure for this standardized, including commands to discover device memory topology, transfer to/from that memory, and discover and upload programs.

On the other hand, Windows and Linux still cannot just upgrade the firmware on the vast majority of NVMe devices, least of all consumer ones, despite firmware updates being completely and utterly standardized.

You have to wonder, if Samsung makes bullshit, and then this https://github.com/chrivers/samsung-firmware-magic becomes part of the ecosystem, why trust the vendors with anything else?

sroussey 1 year ago | | |

So, could you upload malware to the drive that way?

cm2187 1 year ago | |

What I don't get is that RAID5 is a simple xor. It should be a trivial operation, that would be equally trivial to hardware accelerate.
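To make the "simple xor" concrete, here is a minimal Python sketch of single-parity generation and rebuild. The function names are hypothetical, and real implementations work on large aligned buffers with SIMD rather than byte by byte.

```python
from functools import reduce

def xor_parity(chunks):
    """Return the bytewise XOR of equally sized chunks (the RAID5-style parity)."""
    if not chunks:
        raise ValueError("need at least one chunk")
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("all chunks must be the same size")
    # zip(*chunks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def rebuild_missing(surviving_chunks, parity):
    """Recover the single missing data chunk from the survivors plus parity."""
    return xor_parity(list(surviving_chunks) + [parity])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(data)
# Pretend the second chunk's drive failed and rebuild it from the rest.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])
```

XOR is its own inverse, so the same routine serves for both generating parity and rebuilding a lost chunk.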

What I am most puzzled by is how bad parity (i.e. RAID5) is in Windows Storage Spaces. A modern CPU should be able to xor data at several gigabytes per second. And it seems that even after optimizing the block sizes, Windows Storage Spaces parity caps out at a couple hundred MB/s.

rkagerer 1 year ago | |

Is this similar to Graid's products that have been out for a while? They basically use a GPU as a RAID controller.

Dylan16807 1 year ago | | |

A GPU RAID card does give you some flexibility benefits. But it's also bottlenecked by the single slot.

przemub 1 year ago |

When I first bought Kioxia flash I thought it was a random Chinese knockoff. Shame they ditched the Toshiba brand on these.

baruch 1 year ago |

NVMe drives fail at a fairly low rate, so this is an optimization for a very small edge case, and since they are also very fast it's not like you'll be doing a rebuild for 6+ hours like with HDDs.

It also doesn't change anything for distributed storage.

ein0p 1 year ago |

I’d much prefer NVMe colocated compute. Imagine a columnar storage engine able to filter and aggregate data during scans without reading it through PCIe, for example.

jakedata 1 year ago | |

ScaleFlux https://scaleflux.com computational storage might offer some of what you are imagining. Their NVMe drives have onboard ARM cores and perform hardware compression and advanced flash management with no drivers beyond standard NVMe. I believe you can tap into the computational capabilities with additional code.

jamesfmilne 1 year ago |

HW RAID is dead, they need to get over it.

We've had good experience with Xinnor, but it's a shame it's proprietary.

I'd love to see a high-performance open-source erasure coding solution for NVMe. The built-in offerings in Linux are not cutting it.

znpy 1 year ago |

I wonder how (if?) this will interact and integrate with the current software stack and the various volume-managing filesystems (zfs but also btrfs).

wmf 1 year ago | |

It probably won't. These clever tricks usually don't come to market.