How to Write to SSDs [pdf](vldb.org)
The extended version is available on arXiv if you’d like more details: https://arxiv.org/pdf/2603.09927
The appendix includes additional details and FAQ-style answers that did not fit into the VLDB version.
That they got this to work on regular commodity SSDs (from multiple vendors) is very impressive.
In our paper, we only evaluated with regular XFS (see Section 10.3, “What happens if a filesystem is used?” in the arXiv version), but evaluating Zoned XFS would definitely be interesting as well.
- I'm not sure how to interpret Figure 1. It says "Flash writes (KB) per page", but it doesn't really say which page sizes were used. AFAIK MySQL has 16K by default, PostgreSQL has 8K, and LeanStore has 4K, which makes the numbers a bit hard to compare.
- Likewise, I'm a bit unsure about the doublewrite buffering in Postgres, described as "indirect". Postgres doesn't really do doublewrite (we really should, I think); we write pages to WAL and then to data files. I assume that's what is meant by "indirect" in the paper. But this very much depends on the checkpoint frequency and write pattern, as the FPI (full-page image) is written only for the first change to a page after a checkpoint. I wonder if the results in the paper consider this. Maybe the workload is such that it always hits the page just once between checkpoints (i.e. a worst case). Also, the WAL part is nicely sequential, which should play nice with SSDs.
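To make the checkpoint-frequency point concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which the first update to a page after a checkpoint logs one full-page image plus a small record, and every later update logs only the small record. The function name and the 100-byte record size are made up for illustration; real WAL records have headers, and options such as WAL compression change the numbers.

```python
def wal_bytes_per_update(updates_per_page, page_size=8192, record_size=100):
    """Average WAL bytes per update under a simplified full-page-image model.

    Assumption (not from the paper): one full-page image per page per
    checkpoint cycle, plus `record_size` bytes for every update.
    """
    if updates_per_page < 1:
        raise ValueError("need at least one update per page per checkpoint cycle")
    total = page_size + updates_per_page * record_size
    return total / updates_per_page


# The worst case is one update per page between checkpoints; the cost per
# update shrinks as more updates hit the same page within a cycle.
for n in (1, 10, 100):
    print(n, round(wal_bytes_per_update(n), 1))
```

In this toy model the full-page image dominates when each page is touched once per cycle and is amortized away when pages are touched many times, which is why the measured overhead depends so much on the workload.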
The caption of Figure 1 lists the page size used by each system (i.e., the default configuration).
We use different page sizes across systems, and as you said, it is a bit difficult to compare them directly apples-to-apples. This is actually intentional, because it also exposes the B-tree index-level write amplification effects. In that sense, Figure 1 kind of suggests that larger page sizes may not necessarily be great for write amplification.
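A minimal sketch of the page-size effect, assuming the worst case where a page is written back after a single small change. The 128-byte change size and the function name are hypothetical, and the model ignores WAL, doublewrite, and SSD-internal amplification entirely.

```python
def page_write_amplification(page_size, changed_bytes):
    """Bytes written per byte changed when a whole page is flushed
    for one small modification (worst-case assumption)."""
    if not 0 < changed_bytes <= page_size:
        raise ValueError("changed_bytes must be in (0, page_size]")
    return page_size / changed_bytes


# Default page sizes mentioned in the thread: 4K, 8K, 16K.
for name, size in (("4K", 4096), ("8K", 8192), ("16K", 16384)):
    print(name, page_write_amplification(size, changed_bytes=128))
```

Under this assumption the amplification grows linearly with page size, so doubling the page size doubles the bytes written for the same logical change.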
And yes, you are correct regarding Postgres. Instead of having a separate doublewrite buffer file, Postgres relies on WAL full-page writes, which indirectly trigger additional checkpoint writes, so the effect is not entirely straightforward to quantify. To explain that, we discuss how we measured DB WAF for Postgres in Section 10.7 (“How can we calculate DB WAF on other DBMSs?”) of the appendix version: https://arxiv.org/pdf/2603.09927
Regarding the WAL part, yes, the WAL itself is nicely sequential and should generally behave well on SSDs. But once it gets mixed with small random writes that are eventually persisted to flash, it will unfortunately still likely suffer from SSD WAF.
Not every database architecture will be able to easily take advantage of all these techniques. Some designs are much more easily optimizable than others.
I would expect that a similar analysis can be done for SQLite, maybe with a different dataset and a single write thread.
The degree of the resulting write amplification depends on several factors, including the fill factor, write skewness, and the write rate relative to the SSD characteristics. We discuss this in more detail in Section 10.2, “When should the DBMS care about WAF?” in the extended arXiv version.
There is also this paper on SQLite/mobile storage and zoned devices that may be relevant in this context: https://www.usenix.org/system/files/atc24-hwang.pdf
Unless the write access pattern repeatedly hits the internal write buffer such that many updates are absorbed before they ever need to be persisted to flash.
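A toy illustration of that absorption effect, assuming the device buffer behaves like a small LRU cache of pages that only writes a page to flash when it is evicted or at the final flush. Real SSD write buffers are more complicated; the function name and parameters are invented for this sketch.

```python
from collections import OrderedDict


def flash_writes(page_ids, buffer_pages):
    """Count page writes that reach flash for a sequence of host writes,
    given an LRU write buffer holding `buffer_pages` pages (toy model)."""
    buf = OrderedDict()
    flushed = 0
    for pid in page_ids:
        if pid in buf:
            buf.move_to_end(pid)       # update absorbed in the buffer
            continue
        buf[pid] = True
        if len(buf) > buffer_pages:
            buf.popitem(last=False)    # evict least recently written page
            flushed += 1
    return flushed + len(buf)          # remaining pages are flushed at the end


# 1000 host writes to one hot page collapse into a single flash write.
print(flash_writes([7] * 1000, buffer_pages=4))

# 1000 host writes cycling over 100 pages never hit the 4-page buffer.
print(flash_writes(list(range(100)) * 10, buffer_pages=4))
```

The skewed pattern is almost fully absorbed, while the round-robin pattern gets no benefit from the buffer at all.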
It's a fairly simple concept that gives you some write affinity: when writing, you tag the write with an FDP number, declaring that it should be placed together with other writes carrying the same number.
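Here is a toy model of why that kind of tagging can help. It is not the NVMe interface, just a sketch: writes are packed into fixed-size erase blocks, either all in arrival order or grouped by tag, and we count how many still-valid pages garbage collection would have to copy once the short-lived data is dead. The tag names and the function are hypothetical.

```python
def gc_copies(writes, pages_per_block, use_tags):
    """writes: list of (tag, short_lived) tuples in arrival order.

    Returns the number of valid pages that share an erase block with
    dead pages and would need copying before the block can be erased.
    """
    streams = {}
    for tag, short_lived in writes:
        key = tag if use_tags else None   # untagged: one shared stream
        streams.setdefault(key, []).append(short_lived)

    copies = 0
    for pages in streams.values():
        for i in range(0, len(pages), pages_per_block):
            block = pages[i:i + pages_per_block]
            if any(block):                # block contains dead pages
                copies += sum(1 for dead in block if not dead)
    return copies


# Interleaved short-lived ("log") and long-lived ("data") page writes.
writes = [("log", True), ("data", False)] * 8

print(gc_copies(writes, pages_per_block=4, use_tags=False))  # mixed blocks
print(gc_copies(writes, pages_per_block=4, use_tags=True))   # separated blocks
```

With tagging, the short-lived pages fill their own blocks, which become entirely dead and can be erased without copying anything; without it, every block mixes live and dead pages.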
I'm not fully convinced this really is as good as what the open channel flash people wanted. But drive manufacturers were never voluntarily going to give up really complex Flash Translation Layers. They all want to be value-add, with their expensive fancy controllers keeping the market from commoditizing to just using NAND directly. But FDP does show some very real promise and can bring huge read-write-affinity bonuses!
I note that the SSDFS filesystem is still out there being improved and maintained, as a file system that tries to take advantage of all this. I'm not sure if it has made the jump to FDP or is still on the older, much more ornery and never quite loved ZNS specification. I'd love to give it a try, but FDP and ZNS drives are not easy to get ahold of: they require asking very nicely, and when I last checked they required purchasing a very expensive enterprise SSD that cost a ton but had pretty so-so performance figures. That was a couple of years ago now. https://www.phoronix.com/news/Linux-SSDFS-NVMe-ZNS-SSDs https://news.ycombinator.com/item?id=34939248
The paper here is wonderful & beautiful. FDP should make this kind of thing so much easier, and should remove so many of the downsides of drive usage mentioned here. If only it were available. I'd really love it if drive reviewers would look at the feature matrix drives have and comment on FDP, but generally it feels like there's no ask, little pull, and thus no push for an obvious and basically free-to-implement improvement that makes everything vastly better. Alas. Can't wait. Hopefully drive prices are better by 2031 & FDP is finally available. Fingers crossed.
> Storage nerd @ Google
Vendors, are you listening?
(And the software would/could be so much better... if this were available to play with.)
I assume NVRAM buffering at that layer will definitely make the write access pattern less skewed from the SSD’s point of view and can therefore reduce WAF. We did not evaluate that kind of storage-stack setup in the paper, though.