How to Write to SSDs [pdf] (vldb.org)

205 points by matt_d 2 days ago | 32 comments

lia323 2 days ago |

Hi, I’m the first author of the paper. Thanks for the interest and the kind comments.

The extended version is available on arXiv if you’d like more details: https://arxiv.org/pdf/2603.09927

The appendix includes additional details and FAQ-style answers that did not fit into the VLDB version.

eekdrnf9904 6 hours ago | |

Amazing Paper!

zipy124 1 day ago | |

fantastic paper!

maxi-k 1 day ago |

> we introduce a NoWA (No Write Amplification) pattern that guarantees SSD WAF = 1, even at full device utilization.

That they got this to work on regular commodity SSDs (from multiple vendors) is very impressive.

zozbot234 1 day ago | |

Very interesting indeed. They mention a very simple rule of thumb (not new to this work AIUI but still worthwhile) that suggests arranging data into blocks that will all be discarded in bulk at the same time. Doing this is generally already enough to make a dent in write amplification.

lia323 1 day ago | | |

The end goal of the NoWA write pattern is conceptually similar to what you described, in the sense that NoWA tries to increase the chance that data becomes invalid together inside the SSD, which is also what mechanisms such as TRIM try to facilitate. The NoWA pattern is more about proactively aligning the application-level GC behavior with the SSD’s internal GC behavior, such that the SSD has little or no valid-page movement left to do internally.
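To make the "data that becomes invalid together" idea concrete, here is a toy sketch. It is not the paper's NoWA algorithm and not how any particular SSD works. It assumes a simplified model in which each flash block holds a few pages, each page has a known time at which it becomes invalid, and garbage collection must copy every still-valid page out of a victim block before erasing it. The function names and the example death times are made up for illustration.

```python
from typing import List


def victim_waf(death_times: List[int], now: int) -> float:
    """Write amplification for reclaiming one block at time `now`.

    `death_times` holds, for each page in the block, the (hypothetical)
    time at which that page becomes invalid. Pages that are still valid
    must be copied elsewhere before the block can be erased, so freeing
    `invalid` pages costs `valid + invalid` flash page writes overall.
    """
    valid = sum(1 for d in death_times if d > now)
    invalid = len(death_times) - valid
    if invalid == 0:
        return float("inf")  # nothing to reclaim from this block
    return len(death_times) / invalid


def best_victim_waf(blocks: List[List[int]], now: int) -> float:
    """Greedy GC in this toy model: pick the block that is cheapest to reclaim."""
    return min(victim_waf(block, now) for block in blocks)


# Same eight pages, two placements (assumed example data).
mixed = [[1, 9, 1, 9], [1, 9, 1, 9]]    # short- and long-lived pages interleaved
grouped = [[1, 1, 1, 1], [9, 9, 9, 9]]  # pages that die together share a block

print(best_victim_waf(mixed, now=5))    # 2.0: half of the victim must be copied
print(best_victim_waf(grouped, now=5))  # 1.0: the victim is entirely invalid
```

In this model the grouped placement leaves the SSD no valid pages to move, which is the situation the comment above describes.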

user01815-2 2 days ago |

This seems to miss a reference to Zoned XFS, which is the Linux file system that actually looked into this kind of data placement at the file system layer. The paper includes numbers using RocksDB: https://dl.acm.org/doi/10.1145/3725783.3764399

lia323 1 day ago | |

Thanks for pointing this out. I’ll add the reference to the arXiv version later.

In our paper, we only evaluated with regular XFS (see Section 10.3, “What happens if a filesystem is used?” in the arXiv version), but evaluating Zoned XFS would definitely be interesting as well.

ece 1 day ago | |

There seem to be more details about the Linux implementation here: https://zonedstorage.io/

pgaddict 21 hours ago |

Interesting paper. I only started reading / digesting it, but:

- I'm not sure how to interpret Figure 1. It says "Flash writes (KB) per page", but it doesn't really say which page sizes were used. AFAIK MySQL has 16K by default, PostgreSQL has 8K, LeanStore has 4K, which makes the numbers a bit hard to compare.

- Likewise, I'm a bit unsure about the doublewrite buffering in Postgres, described as "indirect". Postgres doesn't really do doublewrite (we really should, I think), we write pages to WAL and then to data files. I assume that's what is meant by "indirect" in the paper. But this very much depends on the checkpoint frequency and write pattern, as the FPI is written only for the first page change. I wonder if the results in the paper consider this. Maybe the workload is such that it always hits the page just once between checkpoints (i.e. a worst case). Also, the WAL part is nicely sequential, which should play nice with SSDs.
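The dependence on checkpoint frequency mentioned above can be sketched with a toy model. This is a rough illustration, not Postgres code and not the paper's measurement method. It assumes checkpoints happen every fixed number of updates (real checkpoints are triggered by time or WAL volume), an 8K page, and a made-up 100-byte WAL record per update. The function name `wal_bytes` and all parameters are hypothetical.

```python
from typing import Iterable


def wal_bytes(page_updates: Iterable[int],
              checkpoint_every: int,
              page_size: int = 8192,
              record_size: int = 100) -> int:
    """Estimate WAL volume when the first change to a page after a
    checkpoint logs a full-page image (FPI) in addition to the record."""
    touched = set()  # pages already modified since the last checkpoint
    total = 0
    for i, page in enumerate(page_updates):
        if i % checkpoint_every == 0:
            touched.clear()  # checkpoint: the next touch of any page logs an FPI again
        if page not in touched:
            total += page_size  # full-page image for the first change
            touched.add(page)
        total += record_size  # the regular WAL record
    return total


hot_page = [1, 1, 1, 1]  # four updates to the same page

print(wal_bytes(hot_page, checkpoint_every=4))  # 8592: one FPI, four records
print(wal_bytes(hot_page, checkpoint_every=1))  # 33168: an FPI on every update
```

The second call corresponds to the worst case described in the comment, where each page is hit only once between checkpoints.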

lia323 15 hours ago | |

Thanks!

The caption of Figure 1 lists the page size used by each system (i.e., the default configuration).

We use different page sizes across systems, and as you said, it is a bit difficult to compare them directly apples-to-apples. This is actually intentional, because it also exposes the B-tree index-level write amplification effects. In that sense, Figure 1 suggests that larger page sizes may not necessarily be great for write amplification.

And yes, you are correct regarding Postgres. Instead of having a separate doublewrite buffer file, Postgres relies on WAL full-page writes, which indirectly trigger additional checkpoint writes, so the effect is not entirely straightforward to quantify. To explain that, we discuss how we measured DB WAF for Postgres in Section 10.7 (“How can we calculate DB WAF on other DBMSs?”) of the appendix version: https://arxiv.org/pdf/2603.09927

Regarding the WAL part, yes, the WAL itself is nicely sequential and should generally behave well on SSDs. But once it gets mixed with small random writes that are eventually persisted to flash, it will unfortunately still likely suffer from SSD WAF.

itsthecourier 2 days ago |

This is the kind of research that creates new DB types, or a super-optimized Postgres; I'm not sure which yet.

jandrewrogers 2 days ago | |

This paper gives a really nice end-to-end treatment of an entire problem domain that is usually taken piecemeal. Almost all of the techniques mentioned are already used in databases in some form. It won't lead to new database types but it provides a framework for thinking about the write amplification problem.

Not every database architecture will be able to easily take advantage of all these techniques. Some designs are much more easily optimizable than others.

melhindi 2 days ago | | |

To add to that: some of the techniques are well known to storage experts, but not yet widespread among database engineers. The paper does a great job of explaining the effects on database systems. Great work!

vetrom 2 days ago | |

It can be both: Postgres has pluggable storage engines. See any of the numerous columnar or sharding extensions for Postgres for examples of prior art.

Dwedit 2 days ago |

SMR Hard Drives have very different rules about how you should access them vs conventional hard drives or SSDs. I wonder how much optimizing for SMR drives (Big sequential writes) would also optimize for other drive types.

schobi 2 days ago |

The paper shows a thorough analysis of write amplification and slowdown/wear with large databases (800GB) on a single machine. The databases are MySQL and PostgreSQL. As already commented, this can lead to an optimized storage table format for greater performance. Nice!

I would expect that a similar analysis can be done for SQLite, maybe with a different dataset and a single writer thread.

lia323 1 day ago | |

Thanks! I have not tested SQLite myself, but it would definitely be worthwhile to evaluate as well. SQLite would likely suffer from write amplification in a similar way as MySQL or PostgreSQL, since it is also a page-based DBMS with in-place updates, regardless of the single-writer design.

The degree of the resulting write amplification depends on several factors, including the fill factor, write skewness, and the write rate relative to the SSD characteristics. We discuss this in more detail in Section 10.2, “When should the DBMS care about WAF?” in the extended arXiv version.

There is also this paper on SQLite/mobile storage and zoned devices that may be relevant in this context: https://www.usenix.org/system/files/atc24-hwang.pdf

ifiokambrose 1 day ago |

Thanks for sharing!

UltraSane 1 day ago |

Enterprise storage systems solve this problem by having writes go to 8GB or more of NVRAM and then get consolidated and flushed to the SSDs. I wish consumer grade systems used a similar system.

lia323 1 day ago | |

All experiments in the paper were done using enterprise SSDs. Large write buffers inside the SSD can definitely help mask the performance degradation of slow flash writes by absorbing and consolidating updates before flushing to flash, but they do not fundamentally solve write amplification itself.

The exception is when the write access pattern repeatedly hits the internal write buffer, so that many updates are absorbed before they ever need to be persisted to flash.
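A toy sketch of the absorption effect, under loud assumptions: the buffer is modeled as a small LRU set of dirty pages, a page is written to flash only when it is evicted or at the final flush, and SSD-internal garbage collection is ignored entirely. Real device buffers are not documented to work this way; the function name and parameters are hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


def pages_flushed(updates: Iterable[int], buffer_pages: int) -> int:
    """Count page writes that reach flash behind an LRU write buffer."""
    buf = OrderedDict()  # dirty pages, least recently updated first
    flushed = 0
    for page in updates:
        if page in buf:
            buf.move_to_end(page)  # update absorbed in the buffer
            continue
        if len(buf) >= buffer_pages:
            buf.popitem(last=False)  # evict the coldest page to flash
            flushed += 1
        buf[page] = True
    return flushed + len(buf)  # remaining dirty pages are flushed at the end


print(pages_flushed([1, 1, 1, 1], buffer_pages=2))  # 1: repeated hits are absorbed
print(pages_flushed([1, 2, 3, 4], buffer_pages=2))  # 4: no reuse, nothing absorbed
```

Only the first pattern benefits from the buffer. In the second, every update still reaches flash, where it is subject to whatever write amplification the placement causes.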

UltraSane 1 day ago | | |

In enterprise storage systems the NVRAM write buffer is centrally located in the controllers and the proprietary filesystem is designed to use it. This means SSDs very rarely have to handle small writes.

dofi4ka 2 days ago |

I felt fooled after clicking on the link and seeing the PDF download (literally writing to my SSD), until I realized that this is the point.

patrulek 2 days ago | |

This is why I'm using a ramdisk for my browser cache.