What every programmer should know about SSDs

What every programmer should know about SSDs (databasearchitects.blogspot.com)

452 points by sprachspiel 5 years ago | 158 comments

bob1029 5 years ago |

Things I have learned about SSDs:

If you want to go fast & save NAND lifetime, use append-only log structures.

If you want to go even faster & save even more NAND lifetime, batch your writes in software (i.e. some ring buffer with a natural back-pressure mechanism) and then serialize them with a single writer into an append-only log structure. Many newer devices have something like this at the hardware level, but your block size is still a constraint when working in hardware. If you batch in software, you can hypothetically write multiple logical business transactions per block I/O. When your physical block size is 4k and your logical transactions are averaging 512 bytes of data, you would otherwise be leaving a lot of throughput on the table.

Going down 1 level of abstraction seems important if you want to extract the most performance from an SSD. Unsurprisingly, the above ideas also make ordinary magnetic disk drives more performant & potentially last longer.
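A minimal sketch of the "multiple transactions per block I/O" idea, assuming a 4 KiB block and a made-up record format (a 2-byte length prefix followed by the payload). The names `pack_into_blocks`, `BLOCK_SIZE` and `HEADER` are hypothetical and only illustrate the packing step, not the ring buffer or the writer:

```python
import struct

BLOCK_SIZE = 4096             # assumed physical block size (the 4k figure above)
HEADER = struct.Struct("<H")  # illustrative 2-byte little-endian length prefix


def pack_into_blocks(records, block_size=BLOCK_SIZE):
    """Pack length-prefixed records into zero-padded, block-sized buffers.

    Every returned buffer is exactly block_size bytes, so each one can be
    appended to the log with a single block-aligned write. Records never
    span two blocks in this sketch.
    """
    blocks = []
    current = bytearray()
    for record in records:
        if not record:
            # A zero length prefix is indistinguishable from padding here.
            raise ValueError("empty records are not supported in this sketch")
        if HEADER.size + len(record) > block_size:
            raise ValueError("record does not fit in one block")
        entry = HEADER.pack(len(record)) + record
        if len(current) + len(entry) > block_size:
            # Block is full: pad to the block boundary and start a new one.
            blocks.append(bytes(current.ljust(block_size, b"\0")))
            current = bytearray()
        current += entry
    if current:
        blocks.append(bytes(current.ljust(block_size, b"\0")))
    return blocks
```

With this format, seven 512-byte records fit in one 4 KiB block, so sixteen such records cost three block writes instead of sixteen.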

pclmulqdq 5 years ago | |

I used to think the same thing, but now that I work on SSD-based storage systems, I'm not sure this holds up in today's storage stacks. Log structuring really helped with HDDs since it meant fewer seeks.

In particular, the filesystem tends to undo a lot of the benefits you get from log-structuring unless you are using a filesystem designed to keep your files log-structured. Using huge writes definitely still helps, though.

A paper that I really like goes deeper into this: http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf

Edit: I had originally said "designed for flash" instead of "designed to keep your files log-structured." F2FS is designed for flash, but in my testing does relatively poorly with log-structured files because of how it works internally.

Edit 2: de-googled the link. Thank you for pointing that out.

10000truths 5 years ago | | |

Achieving cutting-edge storage performance tends to require bypassing the filesystem anyways. Traditionally, that meant using SPDK. Nowadays, opening /dev/nvme* with O_DIRECT and operating on it with io_uring will get you most of the way there.

In either case, the advice given in the article and by the OP is filesystem agnostic.

trulyme 5 years ago | | |

Degoogled link: http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf

gravypod 5 years ago | |

This is the "secret sauce" behind LevelDB: https://github.com/google/leveldb#performance

bob1029 5 years ago | | |

This looks to be a similar technique.

In my testing of these ideas, I've been able to push over 2 million transactions per second (~1 KB per transaction) to a Samsung 960 Pro. For reference, it's rated for 2.1GB/s sequential writes, so I've got it pretty much 100% saturated.

The implementation for something like this is actually really underwhelming when you figure out how to put all the pieces together. I assembled this prototype (also a key-value store) using .NET5, LMAX Disruptor, and a splay tree implementation I copied from Google somewhere. The hardest part was figuring out how to wait for write completion on the caller side (multiple calling threads are ultimately serialized into a single worker thread via the Disruptor). Turns out, busy wait for a few thousand cycles followed by a yield to the OS is a pretty good trick. You just do a while(true) over a completion flag on the transaction object which is set en masse by the handling thread after the write goes to disk. Batch sizes are determined dynamically based on how long the previous batch took to write. In practice, I never observed a batch that took longer than 2-3 milliseconds on my 960 Pro. Max batch size is 4096, and it is permanently full when 100% loaded. A full batch = a nice big IO to disk.
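A rough Python sketch of the single-writer batching pattern described above, not the original .NET code. It makes several substitutions that are assumptions of this example: a `queue.Queue` stands in for the Disruptor ring buffer, a `threading.Event` per transaction replaces the spin-then-yield wait on a completion flag, and the batch cap is fixed rather than tuned dynamically. `GroupCommitLog` is a hypothetical name.

```python
import os
import queue
import threading


class GroupCommitLog:
    """Many caller threads, one writer thread, one fsync per batch."""

    def __init__(self, path, max_batch=4096):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def append(self, payload):
        """Block the caller until its payload has been written and fsynced."""
        done = threading.Event()
        self._queue.put((payload, done))
        done.wait()

    def close(self):
        self._queue.put(None)  # sentinel: flush what is queued, then stop
        self._writer.join()
        os.close(self._fd)

    def _run(self):
        stopping = False
        while not stopping:
            batch = []
            item = self._queue.get()  # block until there is work
            while True:
                if item is None:
                    stopping = True
                    break
                batch.append(item)
                if len(batch) >= self._max_batch:
                    break
                try:
                    item = self._queue.get_nowait()  # drain without waiting
                except queue.Empty:
                    break
            if batch:
                data = b"".join(payload for payload, _ in batch)
                written = 0
                while written < len(data):  # os.write may write partially
                    written += os.write(self._fd, data[written:])
                os.fsync(self._fd)  # one flush covers the whole batch
                for _, done in batch:
                    done.set()  # completion is signalled en masse
```

The point of the design is that the cost of one fsync is shared by every transaction in the batch, so the batch grows naturally when callers arrive faster than the disk can flush.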

ww520 5 years ago | |

LMDB has similar write characteristics where its b-tree is append-only. This gives LMDB amazing performance and very robust ACID transaction support as immutability is baked in.

fulafel 5 years ago | | |

This is quite common in traditional DBs too. E.g. PostgreSQL has its write-ahead log. Both LMDB and PostgreSQL then occasionally need to do some kind of compaction, checkpoint or garbage collection (whatever it's called in various systems), in which the append-only log is reset and any live data in it is imported into the main db data.

remram 5 years ago | |

Shouldn't the OS or libc take care of that? If I write and don't immediately flush()?

KMag 5 years ago | | |

I don't think most libc implementations take care to buffer to filesystem block/cluster boundaries.

AtlasBarfed 5 years ago | |

This is basically the purpose of rocksdb, and to a lesser extent Cassandra

senderista 5 years ago | |

Also: parallelize your writes. This is the biggest difference between SSDs and HDDs: internal parallelism. You’ll have a hard time saturating I/O bandwidth even with huge sequential writes if you don’t introduce some parallelism. Fortunately, io_uring makes this easy from a single thread.
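Python's standard library has no io_uring binding, so the sketch below is only a stand-in for the idea of keeping several writes in flight: it issues positional writes with `os.pwrite` from a thread pool. The function name `parallel_write`, the 1 MiB chunk size and the worker count are all assumptions of this example, and `os.pwrite` is Unix-only.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per write; an arbitrary illustrative size


def parallel_write(path, data, workers=8, chunk=CHUNK):
    """Write data to path using several concurrent positional writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        def write_chunk(offset):
            piece = data[offset:offset + chunk]
            written = 0
            while written < len(piece):  # pwrite may write partially
                written += os.pwrite(fd, piece[written:], offset + written)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(write_chunk, range(0, len(data), chunk)))
        os.fsync(fd)  # make the result durable before returning
    finally:
        os.close(fd)
```

Because each write carries its own offset, the chunks can complete in any order. With ordinary buffered I/O the page cache absorbs most of these writes, so any benefit is more likely to show up with direct I/O or a genuinely asynchronous interface such as io_uring.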

hypertele-Xii 5 years ago | |

Buffering writes is fine if you're ok with losing your data. For some applications that's acceptable, but when I'm writing to disk, it's because I want persistence. "It'll get flushed to disk at some point as long as power doesn't go out" is hardly that.

scns 5 years ago | |

Like this?

https://en.wikipedia.org/wiki/NILFS?wprov=sfla1

jedberg 5 years ago |

This page tells me a lot about SSDs, but it doesn't tell me why I need to know these things. It doesn't really give me any indication about how I should change my behavior if I know that I'll be running on SSD vs spinning disk.

I've always been told, "just treat SSDs like slow, permanent memory".

klodolph 5 years ago |

If you care about SSDs, one paper you should read is “Don’t Stack Your Log on My Log” by Yang et al. 2014

https://www.usenix.org/system/files/conference/inflow14/infl...

> Log-structured applications and file systems have been used to achieve high write throughput by sequentializing writes. Flash-based storage systems, due to flash memory’s out-of-place update characteristic, have also relied on log-structured approaches. Our work investigates the impacts to performance and endurance in flash when multiple layers of log-structured applications and file systems are layered on top of a log-structured flash device. We show that multiple log layers affects sequentiality and increases write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection. All of these effects can combine to negate the intended positive affects of using a log. In this paper we characterize the interactions between multiple levels of independent logs, identify issues that must be considered, and describe design choices to mitigate negative behaviors in multi-log configurations.

andrewmcwatters 5 years ago |

My opinion is probably... not technically correct... until you have to deal with drive reliability and write guarantees, but I don't think programmers actually have to know anything about SSDs in the same way that developers had to know particular things about HDDs.

This is out of pure speculation, but there had to be a period of time during the mass transition to SSDs that engineers said, OK, how do we get the hardware to be compatible with software that is, for the most part, expecting that hard disk drives are being used, and just behave like really fast HDDs.

So, there's almost certainly some non-zero amount of code out there in the wild that is or was doing some very specific write optimized routine that one day was just performing 10 to 100 times faster, and maybe just because of the nature of software is still out there today doing that same routine.

I don't know what that would look like, but my guess would be that it would have something to do with average sized write caches, and those caches look entirely different today or something.

And today, there's probably some SSD specific code doing something out there now, too.

rossdavidh 5 years ago |

Interesting, and fun to read and think about! And, as a professional programmer for 17 years now, not once have I done anything where this would have been important for me to know (even if I had been running my code on a system with SSD's). So, I'm not convinced the title is at all accurate.

But, fun to read and think about.

cottsak 5 years ago | |

I think the key is hidden in > which can help creating software that is capable of exploiting them

Unless you're writing desktop software or your application behaves in a way where you have actually selected the particular hardware components (most of us in cloud hosting don't do this), you probably don't [need to] care.

dang 5 years ago |

What someone else said about that in 2014:

What every programmer should know about solid-state drives - https://news.ycombinator.com/item?id=9049630 - Feb 2015 (31 comments)

cottsak 5 years ago | |

haha! very similar sections too .. almost looked copied for a brief moment as i skimmed there

FpUser 5 years ago |

It is really puzzling why "every programmer" should burden their already overloaded brains with this. If they're reading/writing some config/data files this knowledge would not help one bit. If they're using a database then it falls to the database vendor to optimize for this scenario.

So I think that unless this "every programmer" is a database storage engine developer (not too many of them I guess) their only concern would mostly be: how close is my SSD to that magical point where it has to be cloned and replaced before shit hits the fan.

rabuse 5 years ago |

A little off topic, but I bought a new Macbook Pro with the M1 chip with 8GB of RAM, and I'm worried about the swap usage of this machine wearing out the SSD too quickly. Is this an actual concern, as my swap has been in the multiple GB range with my use?

cbsmith 5 years ago | |

It's an actual concern for you. For Apple it's a variant on planned obsolescence. ;-)

Note though that memory use metrics on macOS can be a bit misleading. Make sure that you're seeing what's actually there.

ksec 5 years ago | |

Generally speaking macOS is extremely write heavy for all sorts of reasons, even before the switch to ARM. But in the majority of cases it should last 4-5 years without problem.

The heavy write bug Apple said was due to misreporting and was fixed ( so they say ).

I do think you should pay attention to it from time to time. iCloud Sync, Spotlight, Safari heavy tabs are all known to cause heavy paging in some corner cases. You might end up having a TB of data written for no apparent reason. Apple used to ship their MacBooks with MLC; on a 512GB MLC drive you could do 500TBW without problem, which is ~13 years of usage if you write 100GB per day. Not sure about the M1 machines.

If you do a lot of dev staging or video and photo editing, these drives will fail quite quickly, in the space of 2-3 years, although some would argue MacBook Airs are not made for those tasks. This is especially true if you have 8GB of RAM and 256GB of NAND.
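The "~13 years" figure is simple division, and the same arithmetic works for any drive once you know its rated endurance and your daily write volume. A small sketch, with a hypothetical helper name and decimal units (1 TB = 1000 GB) as the assumption:

```python
def years_of_endurance(rated_tbw, gb_written_per_day):
    """Years until the rated terabytes-written figure is reached."""
    days = rated_tbw * 1000 / gb_written_per_day  # 1 TB = 1000 GB
    return days / 365


print(round(years_of_endurance(500, 100), 1))  # 13.7
```

Rated TBW is a warranty-style figure, not a hard failure point, so this is a planning estimate rather than a prediction.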

Grazester 5 years ago | |

Why did you get the 8 gig version? If you are using all this swap then you purchased the wrong MacBook.

rabuse 5 years ago | | |

Honestly, don't run much, so didn't think it would be that bad stepping down from my 16GB machine.

1-6 5 years ago | |

From what I’ve been able to gather, the excessive paging may actually have to do with non-native apps running on the M1. Avoid those.

rabuse 5 years ago | | |

Most of my programs are JetBrains IDE's and browsers. Don't know if they're optimized for M1.

raihansaputra 5 years ago | |

I think the excessive wear was caused by a bug. Try upgrading to the .4 release.

kortilla 5 years ago |

The title should be “why SSDs mean programmers no longer have to think about hard drives”.

These are all reasons SSDs are much more pleasant to work with than old platter disks.

cbsmith 5 years ago | |

Well, they no longer need to think about hard disks, but there are a lot of assumptions from the world of hard disks that play out very differently in the SSD world.

formerly_proven 5 years ago | | |

I don't think there's any optimization for hard drives that is going to hurt on SSDs, and unoptimized workloads are always going to work better on SSDs. I'm inclined to agree with GP that SSDs are quite close to random-access storage and so there is little to worry about.

abledon 5 years ago | |

Why every programmer of a small subset of programmers who actually need to know this

teddyh 5 years ago |

What everyone should know is that flash drives can lose their data when left unpowered for as little as three months.

dataflow 5 years ago |

What's the flash translation layer made of? Is the flash technology used for that more durable than the rest of the SSD itself? (like say MLC vs. QLC?)

pkaye 5 years ago | |

The FTL is like a virtual memory manager. It is firmware/hardware to manage things like the logical to physical mapping table, garbage collection, error correction, bad block management. Yes there will be a lot of FTL data structures stored on the flash. It can be made durable by redundant copies, writing in SLC mode or having recovery algorithms. I used to develop SSD firmware in the past if you have further questions.

jng 5 years ago | | |

Hey that's very interesting! How much of the FTL logic is done with regular MCU code vs custom hardware? Is there any open source SSD firmware out there that one could look at to start experimenting in this field, or at least something pointing in that direction, be it open or affordable software, firmware, FPGA gateway or even IC IP? I believe there is value in integrating that part of the stack with the higher level software, but it seems quite difficult to experiment unless one is in the right circles / close to the right companies. Thanks!

SeanCline 5 years ago | |

You're right that the FTL has some durability concerns which, in addition to performance, is why it's typically cached in DRAM. Older DRAM-less SSDs were unreliable in the long-term but that's been improving with the adoption of HMB, which lets the SSD controller carve out some system RAM to store FTL data.

riobard 5 years ago |

One thing I'm still puzzled about is SSD over-provisioning, which is also mentioned by the tutorial (https://codecapsule.com/2014/02/12/coding-for-ssds-part-4-ad...) recommended by the article:

> A drive can be over-provisioned simply by formatting it to a logical partition capacity smaller than the maximum physical capacity. The remaining space, invisible to the user, will still be visible and used by the SSD controller.

Does the controller read the partition table to decide that the space beyond the logical partition is safe to use as scrap?

rdc12 5 years ago | |

The SSD maintains a translation table for all the virtual addresses exposed by the drive, that maps to the underlying flash physical addresses. Any physical address not in that table, is unallocated and the drive can use freely.

riobard 5 years ago | | |

So over-provisioning has to be done before any writes to the drive? What if I want to over-provision a used drive? Discard all blocks first?

ars 5 years ago | |

Any sector with nothing written on it can be used as scrap.

So if you partition the entire thing, but just never write to the full disk (you never use all the space), that also works as overprovisioning.

Partitioning just forces that to happen.

riobard 5 years ago | | |

If I partition the entire drive, eventually all blocks will be used, depending on how the filesystem allocates, right? So to guarantee some free space it's better to over-provision by under-partitioning. Now how do I ensure that on a used drive?

dan-robertson 5 years ago |

See this paper from 2017, The unwritten contract of solid state drives: https://dl.acm.org/doi/10.1145/3064176.3064187

Agentlien 5 years ago |

This reminds me of a recent interview[0] by Digital Foundry with the Core Technology Director of Ratchet and Clank: Rift Apart.

Near the beginning they talk about how targeting the PlayStation 5, which has an SSD, drastically changed how they went about making the game.

In short, the quick data transfer meant they were CPU bound rather than disk bound and could afford to have a lot of uncompressed data streamed directly into memory with no extra processing before use.

[0] https://youtu.be/-YpCQrPRpE0

1_player 5 years ago |

A lot of talk about pages, but no mention about how big these pages are. From a quick look on Google, most SSDs have 4kB pages, with some reaching 8kB or even 16kB.

wtallis 5 years ago | |

SSDs mostly tell the host system that they have 512-byte sectors or sometimes 4kB sectors, and the typical flash translation layer works in 4kB sectors because that's a good fit for the kind of workloads coming from a host system that usually prefers to do things (eg. virtual memory) in 4kB chunks. But the underlying NAND flash page size has been 16kB for years.

cbsmith 5 years ago | | |

...and all that cruft, and the logic to try to make handling of it not so bad, makes for a lot of complexity and unintended consequences.

2OEH8eoCRo0 5 years ago |

>Drives not Disks

And where did the word "drive" come from? I thought it referred to motors that spin the media, which SSDs also do not have.

DrNuke 5 years ago |

A number of high-level techniques help rationalize data management and transfer, but the mileage of practical implementations may vary a lot. Generally speaking, only a small number of applications really need to take care and add a further layer of abstraction, because the best practices already codified into any widespread language do an acceptable job.

personjerry 5 years ago |

How big is the write cache usually and how does it work? Typically I've seen the write caches be something like 32MB in size, but the "top speed" seems to be sustained for files much bigger than 32MB, which doesn't make sense to me if that top speed is supposedly from writing to the cache. How does that work?

mikewarot 5 years ago |

If you leave un-partitioned space on the SSD, how the heck does the SSD know it is ok to erase it? Wouldn't it be safer to partition it as an extra drive letter, format it, and then leave that drive alone? That would allow the OS to trim all the "empty" blocks.

qiqitori 5 years ago | |

Not 100% sure what you are replying to, and not sure what you meant by "safer", but this may help:

The actual physical address on the storage chip and the physical address from the operating system's perspective don't have much to do with one another. For hard drives, "un-partitioned space" means that there is a physical "chunk of metal" that is unused.

However, that's not the case for SSDs. SSDs dynamically remap "OS-physical" block numbers to whatever they want. (Preferably addresses that have never been used before or that have been discarded/trimmed. If there aren't any available, perhaps to the address that was previously used for the same block number.)
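A toy model may make the remapping concrete. It is a deliberately simplified sketch, not how any real controller works: it tracks single pages, whereas real flash erases in much larger blocks and has to copy live data during garbage collection. `ToyFTL` and its method names are hypothetical.

```python
class ToyFTL:
    """Toy flash translation layer: logical block number -> physical page."""

    def __init__(self, physical_pages):
        self.mapping = {}                        # logical block -> physical page
        self.free = list(range(physical_pages))  # erased pages, ready to write
        self.stale = set()                       # pages holding dead data

    def write(self, lba):
        """Write out of place: always pick a fresh page, never overwrite."""
        if not self.free:
            self._garbage_collect()
        new_page = self.free.pop(0)
        old_page = self.mapping.get(lba)
        if old_page is not None:
            self.stale.add(old_page)  # the previous copy is now dead
        self.mapping[lba] = new_page
        return new_page

    def trim(self, lba):
        """The host says this logical block no longer holds useful data."""
        old_page = self.mapping.pop(lba, None)
        if old_page is not None:
            self.stale.add(old_page)

    def _garbage_collect(self):
        if not self.stale:
            raise RuntimeError("no free or stale pages: device is full")
        self.free.extend(sorted(self.stale))  # "erase" and reuse dead pages
        self.stale.clear()

    def spare_pages(self):
        """Pages not backing any live logical block."""
        return len(self.free) + len(self.stale)
```

In this model a logical block that has never been written, or that has been trimmed, simply has no entry in the table, so the pages behind it count as spare. A block that was written once and never trimmed keeps its page pinned, which is the concern raised about used drives.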

mikewarot 5 years ago | | |

>Not 100% sure what you are replying to, and not sure what you meant by "safer", but this may help:

I'm replying to the whole of the comments on this article. The write amplification problem goes up as the number of "free" sectors/blocks goes down. Many solutions have been presented that don't allocate X% of the hard drive... but I'm not sure that any of them let the hard drive's SSD controller know they aren't allocated.

For that to happen, the OS has to have TRIM support, AND the block in question has to be on a volume that the OS is managing.

My worry is that if you have a blank partition, it's not being actively managed by anything, and thus isn't going to be TRIMed, and thus the SSD doesn't know the blocks are free for use.

Thus, leaving an unpartitioned area isn't going to help.

ropeladder 5 years ago |

If sequential and random reads are mostly the same on SSDs, does that make the distinction between columnar and row-based databases/data storage less important?

wtallis 5 years ago | |

Nope, unless your columns are all several kB wide. If you force the hardware to perform a multi-kB read for each 64-bit value you need, you're still going to waste a lot of potential performance.
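A back-of-the-envelope sketch of that waste, assuming a 4 KiB minimum read unit and densely packed data (both assumptions of this example, as are the function names). It compares the bytes read from the device when scanning a single 64-bit column out of a row layout versus a columnar layout:

```python
import math

PAGE = 4096  # assumed minimum read unit in bytes


def pages_read(total_bytes, page=PAGE):
    """Number of whole pages needed to cover total_bytes."""
    return math.ceil(total_bytes / page)


def column_scan_bytes(rows, row_bytes, value_bytes=8, page=PAGE):
    """Bytes read to scan one column: (row layout, columnar layout)."""
    row_store = pages_read(rows * row_bytes, page) * page
    column_store = pages_read(rows * value_bytes, page) * page
    return row_store, column_store


row_store, column_store = column_scan_bytes(1_000_000, 256)
print(row_store, column_store)  # 256000000 8003584
```

For one million 256-byte rows the row layout reads about 32 times more data for the same column, regardless of how cheap random access is.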

rectang 5 years ago |

I wince at the amount of wear the `git clean -dxf; npm ci` cycle must be putting on my SSD.

githubalphapapa 5 years ago | |

If you're on Linux, libeatmydata might help reduce the number of writes hitting the SSD.

CoolGuySteve 5 years ago |

The claim about parallelism isn't true. Most benchmarks and my own experience show that sequential reads are still significantly faster than random reads on most NVME drives.

However, random read performance is only somewhere between a 3rd to half as fast as sequential compared to a magnetic disk where it's often 1/10th as fast.

pkaye 5 years ago | |

At what queue depth do you test read performance? Sequential reads can be made fast at low queue depth by the SSD controller doing prefetch reads internally. I've worked on such algorithms myself.

CoolGuySteve 5 years ago | | |

Show me a benchmark at any queue depth where random reads are as fast as the fastest sequential rate for that drive. It's simply not true.

I suspect it has something to do with prediction on the controller but I'm also not confidently spewing a bunch of bullshit about drive architecture unlike this article.

wly_cdgr 5 years ago |

There's nothing whatsoever I should need to know about SSDs as a Javascript programmer and if there is then the programmers on the lower levels haven't done their jobs right and are wasting my time

hddherman 5 years ago | |

Ever heard of leaky abstractions?

wly_cdgr 5 years ago | | |

Sure, yeah...that's the "haven't done their jobs right and are wasting my time" part

BatteryMountain 5 years ago |

So.. interesting topic. Last year I experimented with some C# + Samsung 970 Evo Plus Nvme + MessagePack (with compression) + Zfs .. to benchmark how fast I could dump objects from .net memory to disk.

The numbers involved were insane and I played with various scenarios, with/without compression (MessagePack feature), with/without typeless serializer (MessagePack feature), with/without async and then the difference between using sync vs async and forcing disk flushes. I also weighed the difference between writing 1 fat file (append only) or millions of small files. I also checked the difference between using .net streams versus using File.WriteAllBytes (C# feature, an all-in-memory operation, good for small writes, bad for bigger files or async serialization + writing). I also played with the number of objects involved (100K, 1M, 10M, 50M).

I cannot remember all the numbers involved, but I still have the code for all of it somewhere, so maybe I can write a blogpost about it. But I do remember being utttterly stunned about how fast it actually was to freeze my application state to disk and to thaw it again (the class name was Freezer :p).

The whole reason was, I started using Zfs and read up a bit about how it works. I also have some idea about how SSDs work. I also have some idea how serialization works and writing to disk works (streams etc).. I also have a rough idea how mysql, postgres, sql server save their datafiles to disk and what kind of compromises they make. So one day I was just sitting being frustrated with my data access layers and it dawned on me to try and build my own storage engine for fun, so I started by generating millions of objects that sit in memory, which I then serialized with MessagePack using a Parallel.ForEach (C# feature) to a Samsung 970 Evo Plus to see how fast it would be. It blew my mind and I still don't trust that code enough to use it in production but it does work. Another reason why I tried it out was because at work we have some postgres tables with 60m+ rows that are getting slow and I'm convinced we have a bad data model + too many indexes and that 60m rows are not too much (since then we've partitioned the hell out of it in multiple ways but that is a nightmare on its own since I still think we sliced the data the wrong way, according to my intuition and where the data has natural boundaries, time will tell who was right).

So I do believe there is a space in the industry where SSD's, paired with certain file systems, using certain file sizes and chunking, will completely leave sql databases in the dust, purely by the mechanism on how each of those things work together. I haven't put my code out in public yet and only told one other dev about it, mostly because it is basically sacrilege to go against the grain in our community and to say "I'm going to write my own database engine" sounds nuts even to me.

BrissyCoder 5 years ago |

Why on earth do 99.5% of programmers even need to know what SSD stands for?

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
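In Python terms there are two separate layers of buffering before that call matters. A minimal sketch, with `durable_append` as a hypothetical helper name:

```python
import os


def durable_append(path, data):
    """Append data to path and return only once it has been fsynced."""
    with open(path, "ab") as f:
        f.write(data)         # lands in Python's userspace buffer
        f.flush()             # hands it to the kernel page cache
        os.fsync(f.fileno())  # asks the kernel to push it to the device
```

Skipping `flush()` leaves the bytes in the process, and skipping `os.fsync()` leaves them in the page cache, so neither alone survives a power loss. On Linux a newly created file may also need an fsync on its parent directory before the directory entry itself is durable.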