IO Devices and Latency(planetscale.com) |
IO Devices and Latency(planetscale.com) |
You can check out our sandbox here:
Your "1 in a million" comment on durability is certainly too pessimistic once you consider the briefness of the downtime before a new server comes in and re-replicates everything, right? I would think if your recovery is 10 minutes for example, even if each of three servers is guaranteed to fail once in the month, I think it's already like 1 in two million? and if it's a 1% chance of failure in the month failure of all three overlapping becomes extremely unlikely.
Thought I would note this because one-in-a-million is not great if you have a million customers ;)
Absolutely. Our actual durability is far, far, far higher than this. We believe that nobody should ever worry about losing their data, and thats the peace of mind we provide.
If you replace the failed(or failing) node right away, the failure percentage goes down greatly. You would likely need the probability of a node going done in 30 minutes time space. Assuming the migration can be done in 30 min.
(i hope this calculation is correct)
If 1% probability per month then 1%/(43800/30) = (1/1460)% probability per 30 min.
For three instances: (1/1460)% * (1/1460)% * (1/1460)% = (1/3112136000)% probability per 30 min that all go down.
Calculated for one month (1/3112136000)% * (43800/30) = (1/2131600)%
So one in 213 160 000 that all three servers go down in a 30 minute time span somewhere in one month. After the 30 minutes another replica will already be available, making the data safe.
I'm happy to be corrected. The probability course was some years back :)
Edit: And for real, fantastic work, this is awesome.
If this helps as context, the git diff for merging this into our website was: +5,820 −1
(it's a topic I'm deeply familiar with so I don't have a comment on the content, it looks great on a skim!) - but I've been sketching animations for my own blog and not liked the last few libs I tried.
Thanks!
We are generally bad at internalizing comparisons at these scales. The visualizations make a huge difference in building more detailed intuitions.
Really nice work, thank you!
This is beautiful and brilliant, and also is a great visual tool to explain how some of the fundamental algorithms and data structures originate from the physical characteristics of storage mediums.
I wonder if anyone remembers the old days where you programmed your own custom defrag util to place your boot libs and frequently used apps to the outer tracks of the hard drive, so they are loaded faster due to the higher linear velocity of the outermost track :)
For reasons discussed in your article we would arrange tape processing as much as possible in sequential scans, something at which COBOL was quite excellent. One of the common performance problems was when there was a mismatch between a slower COBOL processing speed that could not keep up with the flow of blocks coming off the drive head.
In this case you would see the drive start to overshoot as it read more blocks than the COBOL program could handle. The drive would begin a painful jump forward/spool backward motion which made the performance issue quite visible. You would then eyeball the code to understand way the program was not keeping up, correct, and resubmit until the motion disappeared.
The only add is that it understates the impact of SSD parallelism. 8 Channel controllers are typical for high end devices and 4K random IOPS continue to scale with queue depth, but for an introduction the example is probably complex enough.
It is great to see PlanetScale moving in this direction and sharing the knowledge.
I'm curious, what do you do on the internet without js these days?
Latency is king in all performance matters. Especially in those where items must be processed serially. Running SQLite on NVMe provides a latency advantage that no other provider can offer. I don't think running in memory is even a substantial uplift over NVMe persistence for most real world use cases.
Mel never wrote time-delay loops, either, even when the balky Flexowriter
required a delay between output characters to work right.
He just located instructions on the drum
so each successive one was just past the read head when it was needed;
the drum had to execute another complete revolution to find the next instruction.
[0] https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The...Our workaround was this: https://discord.com/blog/how-discord-supercharges-network-di...
That said, we're running a redundant system in which MySQL semi-sync replication ensures every write is durable to two machines, each in a different availability zone, before that write's acknowledged to the client. And our Kubernetes operator plus Vitess' vtorc process are working together to aggressively detect and replace failed or even suspicious replicas.
In GCP we find the best results on n2d-highmem machines. In AWS, though, we run on pretty much all the latest-generation types with instance storage.
Having recently added support for storing our incremental indexes in https://github.com/feldera/feldera on S3/object storage (we had NVMe for longer due to obvious performance advantages mentioned in the previous article), we'd be happy for someone to disrupt this space with a better offering ;).
1. Some systems do not support replication out of the box. Sure your cassandra cluster and mysql can do master slave replication, but lots of systems cannot.
2. Your life becomes much harder with NVME storage in cloud as you need to respect maintenance intervals and cloud initiated drains. If you do not hook into those system and drain your data to a different node, the data goes poof. Separating storage from compute allows the cloud operator to drain and move around compute as needed and since the data is independent from the compute — and the cloud operator manages that data system and draining for that system as well — the operator can manage workload placements without the customer needing to be involved.
Replicated network-attached storage that presents a "local" filesystem API is a powerful way to create durability in a system that doesn't build it in like we have.
If you miss a termination event you miss your chance to copy that data elsewhere. Of course, if you're _always_ copying the data elsewhere, you can rest easy.
I get that local disks are finite, yeah, but I think the core/memory/disk ratio would be good enough for most use cases, no? There are plenty of local disk instances with different ratios as well, so I think a good balance could be found. You could even use local hard disk ones with 20TB+ disks for implementing hot/cold storage.
Big kudos to the PlanetScale team, they're like, finally doing what makes sense. I mean, even AWS themselves don't run Elasticsearch on local disks! Imagine running ClickHouse, Cassandra, all of that on local disks.
The main issue was that after a stop-start event, the disks are wiped. SQL Server can’t automatically handle this, even if the rest of the cluster is fine and there are available replicas. It won’t auto repair the node that got reset. The scripting and testing required to work around this would be unsupportable in production for all but the bravest and most competent orgs.
Example; you get a tenant performance issue on Sunday morning US time. The simplest fix is often rescale to a larger VM for the weekend, then get the A team working on the root cause first thing Monday. The incremental cost is minimal and avoids far more costly staff burnout.
On:
> Another issue with network-attached storage in the cloud comes in the form of limiting IOPS. Many cloud providers that use this model, including AWS and Google Cloud, limit the amount of IO operations you can send over the wire. [...]
> If instead you have your storage attached directly to your compute instance, there are no artificial limits placed on IO operations. You can read and write as fast as the hardware will allow for.
I feel like this might be a dumb series of questions, but:
1. The ratelimit on "IOPS" is precisely a ratelimit on a particular kind of network traffic, right? Namely traffic to/from an EBS volume? "IOPS" really means "EBS volume network traffic"?
2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?
I see pretty clearly putting storage and compute on the same machine strictly a latency win, because you structurally have one less hop every time. But is it also a throughput-per-dollar win too?
The EBS volume itself has a provisioned capacity of IOPS and throughput, and the EC2 instance it's attached to will have its own limits as well across all the EBS volumes attached to it. I would characterize it more like a different model. An EBS volume isn't just just a slice of a physical PCB attached to a PCIe bus, it's a share in a large distributed system a large number of physical drives with its own dedicated network capacity to/from compute, like a SAN.
> 2. Does this save me money? And if yes, is from some weird AWS arbitrage? Or is it more because of an efficiency win from doing less EBS networking?
It might. It's a set of trade-offs.
edit: apparently they build a kafkaesque layer of caching. No thank you, I'll just keep my data on locally attached NVMe.
I can't speak to Neon specifically but I've worked a lot with analytic databases, which often use NVMe SSD caches to operate efficiently on S3 data. For time-ordered datasets like observability (e.g., metrics) most queries go to recent data which in the steady state is not just in NVMe SSD storage but generally RAM as well if you are properly tuned. For example, indexes and other metadata are permanently cached.
In realistic tests of the above scenario the effect of nVME SSD can be surprisingly muted. That's especially true if you can use clusters that spread processing across multiple compute nodes, which gives you more RAM to play with and also multiplies storage bandwith.
There are downsides to S3 of course like restarts, which require management to avoid performance issues.
One small nit: > A typical random read can be performed in 1-3 milliseconds.
Um, no. A 7200 RPM platter completes a rotation in 8.33 milliseconds, so rotational delay for a random read is uniformly distributed between 0 and 8.33ms, i.e. mean 4.16ms.
>a single disk will often have well over 100,000 tracks
By my calculations a Seagate IronWolf 18TB has about 615K tracks per surface given that it has 9 platters and 18 surfaces, and an outer diameter read speed of about 260MB/s. (or 557K tracks/inch given typical inner and outer track diameters)
For more than you ever wanted to know about hard drive performance and the mechanical/geometrical considerations that go into it, see https://www.msstconference.org/MSST-history/2024/Papers/msst...
I’m still annoyed they didn’t include the drain time equation I used for calculating track width, which falls out of one of their equations.
Oh, and I’m very glad you showed differing track sizes across the platter. (BTW, did you know track sizes differ between platters? Google “disks are like snowflakes”)
It seems like they don't emphasise strongly enough _make sure you colocate your server in the same cloud/az/region/dc as our db. I suspect a large fraction of their users don't realise this, and have loads of server-db traffic happening very slowly over the public internet. It won't take many slow db reads (get session, get a thing, get one more) to trash your server's response latency.
There were a few storage methods in between tape & HDDs, notably core memory & magnetic drum memory.
As someone who has also use GSAP a decent amount, these days I usually have a better experience with SVG.js [1].
Best of luck =3
Why SQLite instead of a traditional client-server database like Postgres? Maybe it's a smidge faster on a single host, but you're just making it harder for yourself the moment you have 2 webservers instead of 1, and both need to write to the database.
> Latency is king in all performance matters.
This seems misleading. First of all, your performance doesn't matter if you don't have consistency, which is what you now have to figure out the moment you have multiple webservers. And secondly, database latency is generally miniscule compared to internet round-trip latency, which itself is miniscule compared to the "latency" of waiting for all page assets to load like images and code libraries.
> Especially in those where items must be processed serially.
You should be avoiding serial database queries as much as possible in the first place. You should be using joins whenever possible instead of separate queries, and whenever not possible you should be issuing queries asynchronously at once as much as possible, so they execute in parallel.
Application <-> SQLite <-> NVMe
has orders of magnitude less latency than
Application <-> Postgres Client <-> Network <-> Postgres Server <-> NVMe
> You should be avoiding serial database queries as much as possible in the first place.
I don't get to decide this. The business does.
If only one thread of writing is required, then SQLite works absolutely great.
The whole point of getting your commands down to microsecond execution time is so that you can get away with just one thread of writing.
Entire financial exchanges operate on this premise.
Update: about 800us on a more modern system.
write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 35],
| 30.00th=[ 37], 40.00th=[ 38], 50.00th=[ 40], 60.00th=[ 43],
| 70.00th=[ 53], 80.00th=[ 60], 90.00th=[ 70], 95.00th=[ 84],
| 99.00th=[ 137], 99.50th=[ 174], 99.90th=[ 404], 99.95th=[ 652],
| 99.99th=[ 2311]Here PM983 doing `fio --name=fsync_test --ioengine=sync --rw=randwrite --bs=4k --size=1G --numjobs=1 --runtime=10s --time_based --fsync=1`
Jobs: 1 (f=1): [w(1)][100.0%][w=183MiB/s][w=46.7k IOPS][eta 00m:00s]
fsync_test: (groupid=0, jobs=1): err= 0: pid=11905: Fri Mar 14 13:34:34 2025
write: IOPS=39.1k, BW=153MiB/s (160MB/s)(1527MiB/10001msec); 0 zone resets
clat (nsec): min=1052, max=223288, avg=1606.69, stdev=2345.64
lat (nsec): min=1082, max=223458, avg=1653.08, stdev=2346.58
clat percentiles (nsec):
| 1.00th=[ 1128], 5.00th=[ 1176], 10.00th=[ 1240], 20.00th=[ 1320],
| 30.00th=[ 1448], 40.00th=[ 1496], 50.00th=[ 1528], 60.00th=[ 1576],
| 70.00th=[ 1640], 80.00th=[ 1720], 90.00th=[ 1816], 95.00th=[ 1960],
| 99.00th=[ 2576], 99.50th=[ 3376], 99.90th=[ 10816], 99.95th=[ 32640],
| 99.99th=[124416]
bw ( KiB/s): min=123168, max=190568, per=99.00%, avg=154788.63, stdev=19610.50, samples=19
iops : min=30792, max=47642, avg=38697.16, stdev=4902.62, samples=19
lat (usec) : 2=95.61%, 4=4.10%, 10=0.19%, 20=0.04%, 50=0.03%
lat (usec) : 100=0.02%, 250=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=13, max=1238, avg=23.08, stdev= 9.27
sync percentiles (usec):
| 1.00th=[ 15], 5.00th=[ 16], 10.00th=[ 16], 20.00th=[ 17],
| 30.00th=[ 18], 40.00th=[ 25], 50.00th=[ 26], 60.00th=[ 26],
| 70.00th=[ 26], 80.00th=[ 26], 90.00th=[ 26], 95.00th=[ 27],
| 99.00th=[ 34], 99.50th=[ 79], 99.90th=[ 101], 99.95th=[ 126],
| 99.99th=[ 347]
The same test on SN850X Jobs: 1 (f=1): [w(1)][100.0%][w=22.9MiB/s][w=5859 IOPS][eta 00m:00s]
fsync_test: (groupid=0, jobs=1): err= 0: pid=23328: Fri Mar 14 13:35:04 2025
write: IOPS=5742, BW=22.4MiB/s (23.5MB/s)(224MiB/10001msec); 0 zone resets
clat (nsec): min=400, max=110253, avg=797.80, stdev=1244.19
lat (nsec): min=430, max=110273, avg=826.49, stdev=1248.86
clat percentiles (nsec):
| 1.00th=[ 502], 5.00th=[ 540], 10.00th=[ 572], 20.00th=[ 612],
| 30.00th=[ 644], 40.00th=[ 668], 50.00th=[ 708], 60.00th=[ 748],
| 70.00th=[ 804], 80.00th=[ 868], 90.00th=[ 1032], 95.00th=[ 1176],
| 99.00th=[ 1560], 99.50th=[ 2224], 99.90th=[ 8384], 99.95th=[23424],
| 99.99th=[66048]
bw ( KiB/s): min=19800, max=24080, per=100.00%, avg=23004.21, stdev=1039.13, s amples=19
iops : min= 4950, max= 6020, avg=5751.05, stdev=259.78, samples=19
lat (nsec) : 500=0.80%, 750=58.72%, 1000=29.04%
lat (usec) : 2=10.89%, 4=0.28%, 10=0.18%, 20=0.04%, 50=0.04%
lat (usec) : 100=0.01%, 250=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=136, max=28040, avg=172.88, stdev=195.00
sync percentiles (usec):
| 1.00th=[ 145], 5.00th=[ 149], 10.00th=[ 151], 20.00th=[ 151],
| 30.00th=[ 159], 40.00th=[ 159], 50.00th=[ 159], 60.00th=[ 159],
| 70.00th=[ 159], 80.00th=[ 161], 90.00th=[ 198], 95.00th=[ 202],
| 99.00th=[ 396], 99.50th=[ 416], 99.90th=[ 594], 99.95th=[ 1467],
| 99.99th=[ 5145]There are only a few major NAND manufacturers: Samsung, Micron, Kioxia / Western Digital, SK Hynix, and their branded products are usually the best.
There are also several 3rd party controller developers: Phison, Marvell, Silicon Motion, which I think are the largest, and then a bunch of others.
I hadn't looked at this in a couple years, so 16 channel controllers are more common now, but only on high end enterprise devices.
4KB random read/write specs are definitely not trustable without testing. They are usually at max queue depth and, at least for consumer devices, based on writing to a buffer in SLC mode, so they will be a lot lower once the buffer is exhausted. Enterprise specs might be more realistic but there isnt as much public testing data available.
Neither is a good assumption from my experience. Failures being correlated to any degree greatly increases the chances of what the aviation world refers to as “the holes in the Swiss cheese lining up”.
Browse the web, send/receive email, read stories, play games, the usual. I primarily use native apps and selectively choose what sites are permitted to use javascript, instead of letting websites visited on a whim run javascript willy nilly.
On my older system I had a WD_BLACK SN850X but had it connected to an M.1 slot which may be limiting. This is where I measured 1-2ms latency.
Is there any good place to get numbers of what is possible with enterprise hardware today? I've struggled for some time to find a good source.
[1] https://blocksandfiles.com/2023/08/07/kioxias-rocketship-dat...
Is my understanding correct, that this means you propagate writes asynchronously from the primary to the secondary servers (without waiting for an "ACK" from them for writes)?
As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVME drive connected to an M.1 slot in 2022. On my current test system I'm seeing avg latencies of around 250us, sometimes a lot more (there a fluctuations).
# put the following in a file "fio.job" and run "fio fio.job"
# enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
[Job1]
#direct=1
fsync=1
readwrite=randwrite
bs=64k # size of each write()
size=256m # total size written write: IOPS=3007, BW=11.7MiB/s (12.3MB/s)(118MiB/10001msec); 0 zone resets
clat (usec): min=196, max=23274, avg=331.13, stdev=220.25
lat (usec): min=196, max=23275, avg=331.25, stdev=220.27
clat percentiles (usec):
| 1.00th=[ 210], 5.00th=[ 223], 10.00th=[ 235], 20.00th=[ 262],
| 30.00th=[ 297], 40.00th=[ 318], 50.00th=[ 330], 60.00th=[ 343],
| 70.00th=[ 355], 80.00th=[ 371], 90.00th=[ 400], 95.00th=[ 429],
| 99.00th=[ 523], 99.50th=[ 603], 99.90th=[ 1631], 99.95th=[ 2966],
| 99.99th=[ 8225]I just feel like you'd need thousands of concurrent users on a typical CRUD app to even get close to straining SQLite.
For small traffic, it's pretty simple to run it on the same host as web app, and unix auth means there are no passwords to manage. And once you need to have multiple writers, there is no need to rewrite all the database queries.
I share OPs skepticism. Market makers invest in microwave towers, FPGAs, etc. I would be surprised if sqlite backed by NVME is on the other end of all that specialized hardware.
Order matching is a single threaded thing though. I would be curious if anyone knows how electronic trading systems are actually implemented.
The browser console spams this link: https://react.dev/errors/418?invariant=418
edit: looks like it's caused by a userstyles extension injecting a dark theme into the page; React doesn't like it and the page silently breaks.
False alarm :) Amazing work!!
I was just trying to get a better understanding of what is happening under the hood :)
> I would be surprised if sqlite backed by NVME is on the other end of all that specialized hardware.
I was not making this assertion. I am surprised anything like it got inferred (i.e., my use of the word "premise" regarding single thread/writer policy).
I agree that what you describe would be ridiculous in practice.
> I would be curious if anyone knows how electronic trading systems are actually implemented.
> Order matching is a single threaded thing though.
I think you answered your own question.
[citation needed]. Local network access shouldn't be much different than local IPC.
In what production scenarios do MySQL, Postgres, DB2, Oracle, et. al., live on the same machine as the application that uses them?
I am pretty sure most of these vendors would offer strict guidance to not do that.
Then you'd be wrong. Running Postgres or MySQL on the same host where Apache is running is an extremely common scenario for sites starting out. They run together on 512 MB instances just fine. And on an SSD, that can often handle a surprising amount of traffic.
As popularity grows, the next step is to separate out the database on its own server, but mostly as a side effect of the fact that you now need multiple web servers, but still a single source of truth for data. Databases are lighter-weight than you seem to think.
But no, a local network hop doesn't introduce "orders of magnitude" more latency. The article itself describes how it is only 5x slower within a datacenter for the roundtrip part -- not 100x or 1,000x as you are claiming. But even that is generally significantly less than the time it takes the database to actually execute the query -- so maybe you see a 1% or 5% speedup of your query. It's just not a major factor, since queries are generally so fast anyways.
The kind of database latency that you seem to be trying to optimize for is a classic example of premature optimization. In the context of a web application, you're shaving microseconds for a page load time that is probably measured in hundreds of milliseconds for the user.
> I don't get to decide this. The business does.
You have enough power to design the entire database architecture, but you can't write and execute queries more efficiently, following best practices?
No they can't. That doesn't even make sense as a claim regarding bandwidth since SQLite doesn't use any, but please re-read what I said about being a 1% or 5% difference in speed. Not 10x.
Same-core context switching costs a few microseconds.
Going across core complexes can cost tens to hundreds of microseconds.
These figures are several orders of magnitude (5-6) slower than L1 access on the same thread.
Communication between processes is negligible compared to all of the sequential disk/SSD accesses and processing required for executing queries.
The database isn't stored in L1 and communication isn't taking hundreds of microseconds. I don't know where you're getting your information.
The fact that SQLite is in-process is primarily about simplicity and convenience, not performance. Performance can even be worse, e.g. due to the lack of a query cache.