Continuous reinvention: A brief history of block storage at AWS (allthingsdistributed.com)

385 points by riv991 1 year ago | 83 comments

mjb 1 year ago |

Super cool to see this here. If you're at all interested in big systems, you should read this.

> Compounding this latency, hard drive performance is also variable depending on the other transactions in the queue. Smaller requests that are scattered randomly on the media take longer to find and access than several large requests that are all next to each other. This random performance led to wildly inconsistent behavior.

The effect of this can be huge! Given a reasonably sequential workload, modern magnetic drives can do >100MB/s of reads or writes. Given an entirely random 4kB workload, they can be limited to as little as 400kB/s of reads or writes. Queuing and scheduling can help avoid the truly bad end of this, but real-world performance still varies by over 100x depending on workload. That's really hard for a multi-tenant system to deal with (especially with reads, where you can't do the "just write it somewhere else" trick).
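To make those numbers concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes decimal units (1 kB = 1000 bytes) and treats ">100MB/s" as exactly 100 MB/s, so the ratio it prints is a lower bound for this particular pair of figures, not a measurement.

```python
# Back-of-the-envelope numbers taken from the comment above.
SEQUENTIAL_BYTES_PER_S = 100 * 1000 * 1000  # ">100MB/s" sequential (lower bound)
RANDOM_BYTES_PER_S = 400 * 1000             # "400kB/s" worst-case random
REQUEST_BYTES = 4 * 1000                    # "4kB" requests (decimal kB assumed)


def iops(bytes_per_s: int, request_bytes: int) -> float:
    """Requests completed per second at a given throughput and request size."""
    return bytes_per_s / request_bytes


def ms_per_request(bytes_per_s: int, request_bytes: int) -> float:
    """Average service time per request, in milliseconds."""
    return 1000.0 / iops(bytes_per_s, request_bytes)


random_iops = iops(RANDOM_BYTES_PER_S, REQUEST_BYTES)
random_ms = ms_per_request(RANDOM_BYTES_PER_S, REQUEST_BYTES)
slowdown = SEQUENTIAL_BYTES_PER_S / RANDOM_BYTES_PER_S

print(f"random 4kB IOPS: {random_iops:.0f}")
print(f"time per random request: {random_ms:.0f} ms")
print(f"sequential vs. random throughput: {slowdown:.0f}x")
```

In other words, 400kB/s of 4kB requests is about 100 requests per second, or roughly 10 ms each, which is in the range you would expect when every request pays for a seek plus rotational delay.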

> To know what to fix, we had to know what was broken, and then prioritize those fixes based on effort and rewards.

This was the biggest thing I learned from Marc in my career (so far). He'd spend time working on visualizations of latency (like the histogram time series in this post) which were much richer than any of the telemetry we had, then tell a story using those visualizations, and completely change the team's perspective on the work that needed to be done. Each peak in the histogram came with its own story, and own work to optimize. Really diving into performance data - and looking at that data in multiple ways - unlocks efficiencies and opportunities that are invisible without that work and investment.

> Armed with this knowledge, and a lot of human effort, over the course of a few months in 2013, EBS was able to put a single SSD into each and every one of those thousands of servers.

This retrofit project is one of my favorite AWS stories.

> The thing that made this possible is that we designed our system from the start with non-disruptive maintenance events in mind. We could retarget EBS volumes to new storage servers, and update software or rebuild the empty servers as needed.

This is a great reminder that building distributed systems isn't just for scale. Here, we see how building the system in a way that can seamlessly tolerate the failure of a server, and move data around without loss, makes large scale operations (everything from day-to-day software upgrades to a massive hardware retrofit project) possible that just wouldn't be possible in a "simpler" architecture. A "simpler" architecture would make these operations much harder, to the point of being impossible, making the end-to-end problem we're trying to solve for the customer harder.

dekhn 1 year ago | |

It's funny you mentioned Marc worked on latency viz and used it to tell a story. Dick Lyon at Google did the same thing for Google's storage servers https://www.pdl.cmu.edu/SDI/2015/slides/DatacenterComputers.... (starting at Slide 62), identifying various queues and resource contention as major bottlenecks for block storage.

msolson 1 year ago | | |

A picture can be worth way more than a thousand words, but sometimes you have to iterate through a thousand pictures to find the one that tells the right story, or helps you ask the right question!

yetanotherdood 1 year ago | | |

Ah yes diskless borg :)

jedberg 1 year ago |

Ah, this brings back memories. Reddit was one of the very first users of EBS back in 2008. I thought I was so clever when I figured out that I could get more IOPS if I built a software raid out of five EBS volumes.

At the time each volume had very inconsistent performance, so I would launch seven or eight, and then run some write and read loads on each. I'd take the five best performers and then put them into a Linux software raid.

In the good case, I got the desired effect -- I did in fact get more IOPS than 5x a single node. But in the bad case, oh boy was it bad.

What I didn't realize was that if you're using a software raid, if one node is slow, the entire raid moves at the speed of the slowest volume. So this would manifest as a database going bad. It took a while to figure out it was the RAID that was the problem. And even then, removing the bad node was hard -- the software raid really didn't want to let go of the bad volume until it could finish writing out to it, which of course was super slow.
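A toy model of why one slow member drags the whole array down, assuming a striped layout where each full-stripe operation has to wait for every member volume. The latency distribution and the 50x slowdown factor are made up for illustration; they are not measurements of EBS or of Linux md.

```python
import random
import statistics


def volume_latency_ms(rng: random.Random, slow: bool) -> float:
    """One I/O latency sample for a hypothetical volume (made-up distribution)."""
    base = rng.uniform(1.0, 3.0)
    return base * 50.0 if slow else base


def stripe_latency_ms(rng: random.Random, slow_flags: list[bool]) -> float:
    """A full-stripe operation finishes only when every member has finished."""
    return max(volume_latency_ms(rng, slow) for slow in slow_flags)


def mean_stripe_latency(slow_flags: list[bool], samples: int = 10_000, seed: int = 0) -> float:
    """Average stripe latency over many simulated operations."""
    rng = random.Random(seed)
    return statistics.fmean(stripe_latency_ms(rng, slow_flags) for _ in range(samples))


healthy = mean_stripe_latency([False] * 5)
one_slow = mean_stripe_latency([False, False, False, False, True])

print(f"five healthy volumes: {healthy:.1f} ms per stripe")
print(f"four healthy, one slow: {one_slow:.1f} ms per stripe")
```

Because the stripe latency is the maximum over the members, four fast volumes do nothing to hide the one slow one.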

And then I would put in a new EBS volume and have to rebuild the array, which of course it was also bad at because it would be bottlenecked on the IOPS for the new volume.

We moved off of those software raids after a while. We almost never used EBS at Netflix, in part because I would tell everyone who would listen about my folly at reddit, and because they had already standardized on using only local disk before I ever got there.

And an amusing side note, when AWS had that massive EBS outage, I still worked at reddit and I was actually watching Netflix while I was waiting for the EBS to come back so I could fix all the databases. When I interviewed at Netflix one of the questions I asked them was "how were you still up during the EBS outage?", and they said, "Oh, we just don't use EBS".

cyberax 1 year ago | |

> Ah, this brings back memories. Reddit was one of the very first users of EBS back in 2008. I thought I was so clever when I figured out that I could get more IOPS if I build a software raid out of five EBS volumes.

Hey! We also did that! It turned out that eventually you hit the network bandwidth limit. I think the performance topped out at around 160 megabytes per second for most of the instance types back then.
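As a rough sketch of that ceiling: striping adds up per-volume throughput only until the instance's network link becomes the bottleneck. The 160 MB/s cap is the figure recalled above; the 40 MB/s per-volume number is purely hypothetical, chosen to make the arithmetic easy.

```python
NETWORK_LIMIT_MB_S = 160.0  # figure recalled in the comment above
PER_VOLUME_MB_S = 40.0      # hypothetical per-volume throughput, for illustration only


def striped_throughput_mb_s(volumes: int,
                            per_volume: float = PER_VOLUME_MB_S,
                            network_limit: float = NETWORK_LIMIT_MB_S) -> float:
    """Aggregate throughput of a stripe set, capped by the instance's network link."""
    return min(volumes * per_volume, network_limit)


for n in range(1, 9):
    print(f"{n} volumes -> {striped_throughput_mb_s(n):.0f} MB/s")
```

Under these assumptions, adding a fifth volume buys nothing, because the link is already saturated at four.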

mgdev 1 year ago |

It's cool to read this.

One interesting tidbit is that during the period this author writes about, AWS had a roughly 4-day outage (impacted at least EC2, EBS, and RDS, iirc), caused by EBS, that really shook folks' confidence in AWS.

It resulted in a reorg and much deeper investment in EBS as a standalone service.

It also happened around the time Apple was becoming a customer, and AWS in general was going through hockey-stick growth thanks to startup adoption (Netflix, Zynga, Dropbox, etc).

It's fun to read about these technical and operational bits, but technical innovation in production is messy, and happens against a backdrop of Real Business Needs.

I wish more of THOSE stories were told as well.

BikiniPrince 1 year ago | |

It was a good year after that incident. We focused on stability and driving down issues. We turned around a lot of development ideas too. However, the wheel turns and we were back on feature development. I’ll always remember that year as having the fewest escalations during my entire time there.

abrookewood 1 year ago |

This is the bit I found curious: "adding a small amount of random latency to requests to storage servers counter-intuitively reduced the average latency and the outliers due to the smoothing effect it has on the network".

Can anyone explain why?

wmf 1 year ago | |

Synchronized network traffic can cause incast or other buffer overflows.

refibrillator 1 year ago | | |

Yeah jitter is generally used to mitigate “thundering herd” type problems because it reduces the peak load by spreading it out over time.
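A minimal simulation of that spreading effect. It is a toy model, not how EBS does it: a hypothetical 1000 clients all fire at the same instant, and each optionally adds a uniform random delay. The metric is the largest number of requests landing in any single millisecond.

```python
import random
from collections import Counter


def peak_requests_per_ms(send_times_ms: list[float]) -> int:
    """Largest number of requests that land in any single 1 ms bucket."""
    buckets = Counter(int(t) for t in send_times_ms)
    return max(buckets.values())


def send_times(clients: int, max_jitter_ms: float, seed: int = 0) -> list[float]:
    """All clients fire at t=0; each adds a uniform random delay up to max_jitter_ms."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_jitter_ms) for _ in range(clients)]


synchronized = peak_requests_per_ms(send_times(clients=1000, max_jitter_ms=0.0))
jittered = peak_requests_per_ms(send_times(clients=1000, max_jitter_ms=50.0))

print(f"peak without jitter: {synchronized} requests in one millisecond")
print(f"peak with up to 50 ms jitter: {jittered} requests in one millisecond")
```

Each request waits a bit longer on average, but the burst a switch buffer or server queue has to absorb is far smaller, which is where the lower tail latency comes from.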

simonebrunozzi 1 year ago |

If you're curious, this is a talk I gave back in 2009 [0] about Amazon S3 internals. It was created from internal assets by the S3 team, and a lot in there influenced how EBS was developed.

[0]: https://vimeo.com/7330740

lysace 1 year ago |

I liked the part about them manually retrofitting an SSD in every EBS unit in 2013. That looks a lot like a Samsung SATA SSD:

https://www.allthingsdistributed.com/images/mo-manual-ssd.pn...

I think we got SSDs installed in blades from Dell well before that, but I may be misremembering.

I/O performance was a big thing in like 2010/2011/2012. We went from spinning HDs to Flash memory.

I remember experimenting with these raw Flash-based devices, no error/wear level handling at all. Insanity, but we were all desperate for that insane I/O performance bump from spinning rust to silicon.

BikiniPrince 1 year ago | |

It was only a handful of frankenracks. It was challenging and not very performant, but it let everyone get a jump on the research. Disk speed was increasing so fast that in six months the first SKU was out of date. I’m glad I didn’t have to make the argument directly to assets when we retired those racks years earlier than planned. The rack positions were so much more valuable with the new denser and faster models.

rnts08 1 year ago |

This gives me fond memories of building storage-as-a-service infrastructure back before we had useful opensource stuff, moving away from sun san, fibrechannel and solaris we landed on glusterfs on supermicro storage servers, running linux and nfs. We peaked at almost 2PB before I moved on in 2007.

Secondly it reminds me of the time when it simply made sense to ninja-break and rebuild mdraids with ssds in place of the spinning drives WHILE the servers were running (sata kind of supported hotswapping the drives). Going from spinning to ssd gave us a 14x increase in IOPS in the most important system of the platform.

0xbadcafebee 1 year ago |

At the very start of my career, I got to work for a large-scale (technically/logistically, not in staff) internet company doing all the systems stuff. The number of lessons I learned in such a short time was crazy. Since leaving them, I learned that most people can go almost their whole careers without running into all those issues, and so don't learn those lessons.

That's one of the reasons why I think we should have a professional license. By requiring an apprenticeship under a master engineer, somebody can pick up incredibly valuable knowledge and skills (that you only learn by experience) in a very short time frame, and then be released out into the world to be much more effective throughout their career. And as someone who also interviews candidates, some proof of their experience and a reference from their mentor would be invaluable.

ponector 1 year ago | |

Imagine you got your license and are then tasked to make a crud service with some simple UI because that is what is needed for the client and they cannot use unlicensed developers.

0xbadcafebee 1 year ago | | |

That wouldn't happen. Professional licenses vary by trade, state, cost of project, size of project, impact of work, etc, etc. If it's trivial, you don't need anything. If it is critical, you need a bunch. If it's in between, it depends. The world isn't black and white.

lispisok 1 year ago | | |

It's a common misunderstanding that a professional license would be required to perform any kind of work, which is not true of the professional engineering license.

herodoturtle 1 year ago |

Loved this:

> While the much celebrated ideal of a “full stack engineer” is valuable, in deep and complex systems it’s often even more valuable to create cohorts of experts who can collaborate and get really creative across the entire stack and all their individual areas of depth.

tanelpoder 1 year ago |

The first diagram in that article is incorrect/quite outdated. Modern computers have most PCIe lanes going directly into the CPU (IO Hub or "Uncore" area of the processor), not via a separate PCH like in the old days. That's an important development for both I/O throughput and latency.

Otherwise, great article, illustrating that it's queues all the way down!

msolson 1 year ago | |

Thanks for the comment, and you're right, modern computers do have a much better architecture! As I was laying out the story I was thinking about what it looked like when we started. I'll clarify in the image caption that it's from that era.

bravetraveler 1 year ago | | |

We may be going full circle with consumer systems being so light for lanes under PCI-e gen5!

There's usually enough for a GPU, SSD or two... and that's about it. I don't like having to spend so much for fast IO, dangit.

Can sometimes find boards that do switching to appease :/

tanelpoder 1 year ago | | |

Cool, yep this is just a minor detail and doesn't change what the article itself conveys.

hodgesrm 1 year ago | | |

Thanks for a very informative article. It would be interesting to have more detail about how EBS achieved a strong culture around quality and system resilience. Maybe a future post?

pbw 1 year ago |

Early on, the cloud's entire point was to use "commodity hardware," but now we have hyper-specialized hardware for individual services. AWS has Graviton, Inferentia and Trainium chips. Google has TPUs and Titan security cards; Azure uses FPGAs and Sphere for security. This trend will continue.

moralestapia 1 year ago |

Great article.

"EBS is capable of delivering more IOPS to a single instance today than it could deliver to an entire Availability Zone (AZ) in the early years on top of HDDs."

Dang!

apitman 1 year ago |

What's the best way to provide a new EC2 instance with a fast ~256GB dataset directory? We're currently using EBS volumes but it's a pain to do updates to the data because we have to create a separate copy of the volume for each instance. EFS was too slow. Instance storage SSDs are ephemeral. Haven't tried FSx Lustre yet.

MaBu 1 year ago | |

EFS supports 30 GiB/s throughput now. https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-ef...

Otherwise instance drive and sync over S3.

ayewo 1 year ago | | |

Instance storage can be incredibly fast for certain workloads but it's a shame AWS doesn't offer instance storage on Windows EC2 instances.

Instance storage seems to only be available for (large) Linux EC2 instances.

apitman 1 year ago | | |

Impressive, but I don't think we ever determined conclusively whether our EFS problems were caused by throughput or latency.

Also, throughput is going to be limited by your instance type, right? Though that might also be the case for EBS. I can't remember. Part of the problem is AWS performance is so confusing.

mannyv 1 year ago |

The most surprising thing is that the author had no previous experience in the domain. It's almost impossible to get hired at AWS now without domain expertise, AFAIK.

msolson 1 year ago | |

At least in the organizations I'm a part of this isn't true. We do look for both specialists and generalists, and focus on experience and how it could apply.

It's difficult to innovate by just repeating what's been done before. But everything you learn along the way helps shape that innovation.

flybarrel 1 year ago | |

I work in EBS. I had no storage background when I joined 3 years ago :)

dasloop 1 year ago |

So true and valid of almost all software development:

> In retrospect, if we knew at the time how much we didn’t know, we may not have even started the project!

Silasdev 1 year ago |

Great read, although a shame that it didn't go any further than adding the write cache SSD solution, which must have been many years ago. I was hoping for a little more recent info on the EBS architecture.

swozey 1 year ago |

I had no idea Werner Vogels had a systems blog. Awesome read, thanks.

tw04 1 year ago |

I think the most fascinating thing is watching them relearn every lesson the storage industry already knew about a decade earlier. Feels like most of this could have been solved by either hiring storage industry experts or just acquiring one of the major vendors.