Magical Block Store: Why EBS Can't Work

Magical Block Store: Why EBS Can't Work (joyeur.com)

124 points by lindvall 15 years ago | 53 comments

blantonl 15 years ago |

I am an active user of EBS on a highly trafficked Web property, and came from a long and tedious background in enterprise software.

I really think that one paragraph in his blog post summed everything up quite nicely. It could not ring more true:

> My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for “consolidated idle workloads.” When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.

edw 15 years ago |

This awesome entry perfectly captures why I have always hated NFS. I can deal with the possibility that if a machine's hard drive dies, my system is going to have a very hard time continuing to operate in a normal manner, but then NFS comes along, and you realize that all sorts of I/O operations that previously employed a piece of equipment that failed once every two and a half years now depend on a working network with a working NFS server on that network, and the combination of that network and that server is orders of magnitude less reliable.

And now you have situations on a regular basis where you type "ls" and your shell hangs and not even "kill -9" is going to save you. And you go back to using FTP or some other abstraction that does not apply 40,000-hour MTBF thinking to equipment that disappears for coffee breaks daily.
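To make the hung "ls" concrete, here is a minimal Python sketch of one way a caller can refuse to wait forever on a directory listing. The function name `listdir_with_timeout` and the two-second default are assumptions for illustration, not anything from the article or the thread. It does not fix the underlying problem: a worker stuck in an uninterruptible NFS call stays stuck; the caller simply stops waiting for it.

```python
import os
import queue
import threading


def listdir_with_timeout(path, timeout_s=2.0):
    """Return os.listdir(path), or None if it has not finished after timeout_s.

    Hypothetical helper: the listing runs in a daemon thread so that a hung
    mount cannot block the caller (or interpreter exit) indefinitely.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", os.listdir(path)))
        except OSError as exc:
            # Ordinary errors (missing directory, permissions) are passed back.
            result.put(("error", exc))

    threading.Thread(target=worker, daemon=True).start()

    try:
        status, value = result.get(timeout=timeout_s)
    except queue.Empty:
        # The call is still blocked; the worker thread is abandoned, not killed.
        return None
    if status == "error":
        raise value
    return value
```

A caller might use `listdir_with_timeout("/mnt/nfs/share")` (a hypothetical mount point) and treat `None` as "storage unavailable" instead of blocking a request handler.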

chubot 15 years ago | |

A thousand times yes. My first thought when hearing about the EBS outage was "wow that seems more fragile than NFS, no wonder it failed spectacularly." NFS presents you with this nice familiar file system interface, and then random sys admins and programmers start creating a tangled mess of dependencies by dropping stuff there, without regard to what happens when it fails. Like the EBS outage, the failures tend to surprise people.

The great quote by Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

And this was an excellent and honest article about faulty programming abstractions. It's basically bashing you over the head with the "Fallacies of distributed computing". Don't silently turn local operations into remote operations. They're not the same thing and have to be treated differently at all levels.

Even Werner Vogels wrote a diatribe against "transparency", which is the same issue by another name: http://scholar.google.com/scholar?cluster=700969849916494972...

So I wonder what he thinks of this architectural choice. You have to give up something when communicating over the network. Vogels seems to have chosen to give up consistency rather than availability in his designs. This paper was a turning point in his research. Its candor surprised me.

The file system interface does not let you relax consistency, so by default you have chosen to give up availability. As the Joyent guys honestly remarked, this often has to be learned the hard way.

ssmoot 15 years ago | |

I don't know how NFS keeps coming up. It's an entirely different use case. It doesn't help the credibility of a critique on networked block storage to harp on a vendor specific implementation of a technology that doesn't even operate in the same sphere.

An NFS server is very simple. With NFS on its own VLAN, and some very basic QoS, there's no reason an NFS server should be the weak point in your infrastructure. Especially since it's resilient to disconnection on a flaky network.

If you're looking for 100% availability, sure, NFS is probably not the answer. If on the other hand you're running a website, and would rather trade a few bad requests for high-availability and portability, then NFS can be a great fit.

None of that has anything to do with EBS or block-storage though.

Joyent's position is that iSCSI was flaky for them because of unpredictable loads on under-performing equipment. The situation would degrade to the point that they could only attach a couple VM hosts to a pair of servers for example, and they were slicing the LUNs on the host, losing the flexibility networked block-storage provides for portability between systems.

Here's what we do:

We export an 80GB LUN for every running application from two SAN systems.

These systems are home-grown, based on Nexenta Core Platform v3. We don't use de-dupe since the DDT kills performance (and if Joyent was using it, then is local storage without it really a fair comparison?). We provide SSDs for ZIL and L2ARC.

These LUNs are then mirrored on the Dom0. That part is key. Most storage vendors want to create a black-box, bullet-proof "appliance". That's garbage. If it worked maybe it wouldn't be a problem, but in practice these things are never bullet-proof, and a failover in the cluster can easily mean no availability for the initiators for some short period of time. If you're working with Solaris 10, this can easily cause a connection timeout. Once that happens you must reboot the whole machine even if it's just one offline LUN.

It's a nightmare. Don't use Solaris 10.

snv_134 will reconnect eventually. Much smoother experience. So you zpool mirror your LUNs. Now you can take each SAN box offline for routine maintenance without issue. If one of them out-right fails, even with dozens of exported LUNs you're looking at a minute or two while the Dom0 compensates for the event and stops blocking IO.

These systems are very fast. Much faster than local storage is likely to be without throwing serious dollars at it.

These systems are very reliable. Since they can be snapshotted independently, and the underlying file-systems are themselves very reliable, the risk of data-loss is so small as to be a non-issue.

They can be replicated easily to tertiary storage, or offline incremental backup easily.

Taking the system out would require a network melt-down.

To compensate for that you spread link-aggregated connections across stacked switches. If a switch goes down, you're still operational. If a link goes down, you're still operational. The SAN interfaces are on their own VLAN, and the physical interfaces are dedicated to the Dom0. The DomU's are mapped to their own shared NIC.

The Dom0, or either of its NICs, is still a single point of failure. So you make sure to have two of them. Applications mount HA-NFS shares for shared media. You don't depend on stupid gimmicks like live-migration. You just run multiple app instances and load-balance between them.

You quadruple your (thinly provisioned) storage requirements this way, but this is how you build a bullet-proof system using networked storage (both block (iSCSI) and filesystem (NFS)) for serving web-applications.

If you pin yourself to local storage you have massive replication costs, and you commit yourself to very weak recovery options. Locality of your data kills you when there's a problem. You're trading effective capacity planning for panic fixes when things don't go so smoothly.

This is why it takes forever to provision anything at Rackspace Cloud, and when things go wrong, you're basically screwed.

Because instead of proper planning, they'd rather not have to concern themselves with availability of your systems/data.

It's not a walk in the park, but if you can afford to invest in your own infrastructure and skills, you can achieve results that are better in every way.

Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN systems, but that matters mostly if you're trying to squeeze margins as a hosting provider. Their problems are not ours...

chubot 15 years ago | | |

The point of the article is that you are taking an ancient interface and using it for something new. Millions of lines of code were written against that interface with old assumptions, and now you've moved it to a new implementation without changing any of it. Things are bound to go wrong.

When you move sqlite to NFS, for example, file locking probably won't work. There is nothing to tell you this.
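Since nothing in the file API announces that a path has moved to the network, an application that cares has to ask. The sketch below is one hedged way to do that on Linux: it takes the text of `/proc/mounts` and reports the filesystem type of the longest matching mount point. The helper names and the set of "network" types are illustrative assumptions, and the parser ignores escaped spaces in mount points.

```python
import os

# Illustrative, not exhaustive: filesystem types treated as "network" here.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs"}


def fs_type_for_path(path, mounts_text):
    """Return the fs type of the mount that contains path, or None.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields "device mount_point fs_type options ...".
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; later entries shadow earlier ones.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_network_fs(path, mounts_text):
    """True if path sits on one of the network filesystem types listed above."""
    return fs_type_for_path(path, mounts_text) in NETWORK_FS_TYPES
```

An application could read `/proc/mounts` at startup and log a warning, or refuse to start, when its SQLite file turns out to live on such a mount.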

It sounds like you have experience making NFS work well, but I don't see how anything you wrote addresses this point. In fact I think you're just echoing some of the article's points about "enterprise planning". AFAICT you come from the enterprise world and are advocating overprovisioning, which is fine, but not the same context.

edw 15 years ago | | |

I brought up NFS because it's an example of a service that implements an abstraction but does so in a way that undermines the assumptions of the implemented abstraction. I do not disagree that local disks are an unrealistic strategy for creating a scalable, fault-tolerant system. The disk abstraction is of limited utility when creating such systems, because "disk thinking" leads to giving in to seductive assumptions about the performance and reliability of the storage resources you have at your disposal.

agazso 15 years ago | |

I hate NFS for the same reasons that you just wrote, but to be honest it is not a protocol issue, but rather an implementation issue.

If NFS were implemented totally in userspace (like FTP), it would not hang the entire system when something breaks. On the other hand, it would be much slower than it is, therefore it would be unsuitable for a lot of use-cases where it is used now.

moe 15 years ago | | |

NFS hangs the system on failures because it is a shoddy implementation, not because it happens to be implemented in kernel space.

I think the old CVS quote by Tom Lord applies here:

  CVS has some strengths. It's a very stable piece of
  code, largely because nobody wants to work on it anymore.

prodigal_erik 15 years ago |

He didn't touch on Joyent's 2+ day partial outage a couple months ago: http://news.ycombinator.com/item?id=2269329

jamie 15 years ago | |

Don't forget about this: http://www.datacenterknowledge.com/archives/2008/01/15/joyen...

jamie 15 years ago | | |

I think both of these links illustrate that errors happen, mistakes happen, software has bugs, and Murphy's law always strikes. The question is, when it strikes, do you have enough control to fix the problem? If you've outsourced the solution, does the provider have enough control/knowledge to fix the problem?

These things will get much worse before they get better, and it's best to think of all these abstractions as being a double-edged sword.

SoftwareMaven 15 years ago |

Many things in software are impossible magic, until they are not. His argument boils down to "it is a hard problem that nobody has solved yet." That doesn't mean nobody will ever solve it.

Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.

blantonl 15 years ago | |

> Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.

That presumption assumes that the application is being used as the right tool to resolve the problem. And it also assumes that "the problem" is a finite and solvable item.

sigil 15 years ago | | |

> And it also assumes that "the problem" is a finite and solvable item.

Yes. To make this a bit more concrete, if "the problem" is making distributed storage look and behave exactly like local storage, the CAP Theorem has something to say about its solvability.

johnb 15 years ago |

It's funny how disk abstractions get you every time.

We used to store and process all of our uploads from our Rails app on a GFS partition. GFS behaved like a normal disk most of the time, but we started having trouble processing concurrent uploads and couldn't replicate in dev.

It turned out that, for GFS to work at all, it had different locking than regular disks. Every time you created a new file it had to lock the containing folder. We solved it by splitting our upload folder into 1000 sequential buckets and writing each upload to the next folder along... but it took us a long time to stop assuming it was a regular disk.
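The bucketing workaround described above could look roughly like the following Python sketch (the original app was Rails; the class name, the zero-padded folder naming, and the in-process counter are all assumptions made for illustration). Each new upload goes to the next folder in sequence, so concurrent uploads rarely contend for the same directory lock.

```python
import itertools
import os
import threading

NUM_BUCKETS = 1000  # matches the "1000 sequential buckets" in the comment


class BucketedUploadDir:
    """Hand out upload paths that rotate through numbered subfolders."""

    def __init__(self, root, num_buckets=NUM_BUCKETS):
        self.root = root
        self.num_buckets = num_buckets
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_path(self, filename):
        # The lock only protects the in-process counter, not the filesystem.
        with self._lock:
            n = next(self._counter)
        bucket = n % self.num_buckets
        # Zero-padded names such as "007" keep the folders sorted.
        return os.path.join(self.root, f"{bucket:03d}", filename)
```

The counter is per process, so several app servers would each rotate independently; that still spreads writes across folders, which is all the workaround needs.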

sciurus 15 years ago | |

FWIW, this behavior is explained early on in the documentation for GFS2.

johnb 15 years ago | | |

As we were using EngineYard for hosting at the time, everything was set up for us and we never thought to look it up.

We now pay a lot more attention to the underlying stack. Even if you've outsourced hosting (either cloud or managed physical servers), you still need to know every component yourself.

spullara 15 years ago |

Also worth noting is that Amazon isn't forcing you to use EBS. They also have tons of fast local storage available to RAID as you wish.

lindvall 15 years ago | |

I strongly believe one of the most positive aspects of EC2 was that it demonstrated a beautiful philosophy that a node and its disks should not be relied upon to always be around and pushed it into the mainstream.

Even for people who didn't use EC2 the existence of the platform caused more people to rethink their architectures to try to rely less on Important Nodes.

EBS is a step back from that philosophy and it's a point worth noting.

One of the great things this post does is enumerate some of the underlying reasons why relying on EBS will inevitably lead to more failures and in ways that are harder and harder to diagnose.

leoc 15 years ago | | |

> EBS is a step back from that philosophy and it's a point worth noting.

Amazon doesn't use EBS itself, right? Isn't EBS something that AWS allowed its customers to nag it into against (what it considers) its better judgement?

blantonl 15 years ago | |

I agree. We run all of our MySQL and Mongo slave servers with local RAID-0 ephemeral storage. One dies? So what, we remove it from the pool and provision another.

Only our master and our slave backup server run on EBS. We aren't as write oriented so we can live with some of the limitations of EBS, but we've even considered moving our master MySQL and Mongo servers to ephemeral storage and just relying on our slave backup database server to run on EBS (we take freeze/snapshots of it often). That server rarely falls far behind in relay updates.

epi0Bauqu 15 years ago | | |

Do you run xlarge to get the extra ephemeral disks?

cagenut 15 years ago |

It's really fascinating to watch Amazon re-learn/re-implement the lessons IBM baked into mainframes decades ago. Once you get out of shared-nothing/web-scripting land you realize that I/O is much more important and difficult than CPU. What Amazon calls EBS, IBM has been calling "DASD" forever. I wonder if there are any crossover lessons that they haven't taken advantage of because there just aren't any old IBMers working at Amazon.

blantonl 15 years ago | |

IBM's implementation of DASD on the mainframe was always implemented under the assumption that it was a secondary storage medium for data. Meaning, it wasn't accessed often, and it wasn't implemented for top performance.

Think of a bridge between high performance disk and tape.

spudlyo 15 years ago |

> Trying to use a tool like iostat against a shared, network provided block device to figure out what level of service your database is getting from the filesystem below it is an exercise in frustration that will get you nowhere.

This may be true under Solaris. Since 2.5, Linux has had /proc/diskstats and an iostat that shows the average i/o request latency (await) for a disk, network or otherwise. For EBS it's 40ms or less on a good day. On a bad day it's 500ms or more if your i/o requests get serviced at all.
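For readers who want to see where a number like "40ms" comes from, here is a small Python sketch of the await calculation from two `/proc/diskstats` samples of the same device: the change in milliseconds spent on reads and writes, divided by the change in completed reads and writes. The function names are hypothetical and the device name in the test data (`xvdf`) is made up; only the first eleven whitespace-separated fields of each line are used.

```python
def parse_diskstats_line(line):
    """Pick the counters needed for await out of one /proc/diskstats line.

    Field layout (0-based after splitting on whitespace):
    0 major, 1 minor, 2 device name, 3 reads completed, 6 ms spent reading,
    7 writes completed, 10 ms spent writing.
    """
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),
        "read_ms": int(f[6]),
        "writes": int(f[7]),
        "write_ms": int(f[10]),
    }


def await_ms(before_line, after_line):
    """Average latency per completed I/O (ms) between two samples of one device."""
    before = parse_diskstats_line(before_line)
    after = parse_diskstats_line(after_line)
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios == 0:
        # No I/O completed in the interval, so there is no latency to average.
        return 0.0
    busy_ms = (after["read_ms"] - before["read_ms"]) + (
        after["write_ms"] - before["write_ms"]
    )
    return busy_ms / ios
```

Sampling the line for one device a few seconds apart and feeding both samples to `await_ms` gives a figure comparable to the one the comment quotes, without depending on a particular iostat version.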

alecco 15 years ago |

Amazon Six Sigma "Blackbelts", meet Mr. Black Swan.

Edit: my point is you can't hide unexpected/unknown events on statistical models; we should know better, coming from CS.

lobster_johnson 15 years ago |

> It’s commonly believed that EBS is built on DRBD with a dose of S3-derived replication logic.

Actually, it was discovered some time ago (http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...) that EBS probably used Red Hat's open-source GNBD: http://sourceware.org/cluster/gnbd/

CPlatypus 15 years ago |

He only gets it half right. A filesystem interface instead of a block interface is the right choice IMO. Private storage instead of distributed storage is the wrong choice for capacity, performance, and (most importantly) availability reasons. They didn't go with a ZFS-based solution because it was the best fit to requirements. They went with it because they had ZFS experts and advocates on staff.

As Schopenhauer said, every man mistakes the limits of his own vision for the limits of the world, and these are people who've failed to Get It when it comes to distributed storage ever since they tried and failed to make ZFS distributed (leading to the enlistment of the Lustre crew who have also largely failed at the same task). If they can't solve a problem they're arrogant enough to believe nobody can, so they position DAS and SAN as the only possible alternatives.

Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind of distributed storage people should be using for this sort of thing. I've also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly of Sun and now of Joyent, about ZFS FUD.