Magical Block Store: Why EBS Can't Work (joyeur.com)
I really think that one paragraph in his blog post summed everything up quite nicely. It could not ring more true:
My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for “consolidated idle workloads.” When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.
And now you have situations on a regular basis where you type "ls" and your shell hangs and not even "kill -9" is going to save you. And you go back to using FTP or some other abstraction that does not apply 40,000 hour MTBF thinking to equipment that disappears for coffee breaks daily.
The great quote by Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
And this was an excellent and honest article about faulty programming abstractions. It's basically bashing you over the head with the "Fallacies of distributed computing". Don't silently turn local operations into remote operations. They're not the same thing and have to be treated differently at all levels.
Even Werner Vogels wrote a diatribe against "transparency", which is the same issue by another name: http://scholar.google.com/scholar?cluster=700969849916494972...
So I wonder what he thinks of this architectural choice. You have to give up something when communicating over the network. Vogels seems to have chosen consistency rather than availability in his designs. This paper was a turning point in his research. Its candor surprised me.
The file system interface does not let you relax consistency, so by default you have chosen availability. As the Joyent guys honestly remarked, this often has to be learned the hard way.
An NFS server is very simple. With NFS on its own VLAN, and some very basic QoS, there's no reason an NFS server should be the weak point in your infrastructure. Especially since it's resilient to disconnection on a flaky network.
If you're looking for 100% availability, sure, NFS is probably not the answer. If on the other hand you're running a website, and would rather trade a few bad requests for high-availability and portability, then NFS can be a great fit.
None of that has anything to do with EBS or block-storage though.
Joyent's position is that iSCSI was flaky for them because of unpredictable loads on under-performing equipment. The situation would degrade to the point that they could only attach a couple VM hosts to a pair of servers for example, and they were slicing the LUNs on the host, losing the flexibility networked block-storage provides for portability between systems.
Here's what we do:
We export an 80GB LUN for every running application from two SAN systems.
These systems are home-grown, based on Nexenta Core Platform v3. We don't use de-dupe since the DDT kills performance (and if Joyent was using it, then is local storage without it really a fair comparison?). We provide SSDs for ZIL and L2ARC.
These LUNs are then mirrored on the Dom0. That part is key. Most storage vendors want to create a black-box, bullet-proof "appliance". That's garbage. If it worked maybe it wouldn't be a problem, but in practice these things are never bullet-proof, and a failover in the cluster can easily mean no availability for the initiators for some short period of time. If you're working with Solaris 10, this can easily cause a connection timeout. Once that happens you must reboot the whole machine even if it's just one offline LUN.
It's a nightmare. Don't use Solaris 10.
snv_134 will reconnect eventually. Much smoother experience. So you zpool mirror your LUNs. Now you can take each SAN box offline for routine maintenance without issue. If one of them out-right fails, even with dozens of exported LUNs you're looking at a minute or two while the Dom0 compensates for the event and stops blocking IO.
These systems are very fast. Much faster than local storage is likely to be without throwing serious dollars at it.
These systems are very reliable. Since they can be snapshotted independently, and the underlying file-systems are themselves very reliable, the risk of data-loss is so small as to be a non-issue.
They can easily be replicated to tertiary storage or backed up incrementally offline.
Taking the system out would require a network melt-down.
To compensate for that you spread link-aggregated connections across stacked switches. If a switch goes down, you're still operational. If a link goes down, you're still operational. The SAN interfaces are on their own VLAN, and the physical interfaces are dedicated to the Dom0. The DomU's are mapped to their own shared NIC.
The Dom0, or either of its NICs, is still a single point of failure. So you make sure to have two of them. Applications mount HA-NFS shares for shared media. You don't depend on stupid gimmicks like live-migration. You just run multiple app instances and load-balance between them.
You quadruple your (thinly provisioned) storage requirements this way, but this is how you build a bullet-proof system using networked storage (both block (iSCSI) and filesystem (NFS)) for serving web-applications.
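The arithmetic behind that trade can be sketched roughly as follows. This is only an illustration: the 80GB LUN size comes from the comment above, but the 99% per-box availability is a made-up figure, failures are assumed to be independent, and reading "quadruple" as two mirrored SAN boxes times two app instances is an assumption about what the author meant.

```python
def mirrored_availability(single: float, copies: int = 2) -> float:
    """Chance that at least one of `copies` targets is up.

    Assumes failures are independent, which real SAN boxes sharing a
    rack, switch, or power feed may not satisfy.
    """
    return 1.0 - (1.0 - single) ** copies


def raw_storage_gb(lun_gb: int, san_mirrors: int = 2, app_instances: int = 2) -> int:
    """Raw capacity consumed when every app instance mirrors its LUN."""
    return lun_gb * san_mirrors * app_instances


if __name__ == "__main__":
    # Hypothetical per-box availability; not a number from the thread.
    print(mirrored_availability(0.99))
    # One 80GB LUN, mirrored across two SANs, for each of two app instances.
    print(raw_storage_gb(80))
```

The point of the sketch is only that the fourfold storage cost buys a squared failure probability per mirror, before thin provisioning claws some of the capacity back.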
If you pin yourself to local storage you have massive replication costs, and you commit yourself to very weak recovery options. Locality of your data kills you when there's a problem. You're trading effective capacity planning for panic fixes when things don't go so smoothly.
This is why it takes forever to provision anything at Rackspace Cloud, and when things go wrong, you're basically screwed.
Because instead of proper planning, they'd rather not have to concern themselves with availability of your systems/data.
It's not a walk in the park, but if you can afford to invest in your own infrastructure and skills, you can achieve results that are better in every way.
Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN systems, but that matters mostly if you're trying to squeeze margins as a hosting provider. Their problems are not ours...
When you move sqlite to NFS, for example, file locking probably won't work. There is nothing to tell you this.
It sounds like you have experience making NFS work well, but I don't see how anything you wrote addresses this point. In fact I think you're just echoing some of the article's points about "enterprise planning". AFAICT you come from the enterprise world and are advocating overprovisioning, which is fine, but not the same context.
If NFS were implemented totally in userspace (like FTP), it would not hang the entire system when something breaks. On the other hand, it would be much slower than it is, therefore it would be unsuitable for a lot of use-cases where it is used now.
I think the old CVS quote by Tom Lord applies here:
CVS has some strengths. It's a very stable piece of code, largely because nobody wants to work on it anymore.
These things will get much worse before they get better, and it's best to think of all these abstractions as being a double-edged sword.
Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.
That presumption assumes that the application is being used as the right tool to resolve the problem. And it also assumes that "the problem" is a finite and solvable item.
Yes. To make this a bit more concrete, if "the problem" is making distributed storage look and behave exactly like local storage, the CAP Theorem has something to say about its solvability.
We used to store and process all of our uploads from our rails app on a GFS partition. GFS behaved like a normal disk most of the time, but we started having trouble processing concurrent uploads and couldn't replicate in dev.
It turned out that, for GFS to work at all, it had to lock differently from a regular disk. Every time you created a new file, it had to lock the containing folder. We solved it by splitting our upload folder into 1000 sequential buckets and writing each upload to the next folder along... but it took us a long time to stop assuming it was a regular disk.
We now pay a lot more attention to the underlying stack. Even if you've outsourced hosting (either cloud or managed physical servers), you still need to know every component yourself.
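A minimal sketch of that bucketing workaround, in Python rather than the Rails code the commenter actually used. The function names, the zero-padded folder naming, and the use of an upload sequence number are all assumptions made for illustration; only the count of 1000 buckets comes from the comment.

```python
import os
import tempfile

BUCKETS = 1000  # bucket count mentioned in the comment above


def bucket_dir(root: str, sequence_number: int, buckets: int = BUCKETS) -> str:
    """Pick the bucket folder for an upload by cycling through them in order."""
    return os.path.join(root, f"{sequence_number % buckets:03d}")


def store_upload(root: str, sequence_number: int, filename: str, data: bytes) -> str:
    """Write one upload into its bucket so consecutive uploads land in
    different folders and do not contend for the same directory lock."""
    target = bucket_dir(root, sequence_number)
    os.makedirs(target, exist_ok=True)
    path = os.path.join(target, filename)
    with open(path, "wb") as fh:
        fh.write(data)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for n in range(3):
            print(store_upload(root, n, f"upload-{n}.bin", b"payload"))
```

Concurrent uploads then create files in different directories, so a per-directory lock on file creation is rarely held by two writers at once.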
Even for people who didn't use EC2 the existence of the platform caused more people to rethink their architectures to try to rely less on Important Nodes.
EBS is a step back from that philosophy and it's a point worth noting.
One of the great things this post does is enumerate some of the underlying reasons why relying on EBS will inevitably lead to more failures, in ways that are harder and harder to diagnose.
Amazon doesn't use EBS itself, right? Isn't EBS something that AWS allowed its customers to nag it into against (what it considers) its better judgement?
Only our master and our slave backup server run on EBS. We aren't as write-oriented, so we can live with some of the limitations of EBS, but we've even considered moving our master MySQL and Mongo servers to ephemeral storage and relying only on our slave backup database server to run on EBS (which we freeze and snapshot often). That server rarely falls far behind in relay updates.
Think of a bridge between high performance disk and tape.
This may be true under Solaris. Since 2.5, Linux has had /proc/diskstats and an iostat that shows the average i/o request latency (await) for a disk, network or otherwise. For EBS it's 40ms or less on a good day. On a bad day it's 500ms or more, if your i/o requests get serviced at all.
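For anyone who wants to see where that await figure comes from, here is a rough Python sketch that derives it from two snapshots of /proc/diskstats. The helper names are made up, and the sample lines are synthetic, chosen so the result matches the 40ms figure quoted above; on a real Linux host you would read /proc/diskstats twice, a few seconds apart.

```python
def parse_diskstats(text: str) -> dict:
    """Parse /proc/diskstats text into per-device I/O counters."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        stats[fields[2]] = {
            "reads": int(fields[3]),      # reads completed
            "read_ms": int(fields[6]),    # milliseconds spent reading
            "writes": int(fields[7]),     # writes completed
            "write_ms": int(fields[10]),  # milliseconds spent writing
        }
    return stats


def await_ms(before: dict, after: dict) -> float:
    """Average milliseconds per completed request between two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios == 0:
        return 0.0  # nothing completed in the interval
    busy = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return busy / ios


if __name__ == "__main__":
    # Synthetic snapshots for a hypothetical device named sdb.
    first = "   8      16 sdb 100 0 800 1000 50 0 400 2000 0 1500 3000"
    second = "   8      16 sdb 150 0 1200 3000 100 0 800 4000 0 2500 7000"
    print(await_ms(parse_diskstats(first)["sdb"], parse_diskstats(second)["sdb"]))
```

Note that a volume whose requests never complete shows no completed I/O in the interval, so this average alone will not reveal a hung device.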
Edit: my point is you can't hide unexpected/unknown events on statistical models; we should know better, coming from CS.
Actually, it was discovered some time ago (http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...) that EBS probably used Red Hat's open-source GNBD: http://sourceware.org/cluster/gnbd/
As Schopenhauer said, every man mistakes the limits of his own vision for the limits of the world, and these are people who've failed to Get It when it comes to distributed storage ever since they tried and failed to make ZFS distributed (leading to the enlistment of the Lustre crew who have also largely failed at the same task). If they can't solve a problem they're arrogant enough to believe nobody can, so they position DAS and SAN as the only possible alternatives.
Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind of distributed storage people should be using for this sort of thing. I've also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly of Sun and now of Joyent, about ZFS FUD.
They are typically available as /dev/sd[bcde]
In CentOS, implementing a RAID-0 block device across the 2 ephemeral disks that are present on an m1.large instance can be done via the following:
mdadm --create /dev/md0 --metadata=1.1 --level=0 --quiet --run -c 256 -n 2 /dev/sdb /dev/sdc
You'll then need to format the block device with your fs of choice. Then mount it from there.
The SAN solutions they migrated to are not ZFS based. Unless I'm mis-remembering (I read this a couple days ago) they were only using ZFS to slice LUNs.
Point is, you're taking pot-shots at ZFS when the main thrust appears to be: "It was hard to make iSCSI reliable. Once we did, by buying expensive storage-vendor backed solutions, we found it wasn't financially compelling."
They're a hosting provider. If it takes a replicated SAN pair (which is the wrong way to go about it BTW, though admittedly it's the way the storage vendors and their "appliance" mentality would have it done) to service just a pair of VM hosts (they're still using Zones right?) then it just didn't make sense money-wise for them. If they planned capacity to provide great performance, they weren't making enough money on the services for what they were selling them for.
That's not an "iSCSI is unreliable" problem. It's not a "networked storage is broken" problem. It's not a "networked storage is slow" problem. It's not even a "ZFS didn't work out" problem.
If you go out and spend major bucks on NetApp, not only are you going to have to deal with all the black-box-appliance BS, but it's going to cost a lot of money. A LOT. And DAS is going to end up cheaper to deploy, maintain, and your margins are going to be a lot higher.
DAS is the right choice for a hosting provider who wants to maximize their profits in a competitive space.
It's not the best choice for performance, availability or flexibility for clients though. So you have to ask yourself what kind of budget you have to work with, and what goals are important to you?
BTW, there's _budget_, and then there's NetApp/EMC budget. Just because you need/want more than DAS can give you doesn't mean you need to tie your boat to an insane Enterprise grade budget.
As for "DAS is the right choice" that's just wrong on many levels. First, people who know storage use "DAS" to refer to both private (e.g. SATA/SAS) and shared (e.g. FC/iSCSI) storage, so please stop misusing the term to make a distinction between the two. Second, I don't actually recommend either. I don't recommend paying enterprise margins for anything, and I don't recommend more than a modicum of private storage for cloud applications where most data ultimately needs to be shared. What I do recommend is distributed storage based on commodity hardware and open-source software. There are plenty of options to choose from, some with all of the scalability and redundancy you could get from their enterprise cousins. Just because some people had some bad experience with iSCSI or DRBD doesn't mean all cost-effective distributed storage solutions are bad and one must submit to the false choice of enterprise NAS vs. (either flavor of) DAS.
In short, open your eyes and read what people wrote instead of assuming this is the NAS vs. DAS fight you're used to.
I don't think it's that I believe in overprovisioning. It's that if data is really that critical, and its availability is critical, then that has to be taken into account during planning.
Everything fails at some point. The Enterprise Storage Vendors would have you believe their stuff doesn't. In practice it's pretty scary when the black box doesn't work as advertised anymore though _after_ you've made it the centerpiece of your operations.
So with those lessons learned, our replacement efforts took into account the level of availability we wanted to achieve.
I did go off on an NFS tangent. Sorry. But this article was about block-storage, which is a different beast from what you describe.
Seeing all networked storage lumped together is like seeing: FastCGI isn't 100% reliable, which is why I hate two-phase-commits.
ephemeral
WHEN your EC2 node disappears (and it will), you will lose everything on that RAID.
That's not a bad thing if you know it'll happen and plan for it, but do be aware of it.
yep. NFS and the like make you more vulnerable to the Fallacies of Distributed Computing.
Technically true, although you don't have to contend with the consistency or partitioning factors in the local disk case -- there's only one copy of the state. This means you can focus on making the availability factor as close to 1.0 as possible.
This may not be the case when you're forced to balance all three CAP factors. I sometimes wonder if a follow on result to CAP will be a "practical" (physical or information theoretic) limit like C x A x P <= 1-h for some constant h, and we'll just have to come to terms with that as computer scientists, as physics had to with dx x dp >= h. This is of course wildly unsubstantiated pessimism.
Also, I would gladly entertain any argument demolishing the "local disks are not subject to CAP" claim I made above by talking about read / write caches as separate copies of the local disk state.
What about FC SANs or iSCSI over a WAN? Are they local or distributed?
Seriously. You tell me. What does that have to do with your rant on ZFS? It could have as well been an LSI controller doing RAID6. Or mdadm. Doesn't matter.
That's the evolved solution they came up with.
The "networked storage is broken" pitch actually comes in with the EMC/NetApp interim solution as well. I don't buy it either, but it's a joke to claim the problem was ZFS on the Zones when the Targets weren't running ZFS.
You're awfully prickly, but I didn't suggest it came down to "Enterprise" NAS vs DAS. I actually think networked storage is here to stay (and that's a good thing).
I have my doubts we'll see a stable, inexpensive (or free) Distributed or Clustered file-system ready to replace traditional solutions anytime soon. I'm happy to see people try though.
You clearly have an axe to grind with ZFS though. In my experience it's been by far more stable than any available Linux FS I've used. Pull the power again and again, replace and resilver all you want. Manage terabytes and don't worry about corruption. I wouldn't trust ext3/4fs for anything I couldn't stand to lose...
PS: http://en.wikipedia.org/wiki/Direct-attached_storage
"People who know storage". I don't see iSCSI on that list. Nor FCoE. DAS (at least according to Wikipedia) explicitly rules out switching. Which is how I've always viewed it.
So you aren't calling ZFS a "crappy solution"? Just the DAS usage?
What is your gripe exactly then? The overblown critique of networked storage? Well we agree on that at least then. I think.
Honestly, with all the "read the fucking article", it's-not-DAS, oh-it-is, CloudFS is way moar better than ZFS, I never said ZFS sucked, "Bryan ZFS Cantrill is a jackass" you've left me absolutely bewildered at what your intended point (if any) actually is?
For the record, my only comment on (free) distributed filesystems (that aren't vendor-locked and actually unusable to me) is that I wouldn't personally trust them with my data. Not until they have the features I need, and then are running out in the wild, widely deployed for a couple years so I'm not a guinea pig.
I'll even throw you a bone: Even just last year ZFS was having major melt-downs when a new inadequately vetted feature was added. A few years ago it wasn't uncommon to face corruption when trying to do fairly routine things managing disks. Bugs can and do happen.
Maybe CloudFS, or Gluster is ready for prime-time, housing terabytes of data reliably and never making a misstep. I just don't think it's smart to bet your business on it. Not at least without a plan B since moving data around isn't an option when you're down and have terabytes you need to get back online.
You mean, in that case tolerance to partition and availability should be perfect.
> Perhaps someday we may be able to say that the odds of enough partitions or machine failures to make the system unavailable are lower than the odds of you getting struck by lightning, at which point you will have for practical purposes defeated the constraints of the CAP theorem.
So this is the really interesting question. All the CAP theorem says is that (C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)? If infinitely close, then we have achieved perfection by the limit, and the CAP theorem is rather pointless. If not, then what is the numeric limit?
As you speculate, maybe the numeric limit on C x A x P is so close to 1.0 that the odds of seeing a consistency, availability, or partitioning problem are much smaller than getting hit by lightning.
Then again, maybe not. Who knows? ;)
To avoid sounding like a total crackpot, here is an interesting paper that explores the physical limits of computation:
No. If a network is never partitioned, you don't need to write algorithms that can tolerate partitions. Therefore consistency and availability are possible.
> So this is the really interesting question. All the CAP theorem says is that (C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)? If infinitely close, then we have achieved perfection by the limit, and the CAP theorem is rather pointless. If not, then what is the numeric limit?
I think you have misunderstood the theorem (at least, if my bachelor-degree-level understanding is correct). C, A, and P are not variables you can multiply together or perform mathematical operations on. They are more like booleans. "Is the web service consistent (are requests made against it atomically successful or unsuccessful)?" "Is the web service available (will all requests to it terminate)?" "Is the web service partition-tolerant (will the other properties still hold if some nodes in the system cannot communicate with others)?" These questions cannot be "0.5 yes". They are either all-the-way-yes or all-the-way-no.
> . . . and the CAP theorem is rather pointless
Not really. It is pointful for networks that experience partitions. It just doesn't apply to reliable networks. It also sort-of doesn't apply when an unreliable network is acting reliably, with the caveat that since it is not possible to tell in advance when a network will stop behaving reliably, you still have to choose between these three properties when writing your algorithms for when the network behaves badly.
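The either/or nature of that choice can be shown with a toy model. This is a sketch under loud assumptions: a two-replica register invented for illustration, with a hand-set partition flag, and it is not how any real system discussed in this thread works. In "CP" mode the register refuses writes while partitioned (giving up availability); in "AP" mode it accepts them on one side only (giving up consistency).

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self):
        self.value = None


class TwoNodeRegister:
    """Toy replicated register with two replicas, 'a' and 'b'."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # set by hand to simulate a network split

    def write(self, node: str, value) -> bool:
        """Write via replica `node`; return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False  # refuse the request to keep replicas identical
            getattr(self, node).value = value  # accept locally; replicas diverge
            return True
        self.a.value = self.b.value = value  # healthy network: update both
        return True

    def consistent(self) -> bool:
        return self.a.value == self.b.value


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        reg = TwoNodeRegister(mode)
        reg.write("a", 1)
        reg.partitioned = True
        accepted = reg.write("a", 2)
        print(mode, "accepted:", accepted, "consistent:", reg.consistent())
```

While the network is healthy both modes behave identically, which is the caveat above: the choice only shows up once the partition happens, but it has to be made when the algorithm is written.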
Right, but I wasn't restating CAP, just wondering about a follow on to CAP that considers the probability of remaining consistent, the probability of remaining available, and the probability of no failures due to network partitions in physical terms.
Is this not an interesting thing to consider? What if someone proves a hard limit on the product of these probabilities in some physical computation context? The CAP theorem is absolutely fascinating to me, especially if it has something real to say about the systems we can build in the future. The future looks even more distributed.
> It is pointful for networks that experience partitions. It just doesn't apply to reliable networks.
Is there such a thing as a "reliable" network when thousands or millions of computational nodes are involved? Are the routers and switches which connect such a network 100% available? If an amplification attack saturates some network segment with noise, what then?
As programmers, we desperately want things to work, and it's easy to greet something like CAP with flat out denial. I know I'm always fighting it. "It will never fail." No, it can and will fail.
Maybe what you mean is the probability of whichever of C, A, or P you gave up actually becoming a problem? But I cannot imagine a physical law of the form you are referring to applying uniformly to these disparate properties. I wouldn't even know how to formulate it for consistency. For availability and partition tolerance it would just be, "Requests to this service will (availability: hang forever/partition-tolerance: return with errors) at a rate exactly equal to the probability of network failures."
With regards to your last point, there are no reliable networks, at least where I work. That doesn't mean there won't be.
That's weird because I've read the proof and they speak only of boolean instances of C, A, and P in the proof. They give no examples of systems where any of the three variables have values other than zero or one.