Distributed Systems Shibboleths(jolynch.github.io) |
Distributed Systems Shibboleths(jolynch.github.io) |
> Every component is crash-only
I was part of the team that developed a distributed, five-9's control system for an industry where downtime costs millions per minute and comes with a federal investigation if long enough. On top of that, the industry is made up of competitors that explicitly distrust each other, so all components had to be truly distributed, with no central coordination for anything.
Given the requirements we decided to explicitly adopt a crash-only approach. Between idempotent operations, horizontal scaling, and fast restart times, we could make failing components not impact SLAs (and we had testing to ensure it).
Once it gets out into the field (which because of how risk adverse this industry is, is measured in years), it turns out they really did not like software crashing. They interpreted crashing as bad quality, and no amount of "we do it on purpose to ensure correctness" was going to make them happy.
The solution here is to rebrand it with some vague euphemism:
“Ah yes the component underwent a state calibration”
Jokes aside, there is even a body of research papers around this subject, if you need some backing.
> The main advantage of distributed transactions is that they make distributed systems look less distributed by choosing CP, but that inherently trades off availability!
This is true, but I suspect that its slightly missing the important thing about transactions. A transaction is an operation that takes the database from one Consistent (ACID "C", you can think about it as "legal under the business logic") state to another Consistent state. Linearizability (CAP "C") isn't enough to do that, because often changes in databases require "take from Bob and give to Alice", or "check Bob and Alice's balance and add the order", neither of which fit well into Linearizability's single-key definition of the world. Allowing developers to think about a stream of operations that moves the databases from one legal state to another is super powerful. The whole point is that it provides an abstraction that hides concurrency (ACID "I") and partial failure (ACID "A"). Saving developers from reasoning about those is a big win!
> I should also note that while Distributed Transactions might be a useful tool in building idempotency, simply wrapping a non idempotent operation (e.g. “add 1 to X”) in a transaction does not make it idempotent.
The OP is right that this isn't a panacea, especially where those transactions aren't idempotent. But transactions are a mechanism to implement idempotence ("insert order number 10 if it isn't there already"), and idempotence and ACID "C" can be really hard to achieve without transactions (or at least "I" and "A").
Transactions, CRDTs, and the CALM theorem are linked too. You can definitely have transactions in systems that aren't CAP "C" consistent, and still have them do legal things. The CALM theorem lays out one way to think about those, and CRDTs are a kind of object-oriented embodiment of that theory.
I do think CRDTs or idempotency/fencing tokens are also a valuable way to reason about state transitions, and they can provide much lower latency in a global distributed system.
You're right—it's super powerful.
We used this "RSM intuition" to test TigerBeetle's strict serializability, by verifying state transitions the instant they happen, taking advantage of the test simulator's knowledge of inflight client requests, instead of trying to piece everything together and verify strict serializability after the fact.
Here's TB's state checker in 49 lines of Zig:
https://github.com/coilhq/tigerbeetle/blob/477d6df366e2c10fa...
They're powerful because they can "load the dice" and so make the distributed system more intuitive for humans to reason about, more resilient to real world faults, and do all this with more performance.
For example, Barbara Liskov and James Cowling's deterministic view change from VSR [1][2], which isn't plagued by the latency issues of RAFT's randomized dueling leader problem. VSR's deterministic view change can react to a failed primary much quicker than RAFT since heartbeat timeouts don't require the randomized "padding" that they do in RAFT, commence the leader election, and also ensure that the leader election succeeds without a split vote.
Determinism makes all this possible.
Deterministic testing [3][4] is also your best friend when it comes to testing distributed systems.
[1] An introductory talk on VSR and it's deterministic view change — https://www.youtube.com/watch?v=Wii1LX_ltIs
[2] James Cowling on determinism, working with Barbara Liskov — https://www.youtube.com/watch?v=ps106zjmjhw
[3] FoundationDB are pioneers of deterministic testing — https://www.youtube.com/watch?v=OJb8A6h9jQQ
[4] TigerBeetle's deterministic simulation tests — https://github.com/coilhq/tigerbeetle#simulation-tests
Huh? How could a nondeterministic algorithm be predictable? Do you mean algorithms with a nondeterministic but overall irrelevant component ("pick a random element from this set")?
The section on distributed transactions could have a little more nuance. Particularly the example about the counter where I suspect any system offering transactions also has a CAS operation. Additionally the benefit of a transaction system is that you can offer bounded counters where as an AP or “strong” EC (CRDTs) system cannot.
For example, you could place a unique identifier on every count event and then roll
up those deltas in the background and transactionally advance a summary, either
preventing ingestion after some time delay or handling recounting.
Certainly transactions can help, but you still have to data model correctly for failure.For example TCP implements "exactly once processing" by your definition but you probably still want Stripe to include idempotency keys in their charge API so you don't pay twice.
Which I guess is to say: what difference is there between a lease with an infinite timeout unless manually returned, and a "lock"?
Certainly the system deadlocks under partition but I'm not sure why that makes this "impossible".
Follow the guidelines in this post and you'll indeed result in (more) robust systems. Great writeup.
And here are a few more positive phrases: "total order", "committed/uncommitted log", "replicated state machine", "logical timestamp", "the network is not homogeneous", "bandwidth is not infinite", "tail latency tolerance", "fault model", "message passing", "recovery", "cascading failure", "metastable", "strict serializability", and "stable storage"!
Surprisingly, at least to me, it's jarring to hear the phrase "CAP" because the field is far more nuanced. That "C" alone has so many different flavors! Far more interesting to talk about the FLP result.
Watch out also for "RPC" because that's not isomorphic with optimal consensus protocols, where messages follow multi-path routing (A->C->B->A). Granted, RAFT uses the term "RPC" (A->B->A) but that's more a concession to accessibility and not the broadest way to think of distributed systems, or necessarily the best way to implement them. Message passing is simpler and more realistic as it helps you to see and think about the network/process fault model more clearly.
Distributed testing techniques are also moving towards autonomous deterministic testing, as pioneered by FoundationDB, where the database itself is the test simulator—these tend to get into more interesting state spaces that can also be replayed instantly from a seed, compared to external test harnesses that run for hours in real time and that can't reproduce distributed bugs deterministically in exactly the same way every time.
Come on, you know that's not what's going to happen. If they notice at all, they'll just incorporate the magic phrases into their BS so you have to hunt harder for a real signal.
I am not necessarily thinking just tests running as code. Although that would be nice.
Less /s, I'd be more worried if I thought the salescritters were listening, but they're mostly not.
[1] https://ignite.apache.org/docs/latest/key-value-api/continuo...
The second (lock) is actually a lease fwict, and writing code in the locked body that assumes it will be truly mutually exclusive is pretty dangerous (see the linked post from Martin [1] for why).
[1] https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
You're a madman, but I think you're right.
Thanks for the locking link.
If you want to see a ton of lies, go and start reading through all the Jepsen posts tearing down those dubious claims.
A few more positive shiboleths.
One would be eventual consistency.
Another would be discussing write paths vs read paths (or patterns) and recognizing that those can be decoupled (or a mention of CQRS).
In the vein of the TFA, this would be a negative shibboleth (which just goes to show how being pedantic in certain contexts is just silly). The reason being that eventual consistency has no guarantee on what "eventual" means. If your replicas converge ten years from when a change is made, you can (correctly) claim to have eventual consistency.
Eventual consistency can be a rational choice or a quasi-necessity, but in practice it's a shortcut for careless optimists; I wouldn't expect them to analyze how inconsistent states can go wrong, whether they are an acceptable risk, and how to deal with abnormal situations.
[1] https://codahale.com/you-cant-sacrifice-partition-tolerance/
https://www.sefaria.org/Judges.12.6?ven=Tanakh:_The_Holy_Scr...
I prefer the term retryable to idempotent. If there's a failure in the first call, to be truly idempotent it should fail on the second.
Retryable on the other hand is easier to argue about. Important thing is not the response but the end state of the system.
Alternatively, idempotency applies to successful operations, orthogonal from error cases.
Retryability doesn't help in the case of the (bad) operation "+1" which is not idempotent.
I would argue that "infinite timeout" is another negative shibboleth.
every operation in a distributed system has some duration after which you can be 99.9% confident (or 99.9999%, or whatever threshold you want to pick) that it was lost to the void and will never return a result.
in a robust distributed system, you want to pick a reasonable timeout value, and then take appropriate action in response to the timeout. typically this is retrying the operation, bubbling up a failure message to a higher level, or some combination of the two (retry a few times, fail if all the retries fail).
an infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in-transit and never returning either success or failure".
in my experience, infinite timeouts are often the cause of "hmm, this thing is up and running but seems 'stuck' and not making any progress, let me try manually restarting this service...OK, that seems to have recovered it" bugs and production alerts.
Zombie processes (dependent on some lock that will never clear) shouldn't be possible. At the very least abort (kill -9) should always be possible.
Failure should always be an option; it should be the default assumption. All other order must be wrested from that chaos.
The long answer is to peel this onion for yourself and see where it leads. It’s a lot of fun.
The idea of fate sharing is very general and useful: you can, for example, introduce reconnectable sessions, and attach shared state to those, which gets you transport-independence and the ability to recover from transport failure.
[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.
Example: size of a set = 1 + (remove an element from the set and then compute size of reduced set, or 0 if set is empty)
The kernel retains minimal state about them because the system has made a promise to report that the process exited to its parent process, and the parent process hasn't gotten around to asking for that yet.
(Don't confuse zombies with uninterruptible I/O sleep, or buggy kernel workers.)
Stuck mounts have a half solution (lazy unmounts) but even _that_ interface really also needs a timeout value after which operations on the target should be assumed to fail rather than return correctly.
Offhand, I wonder if there's currently or previously been a DoS attack based on defunct uninterruptible sleep. Theoretically a system could be exhausted of PIDs which could lead to nasty issues.
Once again, that cannot be done until the parent process consumes the exit status. That's what the zombie is there for. Zombies don't take up much space.
> Stuck mounts have a half solution (lazy unmounts) but even _that_ interface really also needs a timeout value after which operations on the target should be assumed to fail rather than return correctly.
These days most NFS etc mounts are "soft mounts", that is operations will eventually time out.
Lazy unmount doesn't really apply here, it makes the mountpoint disappear from the global namespace, but all existing open files remain untouched, and the mount lives as long as anything is still using it; it just removes the "entry point" to the mount.
On today's Linux, it's up to each filesystem to provide abort/timeout mechanism. For timeouts, this is the right design, as demonstrated by macOS complications with FUSE. I do wish there was a common way to make things abort.
There was a patch in circulation a long time ago, that could seamlessly switch all open FDs of any given mountpoint into a whole different filesystem named badfs. badfs would just return an error on any operation. As far as I know, that patch never got merged, probably because nobody ever got it working 100%.
That kind of a DoS would require a local attacker, and then the victim to access a mountpoint owned by the attacker. Using FUSE, you could get a lot of processes hanging like that, for sure. I guess you could trap a mail delivery agent, if you still had a system where mail was delivered to users' home directories.
However, forcibly aborting any FUSE mount is a single `echo 1 >/sys/fs/fuse/connection/NNNN/abort`, the only challenge is finding the right ID. (See https://github.com/bazil/fuse/blob/fb710f7dfd05053a3bc9516dd...)
A distributed system might be, for example, the ACH system: all batch, all store-and-forward, no replies flowing down the line, only processed message response batches dropped off “eventually” in outboxes.
Or, for another example: any workload manager, whether on an HPC cluster or your local Kubernetes node. No synchronous workload execution; just async batch scheduler enqueue with later best-effort status querying.
Saying to people "a system somewhere is blocked for possibly forever, so too bad we cannot do your thing" is our reality. Our system exist for their impact on people.
Otherwise they are art... which also exist for its impact on people.
Think of how in e.g. an IaaS control plane, when you delete a VM, it may take an arbitrarily-long time before you can create another VM with the same ID. (Maybe forever!) But you can always create a VM with a different ID, that otherwise fulfills all the same purposes (e.g. has the old instance's IP, FQDN, etc.) The old ID essentially has a distributed lock on its use, with an unbounded release time — and that's perfectly fine for the use-case.
For an example of fail-stalled being not only practical but preferred, consider tag-out locking systems (exclusive-access locks used to prevent machines from being turned on while maintenance is being performed on them.) If there was a digital lock of that type, you wouldn't want to ever automatically time it out. A human put that lock there, to keep them alive. They'll take it off when they're done. If you really suspect someone forgot to unlock the tag-out lock, you can always go and check with the lock's acquirer. But if you can't get in contact with them, you can't know that they don't still have their hands up in the gears of the machine. And in this case, failing to auto-restart the assembly line (until the "partition" is over and you can just ask the maintenance worker why they're still holding the lock) is worth much less than said maintenance worker's life.