Love Your Bugs (akaptur.com)
https://www.youtube.com/watch?v=eSaFVX4izsQ
and if you want to get into the weeds with any of them these are largely published publicly on aphyr.com; see for example:
There's a lot of "oh, here's how this dirty read from the underlying system became a much bigger bug in the system we built on top of it!" but what I like most about Kyle is that he is generally pretty great about having an attitude of "these are actually really hard problems and it's not surprising that there are implementation bugs when you let me mess with the clocks and cut off nodes and whatnot."
The likelihood that you’ve seen multiple flips on the same piece of data .. sounds like a typical threading bug.
Google's research [1] finds a DRAM error rate of "25,000 to 70,000 errors per billion device hours per Mbit" on hardware in "modern compute clusters." If there are 100 million Dropbox clients out there, Dropbox clients should encounter 2,500 to 7,000 errors per Mbit per hour, though factoring in the "low-end or old hardware" that many Dropbox clients are running on, the error rate is probably somewhat higher. For the sake of making the math simple, call it 10k errors per Mbit per hour, or 1 error per 100 bit-hours. So a given bit should flip on some user's machine on the order of once a week. That seems pretty firmly in the range of "sometimes we see these weird errors that we don't really understand," especially if you multiply by the same error potentially coming from different parts of the program (so "the same piece of data" is really "a few pieces of data that get collapsed together for purposes of analysis").
Your intuition that a "typical threading bug" is much more likely than a random bit flip is spot on, but that actually works in favor of the "random bit flip" thesis. On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.
[1]
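For anyone who wants to check the back-of-envelope estimate above, here is a rough sketch of the arithmetic in Python. Every input is a figure quoted in the comment (the 100 million client count and the rounded 10k rate are the commenter's assumptions, not measurements), the function names are made up for illustration, and 1 Mbit is taken to be 10^6 bits.

```python
# Back-of-envelope check of the bit-flip estimate. All inputs are the
# figures quoted in the comment above, not measurements.

GOOGLE_RATE_RANGE = (25_000, 70_000)  # errors per billion device-hours per Mbit
CLIENTS = 100_000_000                 # assumed number of Dropbox clients
ROUNDED_RATE = 10_000                 # errors per Mbit per hour, rounded up for old hardware
BITS_PER_MBIT = 1_000_000             # assumption: 1 Mbit = 10**6 bits


def fleet_errors_per_mbit_hour(rate_per_billion_device_hours, clients):
    """Errors per Mbit per hour, summed over every client in the fleet."""
    return rate_per_billion_device_hours * clients / 1_000_000_000


def hours_between_flips_of_one_bit(errors_per_mbit_hour):
    """Mean hours between flips of one particular bit, anywhere in the fleet."""
    return BITS_PER_MBIT / errors_per_mbit_hour


low, high = (fleet_errors_per_mbit_hour(r, CLIENTS) for r in GOOGLE_RATE_RANGE)
hours = hours_between_flips_of_one_bit(ROUNDED_RATE)
days = hours / 24

print(f"fleet-wide: {low:.0f} to {high:.0f} errors per Mbit per hour")
print(f"one given bit flips about every {hours:.0f} hours ({days:.1f} days)")
```

With the rounded rate this comes out to one flip of a given bit roughly every 100 hours, which is where "on the order of once a week" comes from.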
Threading bugs could have any kind of frequency, though. The ones that are infrequent are the ones that tend to make it into production...
I've seen stranger things than this, including a cluster of servers in which 3 of the 4 machines had frequent bit flips, and when they were all replaced in response, 2 of the 4 new machines also had frequent bit flips. We diagnosed based on this same kind of distributional evidence (in our case, the bit flips occurred in certain address hyperplanes, consistent on each machine, like having bit 5 flip at an address like 0x???6?3B0 every time). The customer was, as you can imagine, pretty skeptical that this was not a software bug at that point. But all of the machines, when booted into memtest86 or whatever it's called, quickly found errors with the predicted physical address pattern! Dropbox doesn't have the luxury of tracking down the customer machines and testing them.
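The kind of distributional evidence described above can be sketched in a few lines of Python. This is only an illustration, not the tooling that was actually used: the helper names, the address mask, and the sample data are all hypothetical. The idea is to XOR the expected and observed values, keep only the cases where exactly one bit differs, and tally them by bit position and by the fixed part of the address.

```python
from collections import Counter


def flipped_bit(expected: int, observed: int):
    """Return the index of the flipped bit if exactly one bit differs, else None."""
    diff = expected ^ observed
    # A power of two has exactly one bit set: diff & (diff - 1) clears that bit.
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None


def tally_flips(samples, address_mask):
    """Count single-bit flips keyed by (masked address, bit index).

    samples is an iterable of (address, expected, observed) tuples.
    """
    counts = Counter()
    for address, expected, observed in samples:
        bit = flipped_bit(expected, observed)
        if bit is not None:
            counts[(address & address_mask, bit)] += 1
    return counts


# Hypothetical mask keeping the fixed hex digits of a pattern like 0x???6?3B0.
MASK = 0x000F0FFF

# Hypothetical corruption reports: (address, expected byte, observed byte).
samples = [
    (0x1A2673B0, 0x00, 0x20),  # bit 5 flipped
    (0x0FF6A3B0, 0xFF, 0xDF),  # bit 5 flipped
    (0x00001000, 0x41, 0x7E),  # several bits differ, so not a single flip
]

for (masked_address, bit), count in tally_flips(samples, MASK).items():
    print(f"address pattern {masked_address:#010x}, bit {bit}: {count} flips")
```

If the flips pile up on one bit and one address pattern, as in the story above, that points at hardware. A software bug would more plausibly produce multi-bit differences or no consistent address pattern.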
See also "bitsquatting": https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...
If it doesn't have ECC memory, it's an approximate computing device.
Second, these strings are most likely concatenated on the fly right before sending them over to the server. So it wouldn't be a disk bit flip, it'd be in-memory, and for the life span of that particular string.
Third, if these were that frequent, then it means the _rest_ of the string, i.e. the actual hashes, would be wrong, and that would seriously impair the service, wouldn't it?
And yet some people still claim that they don’t need ECC memory.
By the time the DDoS was in effect, the corrupted logs had been deleted by the client. They would now always succeed (even with the old server code, or old client code) until they got a new corrupted log.
This is the equivalent of talking about car engines, and how some can reach higher speeds than others, when at the end of the day you can probably reach your destination even with an old engine that can only drive and accelerate relatively slowly. It didn't magically improve/grow so you can reach your destination. You reach it because you kept the car moving.
https://henrikwarne.com/2012/10/21/4-reasons-why-bugs-are-go...