Silent data corruptions at scale (2021)(arxiv.org) |
Silent data corruptions at scale (2021)(arxiv.org) |
I had memory stability issues which would immediatly show under Prime95 (less than 1 minute) but pass hours of Linpack.
I guess I'd probably be okay with that if the only thing I ever used the computer for was gaming.
Meta quickly detects silent data corruptions at scale - https://news.ycombinator.com/item?id=30905636 - April 2022 (95 comments)
Silent Data Corruptions at Scale - https://news.ycombinator.com/item?id=27484866 - June 2021 (12 comments)
Compressing data, with the increased information density & greater number of CPU instructions involved, seems obviously to increase the exposure to corruption/ bitflips.
What I did wonder was why compress the filesize as an exponent? One would imagine that representing as a floating-point exponent would take lots of cycles, pretty much as many bits, and have nasty precision inaccuracies at larger sizes.
That's not a technical error, they mean SRAM in the CPU itself. You're right about gcj but that's kind of a moot point when investigating some reproducible CPU bug like this. The paper mentions all the acrobatics they went through when trying to find the root cause but if gcj would have been practical then it also would have been immediately clear if the gcj output reproduced the error or not. If it didn't reproduce, no big deal, try another approach. You might be right about it being easier to root cause with gdb directly but I'm not so sure. Starting out, you have no idea which instructions under what state are triggering the issue so you'd be looking for a needle in a haystack. A crashdump or gdb doesn't let you bisect that haystack so good luck finding your needle.
It all depends how good you are with x64 assembly. If you are good enough, you can easily deduce what the instructions at the location do, and can potentially simply copy-paste into an asm file, compile it and check result. Would be much faster to me.
Bluntly speaking, people who are not familiar with low-level debugging make an honest and succesful attempt to investigate a low-level issue. A seasoned kernel developer or reverse engineer would have just used gdb straight away.
None of this is free, and there are both hardware and software solutions for mitigating various categories of risk. It is explicitly modeled as an economics problem i.e. how does the cost of not mitigating a risk, if it materializes, compare to the cost of minimizing or eliminating it. In many cases, the optimal solution is unintuitive, such as computing everything twice or thrice and comparing the results rather than using error correction.
As for adding ECC within the CPU, I think that would require you to essentially have a second CPU in parallel to compare against.
2 will tell you if they diverge, but you lose both if they do. 3 let's you retain 2 in operation if one does diverge.
From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.
Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.
I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.
We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.
I think you should take another look at the author list. Chris Mason counts as a seasoned kernel developer in my book. Either way I think you're missing the point. Yes gcj would be different, but there's a decent chance it could hand you a binary that reproduces the issue that you can bisect to the root cause from there. It's one thing to run it through gcj and see if it reproduces, rewriting it in C is a ton of work compared to gcj for something that might not pan out.
You should probably invite these people themselves to the discussion instead of speaking on their behalf. Not productive.
The other one that freaks me out is miscompilation by the compiler and JITs in the data path of an application. Like we’re using these machines to process hundreds of millions of transactions and trillions of dollars - how much are these silent mistakes costing us?
Even with 3 chips, if one is permanently wrong you are then left with only 2 working ones so no redundancy is left for further degradation.
> just compute 3x
That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.
Again, cool to work on at Google. Not sure anybody else cares. If you care (finance) you fix it in hardware (system z).
Yes but you need to catch it first to know what to take out of production.
>That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.
That's kind of my point. Either it's a heisen-bug and you never see those results again when you repeat the original program or it's permanently broken and you need to swap out the sketchy CPU. If you only care about the first case then you only need one core. If you care about the second case then you need 3 if you want to come up with an accurate result instead of just determining that one of them is faulty. It's like that old adage about clocks on ships. Either take one clock or take three, never two.
> 2 will tell you if they diverge, but you lose both if they do. 3 let's you retain 2 in operation if one does diverge.
If you care about resilience then you either need to settle with one and accept that you can't catch the class of errors that are persistent or go with three if you actually need resilience to those failures as well. If you don't need that kind of resilience like an aerospace application would need then you're probably better off with catching this at a higher layer in the overall distributed systems design. Rather than trying to make a resilient and perfectly accurate server, design your service to be resilient to hardware faults and stack checksums on checksums so you can catch errors (whether HW or software) where some invariant is violated. Meta also has a paper on their "Tectonic filesystem" where there's a checksum of every 4K chunk fragment, a checksum of the whole chunk, and a checksum of the erasure encoded block constructed out of the chunks. Once you add in yet another layer of replication above this then even when some machine is computing corrupt checksums or inconsistent checksums where the checksum and the data are corrupt then you can still catch it and you have a separate copy to avoid data loss.
What's worrying is when systems like these get used in real-time life-and-death situations, and there's basically no reversibility because that would imply dead people returning to life. For example the code used for stuff like outer space exploration, sure that right now we can add lots and lots of redundancies and check-ups in the software being used in that domain because the money is there to be spent and we still don't have that many people out there in space. But what will happen when we'll think of hosting hundreds, even thousands of people inside a big orbital station? How will we be able to make sure that all the safety-related code for that very big structure (certainly much bigger than we have now in space) doesn't cause the whole thing to go kaboom based on an unknown-unknown software error?
And leaving aside scenarios that are not there yet, right now we've started using software more and more when it comes to warfare (for example for battle simulations based on which real-life decisions are taken), what will happen to the lives of soldiers whose conduct in war has been lead by faulty software?
So many soldiers on both sides died because of really dumb commander decisions, missing kit, political needs, that worrying about CPU errors is truly way way down the list.
As per the sources, here's this one in The Economist [1] from September 2023, just as it had become obvious that the counter-offensive had fizzled out:
> American and British officials worked closely with Ukraine in the months before it launched its counter-offensive in June. They gave intelligence and advice, conducted detailed war games to simulate how different attacks might play out
And another one from earlier on [2], in July 2023, when things were stil undecided:
> Ukraine’s allies had spent months conducting wargames and simulations to predict how an assault might unfold.
There was an article about a giant real 3D map in a room with all the commanders there discussing what each one has to do, contingencies, ...
The reason for the counter-offensive failure were multiple and complex, good or bad software would not have changed the result significantly.