Silent data corruptions at scale (2021) (arxiv.org)

84 points by losfair 2 years ago | 39 comments

Very interesting topic, but rather low on detail. I really wanted to see what those 60 lines of Asm that allegedly show a faulty CPU instruction were, and I'm also surprised that it wasn't intermittent; in my experience, CPU problems usually are intermittent and heavily dependent upon prior state, and manually stepping through with a debugger has never shown the "1+1=3" type of situation they claim. That said, I wonder if LINPACK'ing would've found it, as that is known to be a very powerful stress test with divisive opinions among the overclocking community; some, including me, claim that a system can never be considered stable if it fails LINPACK, since that is essentially showing intermittent "1+1=3" behaviour, while others are fine with "occasional" discrepancies in its output since the system otherwise appears to be stable.

jorticka 2 years ago | |

Like all stress tests, linpack will find some errors, but not all.

I had memory stability issues which would immediately show under Prime95 (less than 1 minute) but pass hours of Linpack.

sirlancer 2 years ago | | |

Prime95 is my gold standard for CPU and memory testing. Everything from desktops to HPC and clustered filesystems gets a 24-hour “blend” of tests. If that passes without any instability or bit flips, then we’re ready for production.
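The principle behind this kind of testing can be sketched in a few lines: a deterministic workload must give bit-identical output on every run, so any difference points at the hardware. This is a toy illustration only, not how Prime95 or LINPACK are implemented; the function names, the workload mix, and the step count are all made up for the example.

```python
import hashlib
import struct


def workload(seed: int, steps: int = 20_000) -> bytes:
    """Deterministic mix of integer and floating-point work; returns a digest."""
    h = hashlib.sha256()
    x = seed
    acc = 0.0
    for i in range(1, steps + 1):
        # 64-bit linear congruential step, used here only as integer busywork.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        # Floating-point work on a data-dependent operand in [0, 1).
        acc += ((x >> 11) / float(1 << 53)) ** 0.5 / i
        # Fold every intermediate value into the digest so any deviation shows up.
        h.update(struct.pack("<Qd", x, acc))
    return h.digest()


def find_mismatches(seed: int, runs: int) -> list[int]:
    """Run the workload `runs` times; return indexes of runs whose digest differs from run 0."""
    reference = workload(seed)
    return [n for n in range(1, runs) if workload(seed) != reference]
```

On healthy hardware `find_mismatches(42, 3)` returns an empty list. A check like this only catches intermittent faults; a core that is consistently wrong for a given input, as described in the paper, would produce the same wrong digest every time and would need a known-good reference computed elsewhere.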

thfuran 2 years ago | |

>while others are fine with "occasional" discrepancies

I guess I'd probably be okay with that if the only thing I ever used the computer for was gaming.

dang 2 years ago |

Meta quickly detects silent data corruptions at scale - https://news.ycombinator.com/item?id=30905636 - April 2022 (95 comments)

Silent Data Corruptions at Scale - https://news.ycombinator.com/item?id=27484866 - June 2021 (12 comments)

dataflow 2 years ago |

Google also had a "Cores That Don't Count" paper on so-called "mercurial cores" https://news.ycombinator.com/item?id=27378624 as well as a presentation https://www.youtube.com/watch?v=QMF3rqhjYuM

ekelsen 2 years ago |

I wrote an article about these affecting LLM training at https://www.adept.ai/blog/sherlock-sdc

walterbell 2 years ago | |

Thanks, does your blog have a working RSS feed?

opisthenar84 2 years ago |

Might be a noob question but for truly important data, couldn't SDCs be detected by using ECC everywhere?
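A toy model of why check bits on stored data do not cover this case: the check is computed from whatever value gets written, so a result that was already wrong when it left the core is stored with a perfectly valid check. The sketch below uses a single parity bit as a stand-in for real ECC (which uses stronger codes), and `faulty_add` is a hypothetical defective operation invented for the example.

```python
def parity(value: int) -> int:
    """Even-parity bit over the bits of a non-negative integer."""
    return bin(value).count("1") & 1


def store(value: int) -> tuple[int, int]:
    """Model a protected memory word: the value plus a check bit computed at write time."""
    return value, parity(value)


def check(word: tuple[int, int]) -> bool:
    """True if the stored value still matches its check bit."""
    value, check_bit = word
    return parity(value) == check_bit


# Case 1: a bit flips while the data is at rest. The check catches it.
word = store(156)
flipped = (word[0] ^ 0b100, word[1])
storage_error_detected = not check(flipped)


# Case 2: a hypothetical faulty core computes the wrong answer, then stores it.
def faulty_add(a: int, b: int) -> int:
    return a + b + 1  # deliberately wrong; stands in for a defective ALU


# The check bit is generated from the wrong value, so the word looks valid.
bad_word = store(faulty_add(1, 1))
compute_error_detected = not check(bad_word)
```

Here `storage_error_detected` is `True` and `compute_error_detected` is `False`: the protection covers the bits in storage, not the arithmetic that produced them.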

twhitmore 2 years ago |

Interesting. The corruption was in a math.pow() calculation, representing a compressed filesize prior to a file decompression step.

Compressing data, with the increased information density & greater number of CPU instructions involved, seems obviously to increase the exposure to corruption/bitflips.

What I did wonder was why compress the filesize as an exponent? One would imagine that representing as a floating-point exponent would take lots of cycles, pretty much as many bits, and have nasty precision inaccuracies at larger sizes.
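One way an application could guard a calculation like this is to compute it twice by different routes and compare. A minimal sketch, assuming the value of interest is a truncated power of a rational base: the floating-point route uses `math.pow`, and the second route uses exact rational arithmetic, which exercises integer multiplies instead. The function names, the tolerance, and the sample inputs are illustrative and not taken from the paper's code.

```python
import math
from fractions import Fraction


def int_pow_float(base: float, exponent: int) -> int:
    """Truncated floating-point power, the kind of computation described above."""
    return int(math.pow(base, exponent))


def int_pow_exact(base: Fraction, exponent: int) -> int:
    """Same quantity via exact rational arithmetic (no floating point involved)."""
    return math.trunc(base ** exponent)


def cross_check(numerator: int, denominator: int, exponent: int, tolerance: int = 1) -> bool:
    """True if the two routes agree to within `tolerance` after truncation."""
    fast = int_pow_float(numerator / denominator, exponent)
    exact = int_pow_exact(Fraction(numerator, denominator), exponent)
    # A small tolerance absorbs ordinary rounding near integer boundaries;
    # a result that is off by more than that deserves suspicion.
    return abs(fast - exact) <= tolerance
```

For example, `cross_check(11, 10, 53)` compares both routes for 1.1 to the 53rd power. The exact route is much slower, so this suits spot checks rather than hot paths, and it also shows the precision concern raised above: the floating-point result drifts from the exact one as the exponent grows.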

SomeoneFromCA 2 years ago |

Interesting paper, but it has some technical errors. First of all, they keep mentioning SRAM+ECC instead of DRAM+ECC. Second, you cannot use gcj to inspect the assembly code generated for a Java method, as it will be completely different from the code generated by Hotspot. Third, you do not need all those acrobatics to get a disassembly of the method: you could just add an infinite loop to the code, attach gdb to the JVM process, and inspect the code or dump the core.

MertsA 2 years ago | |

Disclaimer: I work at Meta and I know a couple of the authors of the paper but my work is completely unrelated to the subject of the paper.

That's not a technical error; they mean SRAM in the CPU itself. You're right about gcj, but that's kind of a moot point when investigating some reproducible CPU bug like this. The paper mentions all the acrobatics they went through when trying to find the root cause, but if gcj had been practical then it also would have been immediately clear whether the gcj output reproduced the error or not. If it didn't reproduce, no big deal, try another approach. You might be right about it being easier to root cause with gdb directly, but I'm not so sure. Starting out, you have no idea which instructions under what state are triggering the issue, so you'd be looking for a needle in a haystack. A crashdump or gdb doesn't let you bisect that haystack, so good luck finding your needle.

SomeoneFromCA 2 years ago | | |

GCJ's implementation could be so vastly different from Hotspot's that you might as well rewrite it in C and check whether it fails or not. ChatGPT would generate a test case within a minute.

It all depends on how good you are with x64 assembly. If you are good enough, you can easily deduce what the instructions at the location do, and can potentially just copy-paste them into an asm file, assemble it, and check the result. That would be much faster for me.

Bluntly speaking, people who are not familiar with low-level debugging made an honest and successful attempt to investigate a low-level issue. A seasoned kernel developer or reverse engineer would have just used gdb straight away.