Meta quickly detects silent data corruptions at scale

Meta quickly detects silent data corruptions at scale (engineering.fb.com)

167 points by tekkertje 4 years ago | 95 comments

kache_ 4 years ago |

The scale at which Meta operates really boggles my mind. I work with an ex-Facebook guy who was on the infra side of things, and the numbers he told me... I couldn't even imagine. And I'm working on the order of magnitude of 100m/h, but still, a completely different set of challenges.

silisili 4 years ago | |

Same. I remember asking one guy at FB the process to ask for a new server. He said he can't even open a request for anything less than a thousand boxes. The largest fleet I'd worked on at that point was 12... different worlds.

lclarkmichalek 4 years ago | | |

I mean, that's not true in the general case. That'd be incredibly wasteful.

(Work at Meta, mostly on capacity)

fleddr 4 years ago | |

I once read that Facebook was opening 2 or 3 massive new data-centers in the US for the purpose of hosting stale content.

You may have posted a photo 7 years ago, and statistics show that basically nobody ever revisits it. However, in case you do, it needs to be there. So these enormous buildings do basically nothing, but still need to be there.

It makes me wonder how it can go on like this. Users only keep adding content and never remove it. The income per user cannot grow forever, storage cannot get infinitely cheap, the model has to break one day?

jreese 4 years ago | | |

There's no meaningful benefit to dedicating any amount of DC equipment just to stale content. Those are spindles (and networks) that could be taking meaningful hot reads and writes, and colocating stale and hot data is generally a better use of capacity than concentrating hot data in fewer locations.

cbetti 4 years ago | | |

This isn't how scaling works though. Across all applications the hot data growth outpaces the cold.

So if you're designing capacity for exponential growth, the future point at which you stop experiencing exponential growth and only have to worry about roughly linear growth is a much easier problem to solve.

londons_explore 4 years ago |

In a fleet of 100,000 machines, there will always be some clear failures... When the machine has 2x the number of segfaults of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM and CPU... easy!

But the painful ones are the 'subtle' failures. Why does machine PABL12 sometimes give NaN as a result while the other 99,999 machines return sensible numbers? And yet all the burn-in hardware tests pass...

The solution was to simply exclude any machines that were outliers. Any machine in the top or bottom 0.01% for any metric was simply excluded from future workloads.

Sure, in most cases there was nothing wrong with the hardware, but when you're spending hours debugging some fault caused by a sometimes-bad floating point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine will end up doing some other task where data consistency matters less.
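A minimal sketch of the outlier-banning idea described above, assuming per-machine metrics are already collected into plain dictionaries. The function name `find_outlier_machines`, the metric names, and the data layout are hypothetical illustrations, not the commenter's actual tooling.

```python
from typing import Dict, Set


def find_outlier_machines(
    metrics: Dict[str, Dict[str, float]],
    tail_fraction: float = 0.0001,  # 0.01% from each tail
) -> Set[str]:
    """Return machines in the top or bottom `tail_fraction` for any metric.

    `metrics` maps metric name -> {machine name -> observed value}.
    """
    banned: Set[str] = set()
    for per_machine in metrics.values():
        # Rank machines from lowest to highest value for this metric.
        ranked = sorted(per_machine, key=per_machine.get)
        # Number of machines to cut from each tail (10 for a 100,000 fleet).
        k = round(len(ranked) * tail_fraction)
        if k == 0:
            continue  # fleet too small for this fraction; ban nothing
        banned.update(ranked[:k])   # bottom tail
        banned.update(ranked[-k:])  # top tail
    return banned
```

Note that this bans a fixed share of machines per metric whether or not anything is wrong with them, which matches the trade-off described: some healthy machines are excluded in exchange for not debugging the rare broken one.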

jeffbee 4 years ago | |

Was pabl12 an actual bad machine? Sounds somehow plausible, as if I heard of it before.

It was an annoying struggle trying to raise the visibility of broken CPUs during my years at Google SRE. The SRE org and the rest of the software side of Tech Infra resisted the whole concept, even though it was well-known among platforms hardware eng. The process for taking a known-bad machine out of service involved 1) the machine being reported independently by three different teams; 2) the machine continuing to be in service for days or weeks, at the leisure of some very asynchronous automation; and 3) the machine being returned immediately to service because it passed all of the cursory checks during reinstall. Really irritating. Consequently every major service had to maintain their own private blacklist.

It's nice to see that some influential people on the software side are starting to come around, with papers like "Cores That Don't Count" etc, but man they could have been on this boat a decade ago.

mjevans 4 years ago | | |

Reminds me of the typical story of someone with a complete damage protection plan and a flaky device. Take it in for repairs, passes all the tests, but they know it's funky, so snap it in half or otherwise completely wreck it right in front of the tech and demand that repair.

bryan_w 4 years ago | | |

Usually teams would consider a machine "bad" if that node in the cluster had elevated errors compared to the rest of the cluster they were running. Unfortunately this doesn't tell hardware teams what actually went wrong.

If one could show that the CPU said 2+2=9, I'm sure they would yank it out right away, but "it returns 500 errors a lot" isn't very debuggable. The only thing they can do is run the diag and return it to service if nothing comes up.
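The "2+2=9" kind of evidence can be sketched as a known-answer test: run deterministic computations repeatedly and record any result that differs from a precomputed answer. This is only an illustration under assumptions: the names `run_known_answer_checks` and `DEFAULT_CHECKS` are hypothetical, and a real screening tool would need to exercise specific instructions on each individual core, which this Python sketch does not do.

```python
import hashlib
from typing import Any, Callable, List, Tuple

# (name, computation, expected result)
Check = Tuple[str, Callable[[], Any], Any]


def add(a: int, b: int) -> int:
    """Addition done at call time rather than folded into a constant."""
    return a + b


DEFAULT_CHECKS: List[Check] = [
    ("integer add", lambda: add(2, 2), 4),
    ("integer sum", lambda: sum(range(1000)), 499500),
    (
        "sha256 of b'abc'",
        lambda: hashlib.sha256(b"abc").hexdigest(),
        "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
    ),
]


def run_known_answer_checks(
    checks: List[Check], repeats: int = 100
) -> List[Tuple[str, Any, Any]]:
    """Return (name, expected, got) for every check that produced a wrong answer."""
    failures = []
    for name, compute, expected in checks:
        for _ in range(repeats):
            got = compute()
            if got != expected:
                failures.append((name, expected, got))
                break  # one concrete mismatch is enough to report
    return failures
```

A non-empty result gives the hardware team a concrete "expected X, got Y" record instead of an elevated error rate.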

notacoward 4 years ago | |

> When the machine has 2x the number of segfaults of any other machine in the fleet, you send it for repairs

At that scale, it's quite likely sent to repair automatically and whoever's on call just gets a notification.

jgrahamc 4 years ago |

Some might enjoy this old Cloudflare debugging story about random crashes in production.

https://blog.cloudflare.com/however-improbable-the-story-of-...

ignoramous 4 years ago | |

Add to that a bunch of "rare" / "unlikely" / "silent" CPU bugs (compute errors) that Google and Facebook see with regularity: https://muratbuffalo.blogspot.com/2021/06/cores-that-dont-co...

> So Google found fail-silent Corruption Execution Errors (CEEs) at CPU/cores. This is interesting because we thought tested CPUs do not have logic errors, and if they had an error it would be a fail-stop or at least fail-noisy hardware errors triggering machine checks. Previously we had known about fail-silent storage and network errors due to bit flips, but the CEEs are new because they are computation errors. While it is easy to detect data corruption due to bit flips, it is hard to detect CEEs because they are rare and require expensive methods to detect/correct in real-time.

https://muratbuffalo.blogspot.com/2021/06/silent-data-corrup...

> The paper claims that silent data corruptions can occur due to device characteristics and are repeatable at scale. They observed that these failures are reproducible and not transient. Then, how come did these CPUs pass the quality control tests by the chip producers? In soft-error based fault injection studies by chip producers, CPU CEEs are evaluated to be a one in a million occurrence, not 1 in 1000 observed at deployment at Facebook and Google... The paper also says that increased density, technology scaling, and wider datapaths increase the probability of silent errors.

dfdz 4 years ago | |

Thanks for sharing.

notacoward 4 years ago |

To be clear, this is about corruption in the CPU/GPU/memory complex. There's a whole separate set of techniques (some of which I worked on) to detect and correct data corruption on disk.

huhtenberg 4 years ago | |

I'm in the same boat and my takeaway is that the vast majority of "silent" on-disk corruption actually happens on the way to the storage, i.e. the data gets corrupted in some RAM it passes through and then just ends up being written out in a corrupted state. This is because virtually all modern drives implement per-sector FEC, so if a bit does flip on the disk, you will either get back the original data (now FEC-corrected) or you will get a read error.

That is, the so-called "bitrot" phenomenon is largely mis-attributed. Bitrot doesn't happen at rest. It happens in transit.
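One common guard against corruption in transit is an end-to-end checksum: compute it where the data originates and verify it at the final consumer, so a bit flipped in any buffer in between is caught. The sketch below assumes a simple framing of a 4-byte CRC-32 followed by the payload; the `seal` and `unseal` names and the frame layout are hypothetical.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32, computed at the source."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unseal(frame: bytes) -> bytes:
    """Verify the CRC-32 at the destination and return the payload."""
    expected = int.from_bytes(frame[:4], "big")
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("payload corrupted in transit")
    return payload
```

The drive's own per-sector FEC cannot help here, because data that was already corrupted before the write is stored faithfully; only a check that travels with the data can notice.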

notacoward 4 years ago | | |

I can state categorically that bitrot on disk does exist, because that's one of the parts I worked on. It's pretty rare - unfortunately I don't think I can give you the numbers - but across enough exabytes it does happen enough to justify slow scans to detect it.
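A slow scan of this kind (often called scrubbing) can be sketched as: record a digest for every file once, then periodically re-read everything and compare. This is a file-level illustration under assumptions, not the system the commenter worked on; `build_manifest` and `scrub` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its current digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def scrub(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or no longer match the manifest."""
    damaged = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(rel)
    return damaged
```

A scan like this only detects damage, and it cannot tell corruption from an intentional edit, so it suits data that is written once; repairing a damaged file still requires a second copy.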

alipitch 4 years ago | | |

Slightly off-topic digression: This article discussed "enterprise" grade "silent data corruptions".

What are some recommendations for "personal data storage" grade "silent data corruptions"?

"personal data storage" for my case is < 1TB, text / binary files (jpg, mp4).

I am looking at a wikipedia list below, and then searching through hn comments.

<https://en.wikipedia.org/wiki/Comparison_of_file_systems#Blo...> column: Data checksum/ ECC

I found many comments on ZFS, and not so many on dm-integrity, BlueStore/CephFS, and others. So I am thinking of looking into ZFS, but if there are any recommendations, I would like to seek advice.

I am experimenting with Git LFS and git-annex, but I like the filesystem UI better, so I am looking for filesystem-like solutions.

Hnrobert42 4 years ago |

Interestingly, this site fails ungracefully (HTTP error code 500) when I try to visit from NordVPN, even after cycling through a few IP addresses. I’m noticing more and more sites block all VPN traffic. I get why, but it’s not good.

pnw 4 years ago |

The first thing I noticed about this article is that like all Facebook pages, it silently corrupted my back button.

mtVessel 4 years ago | |

Did you open it in Firefox? If so, that's the fault of Facebook Container, not the page.

PTOB 4 years ago |

I work on the physical side; building hyperscale datacenters. You guys should try your hand at managing errors in that system. You've got it all: memory leaks, thermal overloads, misallocated heaps, pipes with strong type requirements, dropped packets ... you name it.

Melatonic 4 years ago | |

I would probably be overwhelmed just managing the infrastructure for your monitoring systems, and infrastructure is my main thing :-D

mad44 4 years ago |

https://muratbuffalo.blogspot.com/2021/06/silent-data-corrup...

HL33tibCe7 4 years ago |

Completely off-topic digression: I still think the name change to “Meta” is a big mistake. Subjectively, for some reason I just really dislike the name. More objectively, the branding is very muddled, e.g: serving an “Engineering at Meta” blog post on fb.com.

Often with these things it’s just about time; it feels wrong because you’re just not used to the change yet. Maybe that will happen, but it’s been months now. Usually with these changes I change my mind quicker than that.

nwsm 4 years ago | |

> the name change to "Meta" is a big mistake

I think it's too soon to tell. Facebook has really negative brand recognition (from my POV), and who knows, maybe "metaverse" style online interaction is the future. (For the record I'm anti-web3 and indifferent on metaverse communities)

CiPHPerCoder 4 years ago | | |

I will always say VR, I will never say "metaverse".

Their branding move was bold, yet unconvincing.

nowherebeen 4 years ago | | |

The name Meta dilutes the brand significantly. I bet if you ask people what Meta is, most people outside tech can't tell. But if you ask what Facebook is, 100% of them can. They took a really good brand name and trashed it to the point they needed to rebrand.

stewbrew 4 years ago | | |

Meta still redirects meta.com to https://about.facebook.com/meta

I don't think it's too soon.

johndfsgdgdfg 4 years ago | |

Can we please keep these kinds of rants and off-topic criticisms out of technical threads? Lately even reading technical threads has become difficult because of thread-hijacking off-topic rants.

ATsch 4 years ago | |

I feel like, given the negative connotations of "Facebook", that's by design.

nimbius 4 years ago | |

A muddled brand is better than the currently maligned harbinger of misery, disinformation and insurrection that Facebook has been mired in. Recruiters at Meta probably appreciate the distance.

Sindisil 4 years ago | | |

How many candidates wouldn't know that Meta == Facebook, at least within the tech spheres?

RoboTeddy 4 years ago |

Computational proofs of integrity (STARKs, SNARKs) could detect silent data corruptions (at the cost of a ~1000x slowdown)

I wonder if we’ll see them used for large scale applications whose correctness is critical.

raphaelj 4 years ago |

It would be better if Meta would focus on detecting spam at scale.

I put a desk chair on Marketplace last Friday, and got 8 messages that were actually scams. These were trying to "schedule" a Fedex/DHL pickup, and would redirect me to fake branded websites that were requesting my personal details and bank account. This was so obviously fake it baffled me Meta can't detect these automatically.

I am also getting multiple message requests per week asking for hookups. These are obviously fake [1].

---

[1] https://imgur.com/a/yZDPh3C

ebbp 4 years ago | |

It’s a different team, with a different skillset, that would be responsible for that. Big companies can focus on more than one thing at a time.

BbzzbB 4 years ago | |

They ban like 1.7B accounts per quarter, ignoring those blocked at registration. Isn't that focus?

Subjectively, I also see so much less bot activity on Facebook than I do on any other social media.

zitterbewegung 4 years ago | |

I think it's great when large tech companies give a snapshot of the cool or interesting things they do, but if bigger problems don’t seem to get that kind of focus, it just feels like a marketing / recruiting post (which isn’t that bad). The problem is that if they made their antispam systems public, they would be handing that information to spammers, which is a catch-22. Also, if spammers have humans in the loop to evade a spam system, stopping them is basically impossible.

spookthesunset 4 years ago | |

At the scale of FB, handling fraud is a non-trivial effort. At any given time there are probably thousands of somewhat well funded fraud teams looking to bypass whatever shiny new countermeasure FB adds to their site.

There is a lot of money to be made from defrauding FB users. This monetary incentive results in criminals investing tons of effort into bypassing anti-fraud stuff. It is a non-stop effort of incremental moves on both parties that will carry on for as long as FB users remain a juicy target.

monkeybutton 4 years ago | |

Somehow I knew it was going to be a bit.ly link before opening the image

tupac_speedrap 4 years ago |

Content seems interesting, but the generic corporate image at the top, crap font and off-black low-contrast text colour are getting on my nerves.

throw03172019 4 years ago | |

Reader mode works great on mobile Safari.