ECC matters

ECC matters (realworldtech.com)

1053 points by rajesh-s 5 years ago | 550 comments

I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."

Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.

ksec 5 years ago | |

>I still remember Craig Silverstein being asked what his biggest mistake at Google was and him answering "Not pushing for ECC memory."

Did they (Google) or he (Craig Silverstein) ever officially admit it on record? I did a Google search and the results that came up were all on HN. Did they at least put out a few PR pieces saying that they are using ECC memory now? I don't see any when searching. Admitting they made a mistake without officially saying it?

I mean, the whole "servers or computers might not need ECC" insanity was started entirely because of Google [1] [2], with news and articles published even in the early 00s [3]. After that it spread like wildfire and became a commonly accepted fact that even Google doesn't need ECC. Just like "Apple uses custom ARM instructions to achieve their fast JS VM performance" became a "fact". (For the last time, no they didn't.) And proponents of ECC memory have been fighting this misinformation like mad for decades, to the point of giving up and only ranting about it every now and then. [3]

[1] https://blog.codinghorror.com/building-a-computer-the-google...

[2] https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

[3] https://danluu.com/why-ecc/

djur 5 years ago | | |

Your [3] has a footnote quoting a Google book that reads "Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost... The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM"

sitkack 5 years ago | | |

The fact that ECC isn't the default across everything is a failure of human cognition and Capitalism.

starfallg 5 years ago | |

Recent advances have blurred the lines a bit. The ECC memory that we all know and love is mainly side-band ECC, with the memory bus widened to accommodate the ECC bits driven by the memory controller. However, as process sizes shrink, bit flips become more likely, to the point that many types of memory now have on-die ECC, where the error correction is handled internally on the DRAM modules themselves. This is present on some DDR4 and DDR5 modules, but information on this is kept internal by the DRAM makers and not usually public.

https://semiengineering.com/what-designers-need-to-know-abou...

There has been a lot of debate regarding this that was summarised in this post -

https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

wtallis 5 years ago | | |

> This is present on some DDR4 and DDR5 modules, but information on this is kept internal by the DRAM makers and not usually public.

On-die ECC is going to be a standard feature for DDR5. I'm not aware of any indication that anyone has implemented on-die ECC for DDR4 DRAM, and Hynix at least has made clear statements that on-die ECC is new for their DDR5 and was not present in their DDR4.

tyoma 5 years ago | |

Figure this is as good of a time as any to ask this:

There are many various DRAMs in a server (say, for disk cache). Has Google or anyone who operates at a similar scale seen single bit errors in these components?

bsder 5 years ago | | |

This is as old as computing and predates Google.

When America Online was buying EV6 servers as fast as DEC could produce them, they used to see about 1 double bit error per day across their server farm that would reboot the whole machine.

DRAM has only gotten worse--not better.

gh02t 5 years ago | | |

The supercomputing community has looked at some of the effects on different parts of the GPU.

https://ieeexplore.ieee.org/abstract/document/7056044

sitkack 5 years ago | | |

Yes.

Bit flips (for all reasons) occur in buses, registers, caches, etc. Anything that has state can have state changed incorrectly.

This is why filesystems like ZFS exist and storage formats have pervasive checksums.

itisit 5 years ago | |

New Yorker article that credits Jeff Dean and Sanjay Ghemawat with discovering the company’s bitflip issue:

https://www.newyorker.com/magazine/2018/12/10/the-friendship...

grishka 5 years ago | |

I remember reading how someone registered some google domains with a single bit flipped, and saw actual requests coming to them.

andrewstuart2 5 years ago | | |

If you or anybody can remember the source article, that sounds like an interesting read!

Edit: found one with a quick search. https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...

And https://www.researchgate.net/publication/262273269_Bitsquatt...

Faaak 5 years ago | | |

A long time ago I did that with "CDN" domains. I bought ~10 of them (variations of fbcdn, akamai, and ytimg). I _did_ see some traffic (some hits per hour if I remember well), and many of them were from cheap handheld phones (from the user-agent).

NickNameNick 5 years ago | | |

That works for any domain that's busy enough.

Random bit flips happen on client machines and on routers.

If there are enough requests for a domain name, some of those requests will be subject to one of those bit-flips.
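To make the idea concrete, here is a rough sketch of how one might enumerate the names a single flipped bit could turn a hostname label into. The function name `one_bit_variants` and the choice to only keep lowercase letters, digits and hyphens are my own assumptions for illustration, not anything from the linked talks.

```python
import string

# Characters allowed in a hostname label. DNS is case-insensitive, so a flip
# that only changes case is the same name and is deliberately left out.
VALID = set(string.ascii_lowercase + string.digits + "-")


def one_bit_variants(label: str) -> list[str]:
    """Return every valid label that differs from `label` by one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in VALID:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return sorted(variants)
```

Calling `one_bit_variants("fbcdn")` gives the list of neighbouring labels someone could register. It says nothing about how often any of them actually receives traffic.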

saagarjha 5 years ago | | |

That might just be typos in some cases?

gigatexal 5 years ago | |

I mean early on sure at a startup where you’re not printing money I can see how saving on hardware makes sense. But surely you don’t need an MBA to know that hardware will continue to get cheaper whereas developers and their time will only get more expensive: better to let the hardware deal with it than to burden developers with it ... I’d have made the case for ECC but hindsight being what it is ...

colejohnson66 5 years ago | | |

But if you can save $1M+ now, then throw the cost of fixing it onto the person who replaces you, why do you care? You already got your bonus and jumped ship.

finiteloop 5 years ago | |

One of the best quotes in the Google quotes file an early Googler maintained (I am sure I am screwing it up):

“I’ve heard of defensive programming, but never adversarial memory.” — Ben Gomes

fragmede 5 years ago | | |

Close!

> I've never thought of defensive programming in terms of adversarial memory.

maria_weber23 5 years ago | |

ECC memory can't eliminate the chances of these failures entirely. They can still happen. Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code. So in theory the behavior of software under random bit flips is well... Random. You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum. I could imagine that doing so would still be cheaper than using ECC ram, at least around 2000.

Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC ram is the opposite. You just make it so unlikely to happen that you will generally not encounter these errors at scale anymore, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.

Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
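A minimal sketch of the quorum idea, assuming each replica simply hands back a hashable result and that a strict majority is required. The name `quorum` and the error handling are illustrative only, and the machine doing the comparison is assumed to be trustworthy, which is its own problem.

```python
from collections import Counter


def quorum(results):
    """Return the value reported by a strict majority of replicas.

    Raises RuntimeError if there are no results or no value has a majority.
    """
    if not results:
        raise RuntimeError("no replica results")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, `quorum([42, 42, 43])` returns 42, while three different answers raise instead of guessing.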

giantrobot 5 years ago | | |

I don't think ECC is going to give anyone a false sense of security. The issue at Google's scale is they had to spend thousands of person-hours implementing in software what they would have gotten for "free" with ECC RAM. Lacking ECC (and generally using consumer-level hardware) compounded scale and reliability problems or at least made them more expensive than they might otherwise have been.

Using consumer hardware and making up reliability with redundancy and software was not a bad idea for early Google but it did end up with an unforeseen cost. Just a thousand machines in a cosmic ray proof bunker will end up with memory errors ECC will correct for free. It's just reducing the surface area of "potential problems".

AaronFriel 5 years ago | | |

It can't eliminate it but:

1. Single bitflip correction along with Google's metrics could help them identify algorithms they've got, customers' VMs that are causing bitflips via rowhammer, and machines which have errors regardless of workload

2. Double bitflip detection lets Google decide if they, say, want to panic at that point and take the machine out of service, and they can report on what software was running or why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.

The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?

I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.

saagarjha 5 years ago | | |

There was an interesting challenge at DEF CON CTF a while back that tested this, actually. It turns out that it is possible to write x86 code that is 1-bit-flip tolerant–that is, a bit flip anywhere in its code can be detected and recovered from with the same output. Of course, finding the sequence took (or so I hear) something like 3600 cores running for a day to discover it ;)

tomxor 5 years ago | | |

> Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code.

There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!

Although it was intended to protect against bad software I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server" which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR Userland software could in theory also benefit provided it was written in such a way to be able to continue gracefully on reloading.

slumdev 5 years ago | | |

Error-correcting code (the "ECC" in ECC) is just a quorum at the bit level.

DSingularity 5 years ago | | |

You need two alpha particles hitting the same rank of memory for failure to happen. Although super rare, even then it is still correctable. You need three before it is silent data corruption. Silent corruption is what you get with non ECC with even a single flip.

hn3333 5 years ago | | |

Bit flips can happen, but regardless if they can get repaired by ECC code or not, the OS is notified, iirc. It will signal a corruption to the process that is mapped to the faulty address. I suppose that if the memory contains code, the process is killed (if ECC correction failed).

colejohnson66 5 years ago | | |

> You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum.

The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.

sobriquet9 5 years ago | | |

If you use multiple computers doing the same calculation and then take the answer from the quorum, how do you ensure the computer that does the comparison is not affected by memory failures? Remember that all queries have to go through it, so it has to be comparable in scale and power.

dijit 5 years ago |

I beg this, every time this conversation comes up it’s the same answer “I don’t see a problem”.

It’s so easy to chalk these kinds of errors up to other issues: a little corruption here, a running program goes berserk there. Could be a buggy program or a little accidental memory overwrite. Reboot will fix it.

But I ran many thousands of physical machines, petabytes of RAM, I tracked memory flip errors and they were _common_; common even in less dense memory and in thick metal enclosures surrounded by mesh, even though density and shielding affect bitflips a lot.

My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.

cbanek 5 years ago |

As someone who has had to read thousands of random game crash reports from all over the interwebs (you know when Windows says you might want to send that crash log? like that), I totally agree.

Of all the things to be worried about, like OS bugs, bad hardware configuration, etc., bad memory is one of those really troubling things. You look at the code and say "it can't make it here, because this was set" but when you can't trust your memory you can't trust anything.

And as the timeline goes to infinity, you may also get one of these reports and be asked to fix it... good luck.

zdw 5 years ago |

Good news is that for DDR5, ECC is a required part of the spec and should be a feature of every module:

https://www.anandtech.com/show/15912/ddr5-specification-rele...

simias 5 years ago |

I used to be pretty skeptical of ECC for consumer-grade hardware, mainly because I felt that I'd always prefer cheaper/more RAM over ECC RAM even if it meant that I'd get a couple of crashes every year due to rogue bitflips. For servers it's a different story, but for a desktop I'm fine dealing with some instability for better performance.

But these days with the RAM density being so high and bitflipping attacks being more than a theoretical threat it seems like there's really no good reason not to switch to ECC everywhere.

tokamak-teapot 5 years ago | |

Are there any Ryzen boards that support ECC and actually correct errors?

adrian_b 5 years ago | | |

As others have replied, all the ASRock boards where I have ever checked the specifications do support ECC and also some ASUS boards support ECC, e.g. all ASUS workstation boards.

Because ECC means Error Correcting Code, by definition, any board that claims ECC support must actually correct the errors. The ECC codes used now, with 8 extra bits for each 64 data bits, correct any 1-bit error and detect any 2-bit errors.

Very old computers (25 years old, or more) used parity instead of ECC and they just detected any 1-bit error (and any errors with an odd number of flipped bits), without being able to correct the errors.
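For anyone curious what "8 extra bits for each 64 data bits" can look like, below is a toy extended Hamming (72,64) code: seven check bits at the power-of-two positions plus one overall parity bit. This is only a sketch of the SECDED idea. Real memory controllers may well use a different code with the same 72/64 shape, and the layout and function names here are my own assumptions.

```python
PARITY_POS = (1, 2, 4, 8, 16, 32, 64)  # Hamming check-bit positions
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def _odd_parity(word: int) -> bool:
    return bin(word).count("1") % 2 == 1


def encode(data: int) -> int:
    """Spread 64 data bits over a 72-bit codeword (bit 0 = overall parity)."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for p in PARITY_POS:  # set check bits so the syndrome becomes zero
        if s & p:
            word |= 1 << p
    if _odd_parity(word):  # make the total number of 1 bits even
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = _syndrome(word)
    if _odd_parity(word):
        # Odd overall parity: assume one flipped bit, located at position s
        # (s == 0 means the overall parity bit itself was hit).
        if s >= 72:
            return None, "uncorrectable"
        word ^= 1 << s
        status = "corrected"
    elif s != 0:
        # Even parity but non-zero syndrome: two bits flipped, detect only.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status
```

Flipping any one of the 72 bits of `encode(x)` still decodes back to `x`, and flipping any two is reported as uncorrectable. Three or more flips can be miscorrected or missed entirely, which is the limit of this kind of code. Plain parity would be just the overall parity bit on its own: it notices an odd number of flips but cannot say where they are.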

gruez 5 years ago | | |

quick search:

https://rog.asus.com/forum/showthread.php?112750-List-Asus-M...

dannyw 5 years ago | | |

Yes, almost all of them correct single-bit flips and detect but do not correct multiple-bit flips.

loeg 5 years ago | | |

Yes. E.g., all ASRock boards.

fulafel 5 years ago | | |

The functionality seems to all be in the memory controller integrated to the CPU.

fctorial 5 years ago | |

> cheaper/more RAM

It's faster too.

ekianjo 5 years ago | |

> no good reason not to switch to ECC everywhere.

Not all CPUs support ECC however.

josefx 5 years ago | | |

Just Intel fucking over security by making ECC a non feature on consumer grade hardware - wouldn't be surprised if it was just a single bit flipped in a feature mask.

loeg 5 years ago | | |

(Intel)

otterley 5 years ago |

About 1/3 of Google's machines and 8% of Google's DIMMs in their fleet suffer at least one correctable memory error per year: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

jjeaff 5 years ago | |

Which means, assuming Google is running very large machines with lots of memory, that one might expect a single correctable error once every 6-10 years on your average workstation or small server. That's generously assuming your workstation has 1/3 as much memory as the average Google server.

Nebasuke 5 years ago | | |

Google does not use very large or even large machines for most of their fleet. You can quickly see in the paper this is for 1, 2, and 4 GB RAM machines (in 2006-2008).

tpetry 5 years ago | | |

With a single bit flip on 8% of the DIMMs you only need 12.5 DIMMs in your workstation to have one bit flip every year. Not everyone has that many DIMMs, but at least 4 is pretty normal. So on average every 3 years for every workstation.
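The back-of-envelope numbers can be checked in a few lines. This treats the 8% figure as an independent per-DIMM yearly rate, which is a simplifying assumption on my part (the figure quoted above is "at least one error per year", and errors tend to cluster on bad DIMMs).

```python
P_DIMM = 0.08  # share of DIMMs with a correctable error per year (quoted above)


def years_between_errors(n_dimms: float, p: float = P_DIMM) -> float:
    """Naive estimate: treat p as a yearly error rate per DIMM."""
    return 1 / (n_dimms * p)


def p_any_error_per_year(n_dimms: int, p: float = P_DIMM) -> float:
    """Chance that at least one of n independent DIMMs is affected in a year."""
    return 1 - (1 - p) ** n_dimms


print(years_between_errors(12.5))  # about 1 year
print(years_between_errors(4))     # about 3.1 years
print(p_any_error_per_year(4))     # about 0.28
```

So under these assumptions a 4-DIMM workstation has a bit under a 30% chance per year of seeing at least one affected DIMM.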

But I don't know how relevant these metrics from 2009 are. Did memory get better or worse compared to 2009 for bit flips?

petermcneeley 5 years ago |

I would also add that Row Hammer Attacks are much harder on ECC.

When I first tried to replicate the row hammer attack I was not getting any results. Turns out I was doing this on ECC. On non ECC memory the same test easily replicated the row hammer attack.

https://en.wikipedia.org/wiki/Row_hammer

kensai 5 years ago |

“ECC availability matters a lot - exactly because Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation.”

Its.

There, I finally corrected Linus Torvalds in something. :))

MarkusWandel 5 years ago |

This is one justified Linus rant! My personal history includes data loss twice because of defective RAM, and many more RAMs discarded after the now obligatory overnight run of MemTest86+ (these were all secondhand RAMs - I would never buy a new one without a refund guarantee). My very first "PC" still had the ECC capability and I used it. My own now very dated rant on the subject: http://wandel.ca/homepage/memory_rant.html

mixmastamyk 5 years ago | |

A few years back memtest86 wouldn’t run on newer machines, has that been fixed?

MarkusWandel 5 years ago | | |

Wouldn't know, I don't run newer machines. But since it's a boot option on Fedora disks, I imagine it would run.

salmon 5 years ago | |

You bought used RAM DIMMs and were surprised that they failed?

MarkusWandel 5 years ago | | |

Used computers that have RAM in them. But as I wrote, two of those computers were brand new with new RAMs in them.

otterley 5 years ago |

D. J. Bernstein (of qmail/daemontools fame) spoke of it over a decade ago as well. https://cr.yp.to/hardware/ecc.html

slim 5 years ago | |

these days he's more famous for the NaCl crypto library

loup-vaillant 5 years ago | | |

For which bit flips are even more relevant: EdDSA has this nasty tendency of leaking the private key if the wrong bits are flipped (there are papers on fault injection attacks). People who sign lots of stuff all the time, say Let's Encrypt, could conceivably gain some peace of mind with ECC.

(Note: EdDSA is still much much better than ECDSA, most notably because it's easier to implement correctly.)

linsomniac 5 years ago |

This reminds me of last year we ordered a new $14K server, it arrived and we ran it through our burn-in process which included running memtest86 on it, and it would, after around 7 hours, generate errors.

Support was only interested if their built-in memory tester, which even on its most thorough setting would only run for ~3 hours, would show errors, which it wouldn't. IIRC, the BMC was logging "correctable memory errors", but I may be misremembering that.

"We've run this test on every server we've gotten from you, including several others that were exactly the same config as this, this is the only one that's ever thrown errors". Usually support is really great, but they really didn't care in this case.

We finally contacted sales. "Uh, how long do we have to return this server for a refund?" All of a sudden support was willing to ship us out a replacement memory module (memtest86 identified which slot was having the problem), which resolved the problem.

They were all too willing to have us go to production relying on ECC to handle the memory error.

scottlamb 5 years ago | |

> They were all too willing to have us go to production relying on ECC to handle the memory error.

Good call in not accepting this. Even ignoring the possibility you have a double-bit error that causes a crash, or a triple-bit error that maybe can't be detected, frequent ECC errors are problematic. I've encountered machines that consistently ran my software horribly slowly. I don't remember specifics, but let's say at least 100X latency of other machines for similar operations. When I dug in, I found these machines had a huge amount of correctable memory errors. The correction apparently degrades performance significantly. I'm not sure exactly why, but I guess there's an MCE trap to report the memory error, and perhaps that path is slow.

dboreham 5 years ago |

You don't need to look at kernel crashes to speculate about bus and memory errors -- just check the logs on a few systems that do have ecc. Pretty soon you'll see correctable errors being reported.
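On Linux, one place to look is the EDAC counters in sysfs, assuming the EDAC driver for your memory controller is loaded. A small sketch that reads them (the function name and the returned dict layout are just my choices):

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc"):
    """Return {controller: {"correctable": n, "uncorrectable": m}}.

    Returns an empty dict if the EDAC directory is missing, e.g. when there
    is no ECC or the driver is not loaded.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"correctable": ce, "uncorrectable": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, c)
```

An empty result does not prove there were no errors, only that nothing is being reported through this interface.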

maddyboo 5 years ago | |

I don’t know much about this topic, but is it possible that ECC memory is more prone to single bit errors than non-ECC memory because there is less pressure on companies to minimize such errors? If this were the case, it would skew the data.

justin66 5 years ago | | |

There are 12.5% more memory cells for a given module size, which equals more targets to possibly be flipped by cosmic rays. It’s not crazy to think that modules of equivalent quality (same brand, same chip part numbers) would experience a greater incidence of that kind of single bit flip (which would be corrected on the ECC modules). If a manufacturer were shipping chips prone to bit flipping because of slightly radioactive packaging, as happened at times in the past, you might see something similar.

But you’ve got it backwards about the incentives. A manufacturer has less incentive to deliberately ship a defective part in the case of ECC modules. If the modules consistently log ECC errors, they can easily be identified and returned under warranty to the manufacturer. A consumer is much less likely to identify an intermittent problem with a non-ECC part.

JoeAltmaier 5 years ago |

ECC works if done right. Accessing a memory location can fix bit-flips (ECC is a 'correcting' code). But systems that don't regularly visit every memory location can accumulate risk. Those dark corners of RAM can eventually get double-bit errors and be uncorrectable. So an OS might 'wash' RAM during idle moments, reading every location in a round-robin manner to get ECC to kick in and auto-correct. Doesn't matter how fast (1M every hour or whatever) as long as somehow ECC has a chance to work.
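A toy model of why the washing helps, assuming a SECDED-style code that can fix one flipped bit per word and that a scrub pass writes the corrected word back. Everything here (tick-based time, the `simulate` helper, ignoring the case where two flips hit the same bit) is a simplification for illustration.

```python
def scrub(flips_per_word):
    """One scrub pass: fix words with one flipped bit, report the rest."""
    uncorrectable = []
    for i, flips in enumerate(flips_per_word):
        if flips == 1:
            flips_per_word[i] = 0      # corrected and written back
        elif flips >= 2:
            uncorrectable.append(i)    # too many flips to correct
    return uncorrectable


def simulate(events, n_words, ticks, scrub_every=None):
    """events: (tick, word_index) bit flips. Returns indices of lost words."""
    memory = [0] * n_words
    lost = set()
    for t in range(ticks):
        for when, word in events:
            if when == t:
                memory[word] += 1
        if scrub_every and t % scrub_every == 0:
            lost.update(scrub(memory))
    lost.update(scrub(memory))         # final read of every word
    return sorted(lost)


events = [(1, 3), (10, 3)]             # two flips in the same word, far apart
print(simulate(events, n_words=8, ticks=12, scrub_every=5))  # []
print(simulate(events, n_words=8, ticks=12))                 # [3]
```

With periodic scrubbing the first flip is repaired before the second one lands. Without it, the same word ends up with two flipped bits and is lost.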

jkuria 5 years ago |

For those, like me, wondering what ECC is, here's an explanation:

https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...

KingMachiavelli 5 years ago |

Is there such a thing as 'software' ECC where a segment in memory also has a checksum stored in memory and the CPU just verifies it when the memory segment is accessed?

It would be a lot slower than real ECC but it could just be used for operations that would be especially vulnerable to bit flips. It would also not know for certain whether the data segment or the segment holding the checksum was corrupted; it could only guess from their relative sizes (the checksum is much smaller, so it is less likely to have had a bit flip in its memory region).
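Something along these lines can be sketched in user space, with the caveat that it only detects corruption and cannot correct it. The `CheckedBuffer` class is hypothetical, and CRC-32 from `zlib` is just a convenient stand-in for whatever checksum one would really pick.

```python
import zlib


class CheckedBuffer:
    """Keeps a CRC-32 next to the data and verifies it on every read."""

    def __init__(self, data: bytes):
        self.write(data)

    def write(self, data: bytes) -> None:
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)

    def read(self) -> bytes:
        # A mismatch means the data or the stored checksum changed underneath
        # us; this check alone cannot tell which of the two it was.
        if zlib.crc32(self._data) != self._crc:
            raise ValueError("checksum mismatch: buffer or checksum corrupted")
        return bytes(self._data)
```

Recovery would still need a second copy or a real error-correcting code on top, so this is closer to the parity end of the spectrum than to ECC.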

a1369209993 5 years ago | |

Actually... there is a word of memory that you already have to read every time you access a region of memory: the page table entry for that region. If you have 64-byte cache lines, that's 64 lines per (4KB) page, so you could load a second 64-bit word from the page table[0], and use that as a parity bit for each cache line, storing it back on write the same way you store active and dirty bits in the PTE proper. Actual E[correcting]C would require inflating the effective PTEs from 8(original)-16(parity) bytes to about 64(7 bits per line, insufficient)-128(15, excessive), which is probably untenable, but you could at least get parity checks this way.

There's also the obvious tactic of just storing every logical 64-bit word as 128 bits of physical memory, which gives you room for all kinds of crap[1], at the expense of halving your effective memory and memory bandwidth.

0: This is extremely cheap since you're loading a 64- vs 128-bit value, with no extra round trip time and still fits in a cache line, so you're likely just paying extra memory use from larger page tables.

1: Offhand, I think you could fit triple or even quadruple error correction into that kind of space (there's room for eight layers of SECDED, but I don't remember how well bit-level ECC scales).

temac 5 years ago | |

Intel has some recent patents on that.

freeqaz 5 years ago |

I bought ECC RAM for my laptop and it definitely was about 4x the price. It's valuable to me for a few reasons -- peace of mind being a big one.

Bit flips happen and are real. I really wish ECC was plentiful and not brutally expensive!

bitcharmer 5 years ago | |

This is the first time I hear about a laptop that supports ECC memory. Could you please share the make and model?

dijit 5 years ago | | |

I have a Dell Precision 5520 (chassis of an XPS 15) which has a Xeon and ECC memory.

Finding a memory upgrade seems difficult though.

lb1lf 5 years ago | | |

-My boss has a Xeon Dell - a 7550, methinks - luggable.

It is filled to the gunwales with ECC RAM.

Cost him the equivalent of $7k or so. Eeek.

xxs 5 years ago | | |

Lenovo has Xeon laptops[0], and technically Intel used to support ECC on i3 (and celeron, etc.)

0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/Thi...

bluedino 5 years ago | | |

Lenovo (P series) and HP workstation models also support ECC

temac 5 years ago | |

Note that the price is mostly due to market segmentation, in your case most of it by the laptop vendor (of course some for Intel, but not that much compared to the laptop vendor)

Xeon with ECC are not that overpriced compared with similar Core without. Likewise, RAM sticks with ECC are cheap to produce (basically just one more chip to populate per side per module). Likewise soldered RAM would simply add maybe $10 or $20 of extra chips.

washadjeffmad 5 years ago | |

For the price, it made more sense for me to buy an R630 and populate it with a few less expensive, higher capacity ECC RDIMMs. I don't really need ECC as a local feature, so this lets me run on the mobile I want.

jjeaff 5 years ago | |

You should be able to check logs for corrected errors, right?

I'm guessing you won't find any.

phh 5 years ago |

I don't know if ECC is that important, but reliability of RAM (or any storage) feels pretty crazy to me. 128GB being refreshed every second for a month without error requires that the per-bit refresh process has a reliability of 99.9999999999999999% to be flawless. Considering we are dealing with quantum effects (which are inherently probabilistic), I wouldn't trust myself to design anything like that.
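A quick sanity check of that order of magnitude, taking the "refreshed every second" figure at face value and assuming 128 GiB and a 30-day month (both my assumptions). Real DRAM refreshes more often than once a second, which only makes the requirement tighter.

```python
def bit_refreshes(gib: int, seconds: int) -> int:
    """Number of individual bit refreshes, at one refresh per bit per second."""
    bits = gib * 2**30 * 8
    return bits * seconds


month = 30 * 24 * 3600
n = bit_refreshes(128, month)

# For roughly one expected error over the whole month, each refresh may fail
# with probability of at most about 1/n.
print(f"{n:.2e} bit refreshes, allowed failure rate about {1 / n:.1e}")
```

That comes out at a few times 10^18 refreshes, so a per-refresh failure probability somewhere below 10^-18, in line with the string of nines above.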

Now back to ECC, I'll probably be corrected, but I don't think ECC helps gain more than two orders of magnitude, so we still need incredibly reliable RAM. If we move to ECC RAM by default everywhere, aren't we simply going to get less reliable RAM in the end?

johnklos 5 years ago |

From the fortune database:

As far as we know, our computer has never had an undetected error. -- Weisert

londons_explore 5 years ago |

I simply care that my computer executes code perfectly. Let's settle on "one instance of unintended behaviour per hundred years" for that metric.

If it needs ECC memory to do that, then fit it with ECC memory. If there are other ways to achieve that (for example deeper dram cells to be more robust to cosmic rays) that's fine too.

Just meet the reliability spec - I don't care how.

simias 5 years ago | |

Then you'll have to pay a huge premium for that privilege. I can assure you that your standard computer components are not rated for century-scale use.

That's why I've always been on the fence with this ECC thing. For servers it's vital because you need stability and security.

For desktops I think that for a long time it was fine without ECC. If I have to choose between having, say, 30% more RAM or avoiding a potential crash once a year, I'll probably take the additional RAM.

The problem is that now these problems can be exploited by malicious code instead of just merely happening because of cosmic rays. That's the main argument in favour of ECC IMO, the rest is just a tradeoff to consider.

ClumsyPilot 5 years ago | | |

But it isn't just a crash, it's also silent data corruption that will never be detected

loup-vaillant 5 years ago | | |

> I can assure you that your standard computer components are not rated for century-scale use.

And that's probably not what GP asked for. There's a difference between guaranteeing an error rate of 1 error per century of use on average, and guaranteeing it over the course of an actual century. It might be okay to guarantee that error rate for only 5 years of uninterrupted use, and degrade after that. For instance:

  Years  1- 5:  1 error  per century.
  Years  6-10:  3 errors per century.
  Years 11-15: 10 errors per century.
  Years 16-20: 20 errors per century.
  Years 21-30:  1 error  per *year*.
  Years 30+  : the chip is broken.

Now, given how energy hungry and polluting the whole computer industry actually is, it might be a good idea to shoot for extreme durability and reliability anyway. Say, sustain 1 error per century, over the course of fifty years. It will be slower and more expensive, but at least it won't burn the planet as fast as our current electronics.

temac 5 years ago | |

In "theory" it needs ECC because you must also protect the link between the CPU and the RAM. So with ECC fully in DRAM but no protection on the bus, you risk some errors during the transfer. However, maybe this kind of error is rare enough that you would have less than one per century. It probably depends on the motherboard design and fabrication quality though, and the environment where it is used.

paulie_a 5 years ago |

There was a great defcon talk a while back regarding using ECC. The concept was called "dns jitter"

Basically you can register domains using small bit differences for domains and start getting email and such for that domain

If I recall correctly the example given was a variation of microsoft.com

All because so much equipment doesn't use ECC

jeffbee 5 years ago | |

mic2osoft.com is only one bit away from microsoft.com. Used to see these problems all the time when I worked on gmail.

At Google even with ECC everywhere there wasn't enough systematic error detection and correction to prevent the global database of monitoring metrics from filling up with garbage. /rpc/server/count was supposed to exist but also in there would be /lpc/server/count and /rpc/sdrver/count and every other thing. Reminded me daily of the terrors of flipped bits.

thu2111 5 years ago | | |

Ahaha. Reminds me of when I worked there. One day a large service tanked in some datacenter because BigTable replication in that location just stopped. Digging in, it turned out the BigTable should have been replicating from YQ but had started trying to use QQ instead, which didn't exist. Q being one bit away from Y. Or it was something like that, I don't remember exactly. There'd been a bit flip in the exact part of memory that contained the name of the database cluster to replicate from!

zx2c4 5 years ago | |

Voila http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabur...

tyoma 5 years ago | | |

There were some great follow up talks as well! It turns out a viable attack vector was also MX records. And there was the guy who registered kremlin.re ( versus kremlin.ru ).

MAXPOOL 5 years ago |

Well shit.

I run some large ML models in my home PC and I get NaN's and some out of range floats every month or so. I have spent hours debugging but doing the same computation with the same random seeds does not recreate the problem.

How about GPU's and their GDDR SDRAM? Do they have parity bits?

layer8 5 years ago | |

Some pro-level Nvidia GPUs have ECC RAM, they are very expensive though. I don’t think regular gaming GPUs have parity, due to the extra cost, performance impact (probably minor but measurable) and irrelevance for gaming.

vbezhenar 5 years ago | | |

Cheap pro-level GPUs don't have ECC RAM either. And it's not easy to find out, it might be buried somewhere.

spacedcowboy 5 years ago |

Seems likely that “bad ram” was the reason for the recent AT&T fiber issues, given that 1 bit was being flipped reliably in data packets [1]

[1]: https://twitter.com/catfish_man/status/1335373029245775872?l...

p_l 5 years ago | |

I have in the past encountered an issue where a line card was stripping exactly one bit of address data. I don't know about the follow-up investigation, but it probably wasn't TCAM.

SV_BubbleTime 5 years ago | |

I think you meant seems unlikely

louwrentius 5 years ago |

ECC matters, even on the desktop, it's not even a discussion, to me.

If you think it doesn't matter: how do you know? If you don't run with ECC memory, you'll never know if memory was corrupted (and recovered).

That blue screen, that sudden reboot, that program crashing. That corrupted picture of your kid.

Who knows.

I'll tell you who knows. God damn every sysadmin (or the modern equivalent) can tell you how often they get ECC errors. And at even a small scale you'll encounter them. I have, on servers and even on a SAN storage controller, for crying out loud.

If you care about your data, use ECC memory in your computers.

knorker 5 years ago |

I have multiple times postponed buying new computers for YEARS, because I'm waiting for intel to get their head out of their ass and actually let me buy something that does ECC for desktop. (incl laptops)

I would have bought computers when I "wanted one". Now I buy them when I need one. Because buying a non-ECC computer just feels like buying a defective product.

In the last 10 years I would have bought TWICE as many computers if they hadn't segmented their market.

Fuck intel. I sense that Linus censored himself in this post, and like me is even angrier than the text implies.

skibbityboop 5 years ago | |

Have you finally stopped buying Intel? Current Ryzens are a much better CPU anyhow, just dump Intel and be happy with your ECC and everything else.

knorker 5 years ago | | |

I'm in the market for a new laptop (and have been for a few years). Is there something like the X1 Carbon but with ECC?

vbezhenar 5 years ago | |

There are plenty of Xeons which are suitable for desktops and there are plenty of laptops with Xeons.

Price is not nice though.

1996 5 years ago |

Linus is absolutely right.

I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. I can't get that, at all - even without the other fancy things I would like such as a 4k OLED with pen/touchscreen.

In 2020, even the Dell XPS stopped shipping OLED (goodbye dear 7390!)

I will gladly give my money to anyone who sells an AMD laptop with ECC. Hopefully, it will show there's demand for "high end yet non bulky laptops".

miahi 5 years ago | |

Lenovo P53 has 3 NVMe slots, 4k OLED with touchscreen (and optional pen) and up to 128GB ECC RAM if you choose the Xeon processor. It's big and heavy, but it exists.

I hope AMD will create a better market for the ECC laptop memory (right now it's hard to find + expensive).

1996 5 years ago | | |

I know - I had my eye on this very model, as you can even add an mSATA drive in the WWAN slot to get a 4th drive.

Unfortunately, Lenovo is not selling the P53 anymore, which is exactly why I say I can't get that even in a "bulky" version.

IgorPartola 5 years ago |

I wish this was more of a cohesive argument. He says he thinks it’s important and points to row-hammer problems but doesn’t explain why. Probably because the audience it was written for already knows the arguments of why, but this is not the best argument.

If in doubt, get ECC. Do your own research on how it works and why. This post won’t explain it, just will blame Intel (probably rightfully so).

eloy 5 years ago | |

He does explain it:

> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.

It might be false, but I think it's a reasonable assumption.

IgorPartola 5 years ago | | |

To someone on HN who isn’t familiar with what ECC does that explains nothing about how ECC works and how it could have prevented these situations. Or how often they really happen.

turminal 5 years ago | |

It's a message in a thread from a technology forum. I think its intended audience is people already familiar with ECC, unlike here on HN.

IgorPartola 5 years ago | | |

Exactly my point :)

tgbugs 5 years ago |

A relevant Bryan Cantrill talk segment on this, which heightens the paranoia around this. Namely, firmware hiding correctable errors and only reporting uncorrectable errors.

https://www.youtube.com/watch?t=2104&v=fE2KDzZaxvE

type0 5 years ago |

Consumer awareness about ECC needs to be better. With the recent security implications, I simply can't understand why more motherboard manufacturers don't support it on AMD. Intel of course is all to blame on the blue side; I stopped buying their overpriced Xeons because of this.

rajesh-s 5 years ago | |

Good point on the need for awareness!

The industry has convinced the average user of consumer hardware that PPA (Power, Performance, Area) is all that needs to get better with generational improvements. Hoping that the concerning aspects of security and reliability that have come to light in the recent past change this.

kozak 5 years ago |

I'm about to write some code that will allocate a random buffer, take a checksum of it, and just sit on the buffer, periodically checksumming it again until a bit flips. Or maybe even allocate a buffer of zeros and wait until a non-zero appears in it.
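
A minimal Python sketch of that idea, for anyone who wants to try it. It is only an illustration under some assumptions: `watch_buffer` and its parameters are made-up names, SHA-256 is an arbitrary choice of checksum, and nothing here stops the OS from swapping the buffer out or the CPU from serving reads from cache, so a clean run proves little.

```python
import hashlib
import os
import time


def checksum(buf) -> str:
    """SHA-256 digest of the buffer contents (any checksum would do)."""
    return hashlib.sha256(buf).hexdigest()


def watch_buffer(buf, interval_s=60.0, max_checks=None, sleep=time.sleep):
    """Re-checksum `buf` every `interval_s` seconds.

    Returns the number of the first check that no longer matches the
    baseline, or None if `max_checks` checks all passed. With
    max_checks=None the loop runs until a mismatch is seen.
    """
    baseline = checksum(buf)
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval_s)
        checks += 1
        if checksum(buf) != baseline:
            return checks
    return None


# Example use (not executed here): sit on 1 GiB of random bytes.
#   buf = bytearray(os.urandom(1 << 30))
#   print(watch_buffer(buf, interval_s=600))
```

For the all-zeros variant, `any(buf)` on a `bytearray(size)` does the same job without a checksum. The `sleep` parameter exists only so the loop can be exercised quickly in a test.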

FartyMcFarter 5 years ago |

Does anyone know why ECC memory requires the CPU to support it?

Naively, I can understand why error reporting has dependencies on other parts of the system, but it would seem possible for error correction to work transparently.

toast0 5 years ago | |

As implemented today, ECC is a feature of the memory controller. You need special ram, because instead of 8 parallel rams per bank, you need 9, and all the extra data lines to go to the controller.

Modern CPUs have integrated memory controllers, so that's why the CPU needs to support it.

Correction without reporting isn't great; anyway, you need a reporting mechanism for uncorrectable errors, or all you've done is ensure any memory errors you do experience are worse.
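
To make the "correct one, report two" behaviour concrete, here is a Python sketch of a textbook extended Hamming (SECDED) code. It is not what any particular memory controller implements (real controllers use their own codes); `encode` and `decode` are names chosen for this example. With 64 data bits this construction needs 8 check bits, 72 bits in total, which lines up with the 9-versus-8 chip ratio mentioned above.

```python
def encode(data_bits):
    """Return a SECDED codeword for a list of 0/1 data bits.

    Index 0 holds an overall parity bit; indices that are powers of two
    hold Hamming parity bits; the remaining indices hold the data.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:          # number of Hamming parity bits
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data bit
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2          # make total parity even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or
    'uncorrectable'. Data is only trustworthy for the first two."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    odd = sum(code) % 2                  # 1 if an odd number of bits flipped
    fixed = list(code)
    if not odd:
        status = "ok" if syndrome == 0 else "uncorrectable"
    elif syndrome <= n:
        fixed[syndrome] ^= 1             # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"         # syndrome points outside the word
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data
```

A single flipped bit anywhere in the 72-bit word comes back as `"corrected"` with the original data; two flipped bits come back as `"uncorrectable"`, which is the case that has to be reported rather than silently passed on. Three or more flips can be miscorrected, which is the limit of this kind of code.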

fomine3 5 years ago | | |

Error correcting and reporting is better, but even only correcting is better than non-ECC. I wonder whether this compromise could be accepted by Intel.

TomVDB 5 years ago | |

I think the memory just provides additional storage bits to detect the issue, but doesn't contain the logic.

This is in line with all technical parameters of DRAM: everything must be as cheap as possible, and all the difficult parts are moved to the memory controller.

Which is the right thing to do, because you can share one memory controller with multiple DRAM chips.

wmf 5 years ago | |

Historically the detection and correction is performed in the memory controller not the DRAM.

vlovich123 5 years ago |

A couple of years ago there were advances that claimed to make Rowhammer work on ECC RAM even with DDR4 [1]. Is that no longer a concern for some reason?

I would think the only guaranteed solutions to Rowhammer are actually cryptographic digests and/or guard pages.

[1] https://www.zdnet.com/article/rowhammer-attacks-can-now-bypa...

theevilsharpie 5 years ago | |

ECC isn't a direct mitigation against Rowhammer attacks, as memory errors caused by three or more flipped bits would still go undetected (unless you're using ChipKill, but that's a rare setup).

However, flipping three bits simultaneously isn't trivial, and the attempts that flip fewer bits will be detected and logged.

GregarianChild 5 years ago | | |

Isn't ChipKill just another form of ECC? If so, there is some number of bitflips that ChipKill can no longer correct / detect. [1] seems to say that they observed some flips in DRAM with ChipKill, although the paper is a bit vague here.

[1] B. Schroeder et al, DRAM Errors in the Wild: A Large-Scale Field Study http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

rajesh-s 5 years ago | | |

Right! Section 1.3 of this publication discusses possible mitigations for the row hammer problem and where ECC fits in

https://users.ece.cmu.edu/~omutlu/pub/rowhammer-summary.pdf

amelius 5 years ago |

Does Apple use ECC in its M1 laptop?

dijit 5 years ago | |

No. It uses a unified package of LPDDR4x SDRAM

my123 5 years ago | | |

LPDDR4X systems with ECC exist, but it indeed looks like Apple M1 systems aren't one...

graeme 5 years ago | |

This is my one worry. I have an imac pro and anecdotally it has been a LOT more reliable than my old macbook pro. The imac pro has ecc.

alexwillner 5 years ago | |

At least some kernel log messages imply that the M1 might support ECC:

https://eclecticlight.co/2020/12/09/what-happens-when-an-m1-...

greyhair 5 years ago |

ECC is required on mission critical hardware.

I have spent 36 years fielding embedded devices in core network (D1/E1, SONET, ROADM/MPLS, Cellular basestation) and I will tell you that large ECC covered memory arrays always show small numbers of correctable error events over the course of a year. I have seen, over the course of my career, exactly one controller card replaced early in the field, because it started throwing excessive recoverable ECC events over time, until it hit a threshold of 10x the average of a typical board. On the order of ten recoverable ECC events per month instead of one event per month. I have never observed a logged non-correctable ECC event in the field. In the lab, yes, but never in fielded equipment.

If you are fine with your PC experiencing one or two bits flipped in memory every month, then you really don't need ECC. That is the question you need to answer.

For mission critical systems? ECC is a requirement.

willis936 5 years ago |

Whenever this topic comes up I wonder how much more resilient are CPU registers compared to DRAM.

MisterTea 5 years ago |

> ECC availability matters a lot - exactly because Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation.

The phrase that strikes me is "horribly bad market segmentation". I agree 100%.

Remember when the Pentium/pro/2/3 could operate in single and dual socket configurations with ECC? The same CPU that plugged into your low end consumer board could also plug into a high end server/workstation board. All you needed was the right motherboard.

_0ffh 5 years ago |

Please someone correct me if I'm wrong, but as far as I can remember memory with extra capacity for error detection used to be a rather common thing on early PCs. That really only changed a couple of decades in, in order to be able to offer lower prices to home users who didn't know or care about the difference. Probably about the time, or earlier, when with some hard disk manufacturers megabytes suddenly shrank to 10^6 bytes (before kibibytes or mebibytes were a thing, btw).

wmf 5 years ago | |

Yes, PCs used to use parity memory.

_0ffh 5 years ago | | |

That's the name I couldn't quite recover from my memory when I asked, exactly!

wicket 5 years ago |

Over the years, I don't think I've ever been able to explain to anyone that their memory error could have been caused by a cosmic ray without being laughed at.

mauri870 5 years ago |

In case the page is not loading, refer to the Wayback Machine [1] for a copy.

[1] https://web.archive.org/web/*/https://www.realworldtech.com/...

jhoechtl 5 years ago |

I definitely do not want Linus Torvalds yelling at me in that tone --- but reading his utterings is certainly entertaining.

aborsy 5 years ago |

For the average user, what’s the impact of bit flips in memory in practical terms?

I am not talking about servers dealing with critical data.

Suppose that I maintain a repository (documents, audio and video), one copy in a ZFS-ECC system and one in an ext4-nonECC system.

Would I notice a difference between these two copies after 5-10 years?

That tells us if ECC matters for most people.

theevilsharpie 5 years ago | |

> For the average user, what’s the impact of bit flips in memory in practical terms?

The most likely impact (other than nothing, if bits are flipped in unused memory) is program crashes or system lock-ups for no apparent reason.

throwaway9870 5 years ago | |

This isn't about disk storage, this is about DRAM. A bit flip in DRAM might corrupt data, but could also cause random crashes and system hangs. That generally matters to everyone.

arendtio 5 years ago |

It would be interesting to see how many more kernel oops appear on machines without ECC compared to those with ECC.

indolering 5 years ago |

My favorite example is a bit flip altering election results:

https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...

trissylegs 5 years ago |

When I chose my PC parts when Ryzen first came out I tried to get ECC parts. The RAM was obtainable, the problem was that no motherboards had ECC support at the time. I hope the situation has improved by the time I get my next motherboard/cpu upgrade.

elgfare 5 years ago |

For those out of the loop like me, ECC does indeed stand for error correcting code. https://en.m.wikipedia.org/wiki/ECC_memory

nix23 5 years ago |

I always have that conversation when ZFS comes up. Some people think ZFS NEEDS ECC, but in fact ZFS needs ECC as much as every other FS in Linux. And every single reliable machine needs ECC.

ratiolat 5 years ago |

I have: Asus PRIME A520M-K motherboard, 2x M391A2K43DB1-CVF (Samsung 16GiB ECC unbuffered RAM), AMD Ryzen 5 3600.

I specifically was looking for bang for buck, low(er) wattage and ECC.

IanCutress 5 years ago | |

Those AMD motherboards with consumer CPUs are a bit iffy. They run ECC memory, but it's hard to tell if it is running in ECC mode. Even some of the tools that identify ECC is running will say it is, even when it isn't, because the motherboard will report it is, even when it isn't. ECC isn't a qualified metric on the consumer boards, hence all the confusion.

unixhero 5 years ago |

Fantastic burn by Linus Torvalds, who also had some skin in the CPU game.

Offtopic, I wonder if he trawls that site regularly. And eventually I wonder, is he here also? :)

Noxmiles 5 years ago |

I was reading it and thought: wow, this guy is absolutely right! Great things he's talking about. After reading it, I saw it was Linus Torvalds :D

raghavtoshniwal 5 years ago |

Once trained a GPT2 model to do text-gen on Linus’ emails. Boy, there were some choice angry rants and nonsensical technical jargon that was generated.

z3t4 5 years ago |

Memory often comes with lifetime guarantees. If they had ECC it would be much easier to detect bad memory...

JumpCrisscross 5 years ago |

What is the status of ECC on Macs?

CalChris 5 years ago | |

The iMac Pro, which has a Xeon W, has it. There's a good chance that will go away with the new Apple Silicon iMac Pro due out this year. MacRumors roundup article doesn't mention ECC.

https://www.macrumors.com/roundup/imac/

qwerty456127 5 years ago |

ECC should be everywhere. It seems outrageous to me almost no laptops have ECC.

belzebalex 5 years ago |

Asked myself, would it be possible to build a Geiger counter with RAM?

rafaelturk 5 years ago |

Little bit offtopic: Again seems that Intel? what?! is the one lowering the bar.

b0rsuk 5 years ago |

I browsed some online listings for ECC memory modules, and they seem to be sold one module at a time. Standard DDR4 modules are sold in pairs, to benefit from dual channel mode.

Does ECC memory support dual channel??

srtjstjsj 5 years ago |

I guess Linus's recent project to communicate more respectfully didn't pan out.

musingsole 5 years ago |

It's a shame we don't have ECC for individuals. How many of society's bugs come from someone wandering around with a bit flipped?

rahimiali 5 years ago |

I have trouble parsing information from this rant. Is someone willing to translate this into an argument (a string of facts tied by logical steps)?

mark-r 5 years ago | |

1. Linux sometimes has crashes, not due to software errors but because of memory glitches.

2. ECC would prevent memory glitches.

3. ECC is hard to find on desktop PCs because Intel uses the feature to differentiate desktop CPUs from server CPUs, so it can charge more for servers.

4. Even when someone like AMD makes the feature available, the market doesn't have ECC DRAM modules or motherboards readily available because Intel killed the demand for it.

wagslane 5 years ago |

It really does. I did a write-up recently on it as I was diving in and understanding the benefits: https://qvault.io/2020/09/17/very-basic-intro-to-elliptic-cu...

avianes 5 years ago | |

Be careful not to confuse ECC memory with ECC encryption.

ECC memory = memory with Error-Correcting Code

ECC encryption = Elliptic Curve Cryptography

sally1620 5 years ago |

Linus is accusing Intel of killing ECC intentionally. But that is not really the case, they just wanted people to pay up.

If you care about ECC, you pay for Xeon. The majority of consumers don't run critical applications on their devices, so they are happy with a cheap device that may crash once in a while.

AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.

Honestly, what is the point of bullet-proof hardware when the software reliability (at least on consumer devices) has gone down to two nines.

Dylan16807 5 years ago | |

Intel had to kill consumer ECC as part of making it a feature that people can "pay up" for. That's very intentional.

> AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.

You are correct to call them a corporation. AMD is not your friend, but they are the good actor in this fight.

fomine3 5 years ago | |

ECC isn't enough to be bulletproof, but it improves reliability for parts that are well known to be relatively unreliable. The extra theoretical cost of ECC should be acceptable to most computer users. It also helps in developing cheaper RAM technology (see what happened with SSDs).

sys_64738 5 years ago |

ECC memory is predominantly used in servers where failure absolutely must be identified and logged. It is used in the desktop market to a lesser extent, due to the lack of mission critical tasks being run there.

dijit 5 years ago | |

There are situations though, where you’re working on a document and the document's “save” format is a memory dump. Corruption for things of that type (Adobe RAW for example) would remove data.

It might present itself as a 1pixel colour difference, but it could be more damaging (incorrect finances, in accounting software for example). Software trusts memory; but memory can lie.

That’s dangerous.

MaxBarraclough 5 years ago | | |

That's an interesting point. In an extreme case, an order or money transfer might be placed for an incorrect quantity, or to an incorrect recipient.

projektfu 5 years ago | | |

Perhaps consumer-grade software that needs guarantees of correctness should be using error correction in software. For example, database records for financial software, DNS, e-mail addresses, etc.

jkbbwr 5 years ago | | |

To be fair, if your save mechanism is just a straight memory dump with no checksums and validation, you have bigger issues.
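
As a rough illustration of what "checksums and validation" on a save file can look like, here is a Python sketch that appends a CRC32 to a payload and verifies it on load. `seal` and `unseal` are hypothetical names, and the 4-byte little-endian trailer is an arbitrary layout chosen for the example, not any real file format.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Return payload followed by its CRC32 as 4 little-endian bytes."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unseal(blob: bytes) -> bytes:
    """Verify the trailing CRC32 and return the payload.

    Raises ValueError if the blob is too short or the checksum
    does not match.
    """
    if len(blob) < 4:
        raise ValueError("blob too short to contain a checksum")
    payload = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data is corrupted")
    return payload
```

This only detects corruption that happens after `seal` is called; it cannot correct anything, and a bit that flipped in memory before the checksum was computed gets sealed in as if it were valid data. That gap is the part only ECC covers.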

sys_64738 5 years ago | | |

Those corner cases might occur rarely but are probably inconsequential given rate of occurrence versus rate of criticalness - it probably doesn't justify the markup for most. In a data center you're processing millions of transactions per minute so occurrence is much more impactful.

  mc0: 0 Uncorrected Errors with no DIMM info
  mc0: 0 Corrected Errors with no DIMM info
  mc0: csrow2: 0 Uncorrected Errors
  mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
  mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
  mc0: csrow3: 0 Uncorrected Errors
  mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
  mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
  edac-util: No errors to report.
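
The same counters can usually be read without `edac-util`. The Python sketch below assumes the Linux EDAC sysfs layout (`/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`); `read_edac_counts` is a name invented for this example, and the root path is a parameter so it can be pointed elsewhere.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    ce_count is corrected errors, ue_count is uncorrected errors.
    An empty dict means no memory controller directory was found,
    for example because no EDAC driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts
```

Zero counts only mean that no errors have been reported; on their own they do not show that ECC is actually active, and an empty result suggests the kernel is not reporting memory errors at all.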