ECC matters(realworldtech.com) |
ECC matters(realworldtech.com) |
Does ECC memory support dual channel??
ECC memory = memory with Error-Correcting Code
ECC encryption = Elliptic Curve Cryptography
If you care about ECC, you pay for Xeon. Majority of consumers don't run critical applications on their devices, so they are happy with a cheap device that may crash once in a while.
AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
Honestly, what is the point of bullet-proof hardware when the software reliability (at least on consumer devices) has gone down to two nines.
> AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
You are correct to call them a corporation. AMD is not your friend, but they are the good actor in this fight.
It might present itself as a 1pixel colour difference, but it could be more damaging (incorrect finances, in accounting software for example). Software trusts memory; but memory can lie.
That’s dangerous.
If my data is fine being corrupted to save 12.5% on RAM costs, then why am I even bothering processing the data? Apparently it's worthless.
People today weigh the cost of maybe 16 vs 32GB on a mid-tier desktop. ~doubling the cost for twice the RAM. Yes, paying 12.5% more for ECC RAM is a no-brainer.
Or after it's been checked. (time-of-check vs time-of-use)
If you want to detect a bit flip, use parity.
I guess in theory some software could produce signed data in CPU cache, and "commit" it to RAM as a verified block.
But the overhead would be enormous. Would you slow down your CPU by half in order to not pay 12.5% more for RAM?
Hmm, I wonder what SGX and similar do about this.
If in doubt, get ECC. Do your own research on how it works and why. This post won’t explain it, just will blame Intel (probably rightfully so).
> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.
It might be false, but I think it's a reasonable assumption.
The industry has convinced the average user of consumer hardware that PPA (Power,Performance,Area) is all that needs to get better with generational improvements. Hoping that the concerning aspects of security and reliability that have come to light in the recent past changes this.
Naively, I can understand why error reporting has dependencies on other parts of the system, but it would seem possible for error correction to work transparently.
Modern CPUs have integrated memory controllers, so that's why the CPU needs to support it.
Correction without reporting isn't great; anyway, you need a reporting mechanism for uncorrectable errors, or all you've done is ensure any memory errors you do experience are worse.
This is in line with all technical parameters of DRAM: everything must be as cheap as possible, and all the difficult parts are moved to the memory controller.
Which is the right thing to do, because you can share one memory controller with multiple DRAM chips.
I would think the only guaranteed solutions to Rowhammer are actually cryptographic digests and/or guard pages.
[1] https://www.zdnet.com/article/rowhammer-attacks-can-now-bypa...
However, flipped three bits simultaneously isn't trivial, and the attempts that flip fewer bits will be detected and logged.
[1] B. Schroeder et al, DRAM Errors in the Wild: A Large-Scale Field Study http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
https://eclecticlight.co/2020/12/09/what-happens-when-an-m1-...
I have spent 36 years fielding embedded devices in core network (D1/E1, SONET, ROADM/MPLS, Cellular basestation) and I will tell you that large ECC covered memory arrays always show small numbers of correctable error events over the course of a year. I have seen, over the course of my career, exactly one controller card replaced early in the field, because it started throwing excessive recoverable ECC events over time, until it hit a threshold of 10x the average of a typical board. On the order of ten recoverable ECC events per month instead of one event per month. I have never observed a logged non-correctable ECC event in the field. In the lab, yes, but never in fielded equipment.
If you are fine with your PC experiencing one or two bits flipped in memory every month, then you really don't need ECC. That is the question you need to answer.
For mission critical systems? ECC is a requirement.
The phrase that strikes me is "horribly bad market segmentation". I agree 100%.
Remember when the Pentium/pro/2/3 could operate in single and dual socket configurations with ECC? The same CPU that plugged into your low end consumer board could also plug into a high end server/workstation board. All you needed was the right motherboard.
[1] https://web.archive.org/web/*/https://www.realworldtech.com/...
I am not talking about servers dealing with critical data.
Suppose that I maintain a repository (documents, audio and video), one copy in a ZFS-ECC system and one in an ext4-nonECC system.
Would I notice a difference between these two copies after 5-10 years?
That tells us if ECC matters for most people.
The most likely impact (other than nothing, if bits are flipped in unused memory) is program crashes or system lock-ups for no apparent reason.
https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...
I specifically was looking for bang for buck, low(er) wattage and ECC.
Offtopic, I wonder if he trawls that site regularly. And eventually I wonder, is he here also? :)
[1] J. S. Kim et al, Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques https://arxiv.org/abs/2005.13121
I can understand Linus's frustration from that point of view: without ECC RAM when you get some super weird crash report where some pointer got corrupted for no apparent reason you can't be sure if it's was just a random bitflip or if it's actually hiding a bigger problem.
It's the "once every other year" type of bitflip that's the problem. The proverbial "cosmic ray" hitting your DRAM and flipping a bit. That will be caught by ECC but it'll most likely remain a total mystery if it causes your non-ECC hardware to crash.
> A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10−11 error/bit·h) and 70,000 (7.0 × 10−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year
Then one of them brought in his personal Geiger counter and found the radiation coming off the steel in that rack case was significantly higher than background.
You may never know when the metal you use was recycled from something used to hold radioactive materials.
Hi alphabet lawyers.
They argued that 'Google' has now become a verb meaning 'to search the Internet for' and as such alphabet should have the name taken away.
Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.
Did they ( Google ) or He ( Craig Silverstein ) ever officially admit it on record? I did a Google search and results that came up were all on HN. Did they at least make a few PR pieces saying that they are using ECC memory now because I dont see any with searching. Admitting they made a mistake without officially saying it?
I mean the whole world of Server or computer might not need ECC insanity was started entirely because of Google [1] [2] with news and articles published even in the early 00s [3]. And after that it has spread like wildfire and became a common accepted fact that even Google doesn't need ECC. Just like Apple were using custom ARM instruction to achieve their fast JS VM performance became a "fact". ( For the last time, no they didn't ). And proponents of ECC memory has been fighting this misinformation like mad for decades. To the point giving up and only rant about every now and then. [3]
[1] https://blog.codinghorror.com/building-a-computer-the-google...
https://semiengineering.com/what-designers-need-to-know-abou...
There has been a lot of debate regarding this that was summarised in this post -
On-die ECC is going to be a standard feature for DDR5. I'm not aware of any indication that anyone has implemented on-die ECC for DDR4 DRAM, and Hynix at least has made clear statements that on-die ECC is new for their DDR5 and was not present in their DDR4.
There are many various DRAMs in a server (say, for disk cache). Has Google or anyone who operates at a similar scale seen single bit errors in these components?
When America Online was buying EV6 servers as fast as DEC could produce them, they used to see about about 1 double bit error per day across their server farm that would reboot the whole machine.
DRAM has only gotten worse--not better.
Bit flips (for all reasons) occur in buses, registers, caches, etc. Anything that has state can have state changed incorrectly.
This is why filesystems like ZFS exist and storage formats have pervasive checksums.
https://www.newyorker.com/magazine/2018/12/10/the-friendship...
Edit: found one with a quick search. https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...
And https://www.researchgate.net/publication/262273269_Bitsquatt...
Random bit flips happen on client machines and on routers.
If there are enough requests for a domain name, some of those requests will be subject to one one of those bit-flips.
“I’ve heard of defensive programming, but never adversarial memory.” — Ben Gomes
> I've never thought of defensive programming in terms of adversarial memory.
Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC ram is the opposite. You just make it so unlikely to happen, that you will generally not encounter these errors at scale anymore, but nontheless they can still happen and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and move it under the rug.
Another intersting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
Using consumer hardware and making up reliability with redundancy and software was not a bad idea for early Google but it did end up with an unforeseen cost. Just a thousand machines in a cosmic ray proof bunker will end up with memory errors ECC will correct for free. It's just reducing the surface area of "potential problems".
1. Single bitflip correction along with Google's metrics could help them identify algorithms they've got, customer's VMs that are causing bitflips via rowhammer and machines which have errors regardless of workload
2. Double bitflip detection lets Google decide if they say, want to panic at that point and take the machine out of service, and they can report on what software was running or why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.
The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?
I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.
There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running to demonstrate it's resilience to random bugs. Quite fitting that this discussion was initiated by Linus!
Although it was intended to protect against bad software I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server" which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR Userland software could in theory also benefit provided it was written in such a way to be able to continue gracefully on reloading.
The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.
It’s so easy to chalk these kind of errors to other issues, a little corruption here, a running program goes bezerk there- could be a buggy program or a little accidental memory overwrite. Reboot will fix it.
But I ran many thousands of physical machines, petabytes of RAM, I tracked memory flip errors and they were _common_; common even in: less dense memory, in thick metal enclosures surrounded by mesh. Where density and shielding impacts bitflips a lot.
My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.
Of all the things to be worried about, like OS bugs, bad hardware configuration, etc. bad memory is one of those really troubling things. You look at the code and say "it's can't make it here, because this was set" but when you can't trust your memory you can't trust anything.
And as the timeline goes to infinity, you may also get one of these reports and be asked to fix it... good luck.
https://www.anandtech.com/show/15912/ddr5-specification-rele...
But these days with the RAM density being so high and bitflipping attacks being more than a theoretical threat it seems like there's really no good reason not to switch to ECC everywhere.
Because ECC means Error Correcting Code, by definition, any board that claims ECC support must actually correct the errors. The ECC codes used now, with 8 extra bits for each 64 data bits, correct any 1-bit error and detect any 2-bit errors.
Very old computers (25 years old, or more) used parity instead of ECC and they just detected any 1-bit error (and any errors with an odd number of flipped bits), without being able to correct the errors.
It's faster too.
Not all CPUs support ECC however.
But i don‘t know how relevant these metrics from 2009 are. Did memory got better or worse compared to 2009 for bit flips?
When I first tried to replicate the row hammer attack I was not getting any results. Turns out I was doing this on ECC. On non ECC memory the same test easily replicated the row hammer attack.
Its.
There, I finally corrected Linus Torvalds in something. :))
(Note: EdDSA is still much much better than ECDSA, most notably because it's easier to implement correctly.)
Support was only interested if their built-in memory tester, which even on it's most thorough, would only run for ~3 hours, would show errors, which it wouldn't. IIRC, the BMC was logging "correctable memory errors", but I may be misremembering that.
"We've run this test on every server we've gotten from you, including several others that were exactly the same config as this, this is the only one that's ever thrown errors". Usually support is really great, but they really didn't care in this case.
We finally contacted sales. "Uh, how long do we have to return this server for a refund?" All of a sudden support was willing to ship us out a replacement memory module (memtest86 identified which slot was having the problem), which resolved the problem.
They were all too willing to have us go to production relying on ECC to handle the memory error.
Good call in not accepting this. Even ignoring the possibility you have a double-bit error that causes a crash, or a triple-bit error that maybe can't be detected, frequent ECC errors are problematic. I've encountered machines that consistently ran my software horribly slowly. I don't remember specifics, but let's say at least 100X latency of other machines for similar operations. When I dug in, I found these machines had a huge amount of correctable memory errors. The correction apparently degrades performance significantly. I'm not sure exactly why, but I guess there's an MCE trap to report the memory error, and perhaps that path is slow.
But you’ve got it backwards about the incentives. A manufacturer has less incentive to deliberately ship a defective part in the case of ECC modules. If the modules consistently log ECC errors, they can easily be identified and returned under warranty to the manufacturer. A consumer is much less likely to identify an intermittent problem with a non-ECC part.
https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...
It would be a lot slower than real ECC but it could just be used for operations that would be especially vulnerable to bit flips. It would also not know for certain if the memory segment of data or the memory segment holding the checksum was corrupted besides their relative sizes (checksum is much smaller so more unlikely to have had a bit flip in it's memory region).
There's also the obvious tactic of just storing every logical 64-bit word as 128 bits of physical memory, which gives you room for all kinds of crap[1], at the expense of halving your effective memory and memory bandwidth.
0: This is extremely cheap since you're loading a 64- vs 128-bit value, with no extra round trip time and still fits in a cache line, so you're likely just paying extra memory use from larger page tables.
1: Offhand, I think you could fit triple or even quadruple error correction into that kind of space (there's room for eight layers of SECDED, but I don't remember how well bit-level ECC scales).
Bit flips happen and are real. I really wish ECC was plentiful and not brutally expensive!
Finding a memory upgrade seems difficult though.
It is filled to the gunwales with ECC RAM.
Cost him the equivalent of $7k or so. Eeek.
0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/Thi...
Xeon with ECC are not that overpriced compared with similar Core without. Likewise, RAM sticks with ECC are cheap to produce (basically just one more chip to populate per side per module). Likewise soldered RAM would simply add maybe $10 or $20 of extra chips.
I'm guessing you won't find any.
Now back to ECC, I'll probably be corrected, but I don't think ECC helps gain more than two order of magnitudes, so we still need incredibly reliable RAM. If we move to ECC RAM by default everywhere, aren't we simply going to get less reliable RAM at the end?
As far as we know, our computer has never had an undetected error. -- Weisert
If it needs ECC memory to do that, then fit it with ECC memory. If there are other ways to achieve that (for example deeper dram cells to be more robust to cosmic rays) that's fine too.
Just meet the reliability spec - I don't care how.
That's why I've always been on the fence with this ECC thing. For servers it's vital because you need stability and security.
For desktops I think that for a long time it was fine without ECC. If I have to chose between having, say, 30% more RAM or avoid a potential crash once a year, I'll probably take the additional RAM.
The problem is that now these problem can be exploited by malicious code instead of just merely happening because of cosmic rays. That's the main argument in favour of ECC IMO, the rest is just a tradeoff to consider.
And that's probably not what GP asked for. There's a difference between guaranteeing an error rate of 1 error per century of use on average, and guaranteeing it over the course of an actual century. It might be okay to guarantee that error rate for only 5 years of uninterrupted use, and degrade after that. For instance:
Years 1- 5: 1 error per century.
Years 6-10: 3 errors per century.
Years 10-15: 10 errors per century.
Years 15-20: 20 errors per century.
Years 20-30: 1 error per *year*.
Years 30+ : the chip is broken.
Now, given how energy hungry and polluting the whole computer industry actually is, it might be a good idea to shoot for extreme durability and reliability anyway. Say, sustain 1 error per century, over the course of fifty years. It will be slower and more expensive, but at least it won't burn the planet as fast as our current electronics.Basically you can register domains using small bit differences for domains and start getting email and such for that domain
If I recall correctly the example given was a variation of microsoft.com
All because so much equipment doesn't use ECC
At Google even with ECC everywhere there wasn't enough systematic error detection and correction to prevent the global database of monitoring metrics from filling up with garbage. /rpc/server/count was supposed to exist but also in there would be /lpc/server/count and /rpc/sdrver/count and every other thing. Reminded me daily of the terrors of flipped bits.
I run some large ML models in my home PC and I get NaN's and some out of range floats every month or so. I have spent hours debugging but doing the same computation with the same random seeds does not recreate the problem.
How about GPU's and their GDDR SDRAM? Do they have parity bits?
[1]: https://twitter.com/catfish_man/status/1335373029245775872?l...
If you think it doesn't matter: how do you know? If you don't run with ECC memory, you'll never know if memory was corrupted (and recovered).
That blue screen, that sudden reboot, that program crashing. That corrupted picture of your kid.
Who knows.
I'll tell you, who knows. God damn every sysadmin (or the modern equivalent) can tell you how often they get ECC errors. And at even a small scale you'll encounter them. I have, on servers and even on an SAN Storage controller, for crying out loud.
If you care about your data, use ECC memory in your computers.
I would have bought computers when I "wanted one". Now I buy them when I need one. Because buying a non-ECC computer just feels like buying a defective product.
In the last 10 years I would have bought TWICE as many computers if they hadn't segmented their market.
Fuck intel. I sense that Linus self-censored himself in this post, and like me is even angrier than the text implies.
Price is not nice though.
I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. I can't get that, at all - even without the other fancy things I would like such as a 4k OLED with pen/touchscreen.
In 2020, even the Dell XPS stopped shipping OLED (goodbye dear 7390!)
I will gladly give my money to anyone who sells AMD laptop with ECC. Hopefully, it will show there's demand for "high end yet non bulky laptops"
I hope AMD will create a better market for the ECC laptop memory (right now it's hard to find + expensive).
Unfortunately, Lenovo is not selling the P53 anymore, which is exactly why I say I can't get that even in a "bulky" version.
I love that AMD doesn't intentionally break ECC on its consumer desktop platforms and upgraded to the Threadripper in 2017.
We only use Xeons on developer desktops and production machines here precisely because of ECC. It's about 1 bit flip/month/gigabyte. That's too much risk when doing something critical for a client.
I’ve always believed that, ECC aside, DRAM made intentionally with big cells would be less prone to spurious bit-flips (and that this is one of the things NASA means when they talk about “radiation hardening” a computer: sourcing memory with ungodly-large DRAM cells, willingly trading off lower memory capacity for higher per-cell level-shift activation-energy.)
If that’s true, then that would mean that the per-cell error rate would have actually been increasing over the years, as DRAM cell-size decreased, in the same way cell-size decrease and voltage-level tightening have increased error rate for flash memory. Combined with the fact that we just have N times more memory now, you’d think we’d be seeing a quadratic increase in faults compared to 40 years ago. But do we? It doesn’t seem like it.
I’ve also heard a counter-effect proposed, though: maybe there really are far more “raw” bit-flips going on — but far less of main memory is now in the causal chain for corrupting a workload than it used to be. In the 80s, on an 8-bit micro, POKEing any random address might wreck a program, since there’s only 64k addresses to POKE and most of the writable ones are in use for something critical. Today, most RAM is some sort of cache or buffer that’s going to be used once to produce some ephemeral IO effect (e.g. the compressed data for a video frame, that might decompress incorrectly, but only cause 16ms of glitchiness before the next frame comes along to paper over it); or, if it’s functional data, it’s part of a fault-tolerant component (e.g. a TCP packet, that’s going to checksum-fail when passed to the Ethernet controller and so not even be sent, causing the client to need to retry the request; or, even if accidentally checksums correctly, the server will choke on the malformed request, send an error... and the client will need to retry the request. One generic retry-on-exception handler around your net request, and you get memory fault-tolerance for free!)
If both effects are real, this would imply that regular PCs without ECC should still seem quite stable — but that it would be a far worse idea to run a non-ECC machine as a densely-packed multitenant VM hypervisor today (i.e. to tile main memory with OS kernels), than it would have been ~20 years ago when memory densities were lower. Can anyone attest to this?
(I’d just ask for actual numbers on whether per-cell per-second errors have increased over the years, but I don’t expect anyone has them.)
Think of the number of events that can flip a bit. If you make bits smaller, you get a modestly larger number of events in a given area capable of flipping a bit, spread across a larger number of bits in that area.
That is, it's flip event rate * memory die area, not flip event rate * number of memory bits.
In recent generations, I understand it's even been a bit paradoxical-- smaller geometries mean less of the die is actual memory bits, so you can actually end up with fewer flips from shrinking geometries.
And sure, your other effect is true: there's a whole lot fewer bitflips that "matter". Flip a bit in some framebuffer used in compositing somewhere-- and that's a lot of my memory-- and I don't care.
Otherwise, I would think that an unlikely event becoming 1000x more likely by sheer numbers would have warped your perception.
I believe that hardware reliability is mostly irrelevant, because software reliability is already far worse. It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail, what matters is that this failure is handled gracefully.
The libraries we maintain (1) are responsible for a non-trivial part of Facebook's overall compute footprint, (2) should basically never fail of their own accord, and (3) have pretty good error monitoring. So my team is operating what is effectively (among other things) a very sensitive detector for hardware failure.
And indeed we see examples all the time of blobs that fail to decompress, and usually when we dig in we find that the blob is only a single bit-flip away from a blob that decompresses successfully into a syntactically correct message. I can't share numbers, but, off the top of my head, I think it's the largest source of failures we see. It happens frequently enough that I wrote a tool to automate checking [0].
So yes. It happens. Pretty frequently, in the sense that if you're doing xillions of operations a day, a one-in-a-xillion failure happens all the time.
[0] https://github.com/facebook/zstd/tree/dev/contrib/diagnose_c...
Nevertheless, anyone who uses the computer for anything else besides games or movie watching, will greatly benefit from having ECC memory, because that is the only way to learn when the memory modules become defective.
Modern memories have a shorter lifetime than old memories and very frequently they begin to have bit errors from time to time long before breaking down completely.
Without ECC, you will become aware that a memory module is defective only when the computer crashes or no longer boots and severe data corruption in your files could have happened some months before that.
For myself, this was the most obvious reason why ECC was useful, because I was able in several cases to replace memory modules that began to have frequent correctable errors, after many years with little or no errors, without losing any precious data and without downtime.
> It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail
Except that a bitflip can go undetected. It may crash your software or system, but it also may simply leak errors into your data, which can be far more catastrophic.
I can't give my source, but its far higher than most people think. Just pay the money.
Looks like `mcelog --client` might be a starting place? Feed that into your metrics pipeline and alert on it like anything else...
Screen, Wi-Fi, and to a much lesser extent (unless under load) the CPU are the most major culprits of low battery life.
See "Your computer is broken". They essentially inserted a stress test into the game that verified if the hardware was still doing calculations correctly, and if not, inform the user.
It is incomprehensible that there are still NAS devices being sold without ECC support.
Synology took a step in the right direction to offer prosumer devices with ECC but it is not really advertised as such. It is actually difficult to find which do have ECC and which ones don't.
I just look it up because if it was true it would have been news to me. Synology have been known to be stingy with Hardware Spec. But none of what I called Prosumer, the Plus Series have ECC memory by default. And there are "Value" and "J" Series below that.
Edit: Only two model from the new xx21 series using AMD Ryzen V has ECC memory by default.
- Edit -
Also, bit flips in the non-ECC memory are _the_ cause of the "bitrot" phenomenon. That is when you write out X to a storage device, but you get Y when you read it back. A common explanation is that the corruption happens _at rest_. However all drives from the last 30+ years have FEC support, so in reality the only way a bit rot can happen is if the data is damaged _in transit_, while in RAM, on the way to/from the storage media.
So, if you ever decide if to get an ECC RAM, get it. It's very much worth it.
Problems can definitely happen in the IO controller, RAID controller, cable, and disk controller. AFAIK all of these were seen and motivations for the existence of ZFS. One of their biggest insistence was that drives are universally lying bastards and should not be trusted any further than they can be thrown.
https://media-www.micron.com/-/media/client/global/documents...
When the value add feature becomes a necessity, it’s not a value add any more.
It seems redundant to have every module come with its own checking hardware.
For memory controller, parity/ECC/chipkill/RAIM usually involved simply adding additional memory planes to store correction data. I believe the rare exceptions are fully buffered memories where you have effectively separate memory controller on each module (or add-in card with DIMMs)
And now you have 8 bits of ecc per 32 data versus older DDR having 8 bits of ecc per 64 data. Hence the cost for dimm-wide ecc is going up.
Now the point about internally doing ECC is an interesting one, could be a way out of this mess. And apparently ECC is more available in AMD land
The first time I wrote "your" instead of "you're" in English I thought it was quite a milestone!
Yes it is. The problem is they dont really advertise it. I'm not certain but it might even be standard on AMD chips, but if they dont say so and board makers are also unclear, who knows...
Not if you're using a typical 72-bit SECDED code[0].
You have two error indicators: a summary parity bit (even number of errors: 0,2,etc vs odd number of errors: 1,etc), and a error index: 0 for no errors, or the bitwise xor of the locations each bit error.
For a triple error at bits a,b, and c, you'll have summary parity of 1 (odd number of errors, assumed to be 1), and a error index of a^b^c, in the range 0..127, of which 0..71[1] (56.25%, a clear albeit not overwhelming majority) will correspond to legitimate single-bit errors.
0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_wit...
1: or 72 out of 128 anyway; the active bits might not all be assigned contiguous indexes starting from zero, but it doesn't change the probability and it's simpler to analyse if summary is bit 0 and index bit i is substrate bit 2^i.
However I don't remember if there are provisions for ECC checking in case there are some dedicated refresh commands. I hope so, but I'm not sure.
So I'd say ECC is not only important but insanely impactful. There's a reason why many organizations don't even want to hear about getting rigs with non-ECC memory.
I understand altitude has some kind of proportionality to cosmic ray exposure, and number of bits will multiply the probability of an error.. I'm presuming there is also an inherent error rate to DRAM separate from environment. But what are those numbers.
"33 to 600 days to get a 96% chance of getting a bit error." Still, it seems way too high. I guess anyone with ECC RAM could confirm that they are getting those sort of recovered error rates?
My point is, when you say there is a "96% chance of having an error in THREE DAYS", one would EXPECT to be having issues like.. all the time? So I'm not disagreeing with you, but with the amount of non-ECC machines all over the world and how insanely stable modern machines are, it still seems like a very low risk.
Now of course I agree that if you want to take every precaution, go ECC, but simple observation prove that this "problem" can't be as bad as the numbers are saying.
But.. in all my time operating servers over 3 decades, it's always been bad drivers, bad code and problematic hardware that's caused most of my headaches.
Have i seen ECC error correction in logs? yeah.. I don't advocate against it but, i've found for most people you design around multiple failure scenarios more than you design around preventing specific ones.
Take the average web app - you run it on 10 commodity systems and distribute the load.. if one crashes, so what. Chances are, a node will crash for many more reasons other than memory issues.
If you have an app that requires massive amounts of ram or you do put all of your begs in one basket, then ECC makes sense...
I just know i like going horizontal and I avoid vertical monoliths.
Crashes might not matter, but silent data corruption does. The owner/user of that data will care when they eventually discover that it at some point mysteriously got corrupted.
How do you know?
The real killer is data corruption. Houw would you even begin to know that data is corrupted until it is too late?
It’s a tradeoff between money/performance and the frequency of crashes, corruption etc.
Bit rot is just one of many threats to my data. Backups take care of that as well as other threats like theft, fire, accidental deletion.
This is similar to my reasoning around the recent side channel attacks on intel CPUs. If I had a choice I’d like to run with max performance without the security fixes even though it would be less secure. Not because I don’t care about security but because 1% or 5% perf is a lot and I’d rather simply avoid doing anything security critical on the machine entirely than take that hit.
No, that's the big mistake people make: backups just backup bit-rotted data, until it is too late and the last good version is rotated out and lost forever.
On-die ECC DRAM is already here with us.
As long as we're swapping war stories there was once a frontend server at that company which died of bit flip related causes but just before it did it managed to charge some product group for 2^63 bytes worth of network traffic in the internal quota system, which set off every budgetary alarm that capacity planners had.
https://ark.intel.com/content/www/us/en/ark/compare.html?pro...
https://ark.intel.com/content/www/us/en/ark/products/208074/...
Problem is this processor is an Embedded processor so probably not for us
> Industrial Extended Temp, Embedded Broad Market Extended Temp
My understanding is Intel does not support ECC on the desktop unless you pay extra.
https://ark.intel.com/content/www/us/en/ark/products/199280/...
0: https://ark.intel.com/content/www/us/en/ark/products/199281/...
1: https://ark.intel.com/content/www/us/en/ark/products/134886/...
https://media-www.micron.com/-/media/client/global/documents...
The ambiguity makes it hard for me to say. "It's the best" vs "its is the best"
It doesn't feel right dropping the is
That's Intel's PR. Only "enterprise hardware", with a bigger markup, supports ECC memory. Adding ECC today should add only 12% to memory cost.
AMD decided to break Intel's pricing model. Good for them. Now if we can get ECC at the retail level...
The original IBM PC AT had parity in memory.
I don't know if AMD really intended to break Intel's pricing model. Their higher end Ryzen chips you'd use in servers and capital W Workstations don't seem to have a huge price difference from equivalent Xeons. Even if they're a bit cheaper you still need a motherboard that supports ECC so it seems at first glance to be a wash as far as price.
That being said if I was putting together a machine today it would be Ryzen-based with ECC.
You can actually, most AMD consumer chips (except the ones with integrated graphics) have ECC support, even though it's not officially supported. See this Reddit thread for more details: https://www.reddit.com/r/Amd/comments/ggmyyg/an_overview_of_...
[0] https://www.asrock.com/mb/amd/a320m-hdv%20r4.0/index.asp#Spe...
This is the legacy of Intel's policies.
Hello user, for an extra 10 dollars, would you like to guarantee that cosmic radiation never affects your computing experience, including having to reformat, reinstall, or otherwise reboot your phone/computer randomly when one day something just doesn't work anymore?
Was it an update? Was it cosmic radiation? Was it a bad capacitor? Who knows!
Why do you think so many problems are solved by rebooting? Sure 99 out of 100 might be software bugs, but the other 1 out of 100 is straight up cosmic radiation, and that 1 percent is growing more and more every year as software becomes more robust and bug free with better tooling.
Actually, a,b,c are also sampled from the range of valid bit indexes (not uniformly on 0..127), so you might be able to pick a cardinality-72 subset of 0..127 such that random a^b^c is disproportionately likely to fall outside that subset (and thus get diagnosed as not a valid single-bit error correction). I don't know that any existing ECC implementations actually do that, though.
Edit: did some cursory testing and using indexes 0..71 actually catches only ~24.04% (86016/357840) of triple-bit errors, compared the theoretical 43.75% (156555/357840, I think?) from a random error index. So "doesn't change the probability" is just completely wrong given that a,b,c are randomly chosen from the 72 substrate bits, rather than from 128 possible 7-bit indexes.
Oddly enough, testing random selections of 72 valid indexes (eg 74773982'EBD0D35C'C5BEB2D8'C3FE9C5E, where set bits correspond to used indexes) actually gives slightly better results than theory (44.97% for that one, 44.65% is the lowest in the last dozen or so), which is somewhat interesting, but I still haven't found any bit assignment that gives better than 50% (178920/357840) catchment of triple bit errors.
LPDDR5 will enable some much needed level of error correction in a metric ton of other future SoC designs too. I look forward to the future Raspberry Pi with built in error correction capabilities.
Again, I don't advocate NOT using ECC, but i'd say in complex systems, never assume ECC alone is enough... and if ECC becomes your champion cause, how could you enforce it through every device that touches data, provides data, consumes data or injects data?
Most will escape your attention.
ECC is supported on most Ryzen models[1], as long as the motherboard supports it. In fact, ASUS and ASRock (possibly others) have Ryzen motherboards designed for workstation/server use where ECC support is specifically advertised.
[1] The only exception is the Ryzen CPUs with integrated graphics.
ECC is not disabled. It works, but not validated for our consumer client platform.
Validated means run it through server/workstation grade testing. For the first Ryzen processors, focused on the prosumer / gaming market, this feature is enabled and working but not validated by AMD. You should not have issues creating a whitebox homelab or NAS with ECC memory enabled.
https://old.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_crea...
ECC support not being "validated," for all practical purposes, simply means that board vendors can advertise a board lacking ECC support as compatible with AMD's AM4 platform, without getting a nasty letter from AMD's lawyers.
However, I use only computers with ECC, previously only Xeons, but in the last years I have replaced many of them with Ryzens, all of which work OK with ECC memory.
When having to choose between a very small risk of losing the price of a CPU and having to use for sure, during many years, an Intel CPU with half of the AMD speed, the choice was very obvious for me.
The latter is a big problem, one of the extreme-OC guys (Buildzoid) who interacts frequently with the OEMs (as he is pushing their stuff to the limit and he frequently needs their help) has commented that AMD has a really bad problem with their BIOS teams. The AGESA firmware (the low-level code that the processor actually runs) is buggy as all hell at a firmware level and the OEMs are forced to patch around it in BIOS, but the AGESA firmware also has a massive problem with code churn, so these BIOS fixups basically stop working all the time. And the driver teams at a lot of OEMs are literally one person, so there isn't enough staffing there to test everything all the time. Long and short of it is: stuff breaks in AMD BIOSs, constantly, and they don't notice it.
This is obviously a huge problem when ECC is not an officially supported feature, because it means nobody is testing it! You might update your BIOS (as you frequently have to do with AMD machines) and suddenly ECC stops working, it might be running ECC in non-ECC mode and no longer correcting errors. Or it might have screwed up reporting them to the OS.
The server/workstation boards are the only ones you should be trusting Ryzen with ECC usage on.
That's not true. There are Core i3, Atom, Celeron, and Pentium SKUs with ECC. E.g. the Core i3-9300
That's an extreme claim. Why do you say so?
But I'm not entirely convinced that most of these requests weren't just typos. Though those requests with mismatched Host header surely were true bitflips.
And I am beyond certain they were not typos; the requests were for long URLs that no one would be typing by hand. A few were for services that humans never enter urls into (like the Windows crash reporter).
I can't find it right now, but I also saw someone doing this with some version of Google Now (or what it was back then) and serving different assets into it.
And I mean, we all spend all day editing test messages and comments and files on non-ECC hardware, yet bitflip-induced corruption is rare enough that I can't say that I've witnessed a single instance of it in my life, despite spending a good chunk of it looking at screens.
It's just not a problem that occurs in practice in my experience. If you're compiling the release build of a critical piece of software, you probably want ECC. If you're building the dev version of your webapp or writing an email to your boss, you'll probably survive without it.
I couldn't tell, as a user, which if those corruptions and crashes were causes by bitflips. Could you?
- That bit is not stored in RAM but rather somewhere within CPU which can have different bit flip frequency characteristics.
- Even assuming this bit can be flipped, most computers wouldn't have ECC RAM inserted because well... the CPU doesn't support it.
- And even assuming you have ECC RAM inserted, this bit flipping while your computer is running will likely lead to CPU hang considering the CPU didn't fill in error correction bits and will lead CPU to think that pretty much every single byte is corrupt.
- And all in all, even if you did manage to turn on ECC bit, it would only work until the computer gets turned off.
This is, of course, an anecdote rather than data, but 0 is different enough from the expected 768 that it makes me doubt that statistic.
In other words, if a single defective DIMM somewhere in your deployment is causing catastraphic failure, your mistake was not buying the wrong RAM modules. Your mistake was relying on a single point of failure for mission critical data.
Google and read up - it is a problem, has killed people, has thrown election results, and much more.
It's such a common problem than bitsquatting is a real thing :)
Want to do an experiment? Pick a bitsquatted domain for a common site, and see how often you get hits.
As for the case of bitflips killing someone: Bitflips are not the root cause here. The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail. Bitflips are just one of many reasons for hardware failure.
So those systems didn't fail when a bitflip happened?
> The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail.
The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lower the odds via various forms of redundancy. (For context, I've written code for NASA, written a few proposals on making things more radiation hardened, and my PhD thesis was on a new class of error correcting codes - so I do know a little about making redundant software and hardware specifically designed to mitigate bitflips).
By claiming a bitflip didn't kick off the problems, and trying to push the cause elsewhere, you may as well blame all of engineering for making a device that can kill on failure.
So your argument is a red herring
>On the whole, you fail to make a case that preventing bitflips is the solution to a problem
Yes, had those bitflips been prevented, or not happened, those fatalities would not have happened.
>Ya, I'm not buying that biyflips are a problem.
If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
I guess you define something being a problem differently than I or the ECC ram industry do.
See also this comment above: https://news.ycombinator.com/item?id=25623764
Also, the challenge host left useful state (which bit was flipped) in registers before running teams' code, without this I'm not sure if it is even possible.
DDR5 includes on-die ECC, where the RAM fixes the errors before sending them over the memory bus.
This means if the bus between the processor and ram corrupts the bits-- tough luck, they're still corrupted. And it's unclear whether we're going to get the quality of memory error reporting that we're used to or get the desired halt-on-non-recoverable error behavior (I've not been able to obtain/read the DDR5 specification as yet).
I'm happy to see people here on HN respect the difficulty of learning languages. Most foreigners that speak Finnish do it very poorly at first and even after decades they still sound like foreigners. But it shows huge respect to our small country for someone to make the effort, and we really appreciate it. I'm hoping other people see learning their own mother tongue the same way. Sure, most of us need English, but learning it well is still a huge task.
For consumer motherboard OEMs, only AMD effectively has ECC support (Intel's has been so spotty and haphazard from product to product), and of AMD users, only a small number care about ECC.
So motherboard companies, being resource and time-starved as they are, don't make it a priority to address such a small user-base.
If Intel started shipping ECC on everything, it would go a long way towards shifting the market.
http://lambda-diode.com/opinion/ecc-memory#:~:text=A%20syste....
[edit]
Looks like the calculation was revised [0] after criticism:
> Under these assumptions, you'll have to wait about 33 to 600 days to get a 96% chance of getting a bit error.
What's more worrying is the variance, the above calculation is based on expected well behaved DRAM.. yet some computers just seem to have manufacturing defects that make the incidence of errors high enough to be a regular problem.
The state was quite helpful, yes–for x86 it seems like a "clean slate" shellcode would be quite difficult, if impossible, to achieve as we saw. However, I am left wondering how other ISAs would fare…perhaps worse, since x86 is notoriously dense. But maybe not? The fixed-width ones would probably be easy to try out, at least.
[1] https://developer.arm.com/docs/ddi0596/h/top-level-encodings...
Was any code dumped anywhere?
I found this which corroborates everything you're saying but provides no further details: https://www.cspensky.info/slides/defcon_27_shortman.pdf
Coverage of the finals is usually much less detailed, unfortunately, since the number of teams is much smaller and the challenges don't necessarily go up. However, https://oooverflow.io/dc-ctf-2020-quals/ has a couple more writeups linked from it; https://dttw.tech/posts/SJ40_7MNS#proof-by-exhaustion from PPP and http://www.secmem.org/blog/2019/08/19/Shellcoding-and-Bitfli... from SeoulPlusBadass.
It's because they have two different uses (three if you count nested quotes, but those aren't common and are pretty easy to figure out), contractions and possession, and they seemingly collide on words like "its" where you'd think it could mean either.
Not sure if you've already learned this (or if it helps), but English used to be declined, and its pronouns still are, e.g. they/their/them. That's why "its" isn't contracted; the possessive marker is already in the word.
I spent my preschool years in a multicultural environment and English was our lingua franca(ironically the school-mandated language was French), so I didn't properly learn contractions until grade school - same with similarly sounding words like "than vs then" and "your vs you're".
Doing quorum at the computer level would require synchronizing parallel computers, and unless that synchronization were to happen for each low level instruction, then it would have to be written into the software to take a vote at critical points. This is going to be greatly detrimental both to throughput and software complexity.
I guess you could implement the quorum at the CPU level... e.g. have redundant cores each with their own memory. But unless there was a need to protect against CPU cores themselves being unreliable, I don't see this making sense either.
At the end of the day, at some level, it will always come down to probabilities. "Software engineering principles" will never eliminate that.
There are a lot of seemingly high-level problems that are solved (ingeniously) in hardware with very simple, very low-level solutions.
My first employer out of Uni had an option for their primary product to use a NonStop for storage -- I think HP funded development, and I'm not sure we ever sold any licenses for it.
Generally, it would make the most sense to kill the process if the corrupted page is data, but if it's code, then maybe re-load that page from the executable file on non-volatile storage. (You might also be able to rescue some data pages from swap space this way.)
What would be interesting is if userspace could mark a region of memory as recomputable. If the kernel is notified of memory corruption there, it triggers a handler in the userspace process to rebuild the data. Granted, given the current state of hardware; I can't imagine that is anywhere near worth the effort to implement.
I believe there's already some support for things like this, but intended as a mechanism to gracefully handle memory pressure rather than corruption. Apple has a Purgeable Memory mechanism, but handled through higher-level interfaces rather than something like madvise().
It would definitely be great to have more reliable hardware generally available and at less of a price premium.
You do the comparison on multiple nodes too. Get the calculations. Pass them to multiple nodes, validate again and if it all matches, you use it.
Recursion, see recursion.
I'm too lazy to run the exact numbers right now, but with "4 GB, 96% percent chance, three days" as the hypothesis, I think you'll find that an experimental result of "8 GB, 0% chance, 14 days" is highly statistically significant.
Edit: rough back of napkin estimate - you're not seeing an event in roughly 10x trials (2x number of bits and ~5x number of days). Given hypothesis is true your experimental result has a probability of (1-0.96)^10 = very very small. Conclusion: hypothesis is false.
Closest you can come to a nuc with ecc is I think a mini server equipped with one of the four-core i3 parts that have ecc.
As for NUCs, I thought there were some Atom chips with ECC but not in NUC form factor ? Something shiny from Logic Supply might do then...
My desktop machine is basically a gaming rig with disposable data. Hence the “performance over integrity”.
I also never rotate anything out. Every version of everything is in the backups. Storage is that cheap these days.
...unlike all the accountant machines which are just standard desktop grade ones that have invoices entered from their keyboards.
>Hence the “performance over integrity”.
If anything ECC would be better for performance, as it'd allow to clock memory higher. It's a mystery how Intel has managed to convince people ECC = enterprise/server market. The real cost of ECC is 1/8 more memory (and datatraces on the motherboard)... and if anything, virtually all intel cpus have the needed support/transistors in the memory controller, it's just desktop variants have that part fused off on purpose.
The current status is that it mostly doesn't exist outside OEM parts. Searching an online retailer for ECC brings me hundreds of results for "non-ECC" RAM. When they occasionally have 1-2 products with ECC, they aren't as aggressively binned as the non-ECC sticks. They'll have higher latencies and lower clock frequencies. Basically: you'll need to accept gambling on being able to overclock your ECC RAM.
Binary bitflip resilience is really cool. The radiation-hardened-quine idea (https://codegolf.stackexchange.com/questions/57257/radiation..., https://github.com/mame/radiation-hardened-quine) is cool, but these source-based approaches depend on a perfectly functioning and rather large (Ruby, V8, whole browser) binary stack. A bitflip-protected hex monitor or kernel, on the other hand...
Even with a triple-redundant quorum mechanism, slightly further up that stack you're going to have some bit of code running that processes the three returned results - if the memory that's sitting on gets corrupted, you're back where you started.
One advantage of microkernels is that the "watcher" is so small that it could be run directly from ROM, instead of loaded into RAM. QNX has advocated that route for robotics and such in the past.
Minix may not be the best example of the type. While it is a microkernel, it's real world reliability has been poor in the past. More mature microkernel operating systems like QNX and OpenVMS are better examples.
Nitpick/clarification: it currently supervises the security posture, attestation state and overall health of several billion(?) Intel CPUs as the kernel used by the latest version of the Management Engine.
If ME is shut down completely apparently the CPU switches off within 20 minutes. Presumably this applies across the full uptime of the processor, and not just immediately after boot, and iff this is the case... percentage of Intel CPUs that randomly switch off === instability/unreliability of Minix in a tightly controlled industrial setting.
I used to occasionally boot into QNX on my desktop in college. It was a very responsive and stable system.
Hypervisors are, to a first approximation, microkernels with a hardware-like interface. All of this kernel bypass work being done by RDBMSes, ScyllaDB, HFTs, etc. is, to a first approximation, making a monolithic kernel act a bit like a microkernel.
You might be referring to the previous versions. Minix 3 is basically a different OS, it's more than an educational tool - in fact it's probably running inside your computer right now if you have an Intel CPU (it runs Intel's ME chip - for better or worse).
As far as bitflips are concerned, having the critical kernel code occupy fewer bits reduces the probability of a bitflip causing an irrecoverable error.
(I'll archaic brag a bit by mentioning I used to be a heavy user of Minix - my floppy images came in over an X25 network - and saw Andy Tanenbaum give his Minix 3 keynote at FOSDEM about a decade ago. I'm a big fan.)
Anyway, while reducing risk this way is laudable, and will improve your fleet's health, as per TFA it's a poor substitute, with bad economics and worse politics behind it, than simply stumping up for ECC.
I'll also note that, for example, Google's sitting on ~3 million servers so that ~4k LoC just blew out to 12,000,000,000 LoC -- and that's for the hypervisors only.
Multiply that out by ~50 to include VM's microkernels, and the amount of memory you've now got that is highly susceptible to undetected bit-flips is well into the mind-blowing range.
So your hypothetical dialogue will sound more like a scam to a regular user. You're charging money to tackle a problem they don't have or see, only addressing a single one of the root causes that trigger that same result, and in the end you're not even completely fixing it, just reducing the already infinitesimal odds it happens.
It will become mainstream when manufacturers just decides to include it everywhere and not really give the user a choice. Apple is a prime candidate for a company with enough clout to afford this.
"You're/your", "their/they're", "its/it's" and the like are a different story, because I do pronounce those the same and they're all very common.
I kinda disagree because while the homophony works in (spoken) English in written it stands as a sore thumb. So yeah you will make it if you only heard it but doesn't know the written form.
(And in their native language it's probably two unrelated words, so that might intensify the feeling of wrongness)
>2009 Google's paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM after my calculations. Paper says the same: "mean correctable error rates of 2000–6000 per GB per year".
https://stackoverflow.com/questions/2580933/cosmic-rays-what...
I believe bitflips are far more common than people realize. I see weird shit all the time from my users that is only explainable due to bizarre software bugs or bitflips.
Don't get me wrong, I'm all for ECC memory, I practically showed the value of ECC after intentionally running the same multi-day Matlab job several times on a non-ECC machine with different results just to prove this very point. But you use your deeper knowledge to assume what someone with shallow knowledge and interest wants or needs, and that will almost always be off mark. Your car does not have a roll cage. Normal computers do not have ECC memory.
I don't know the same argument could be made and ecc be swapped with surge protectors and people still purchase those.
As such features that are unattractive to the regular consumer go into workstation/enterprise offerings where the buyer understands what they're buying and why.
Citation needed.
I would bet that your typical ram-purchasing consumer is not seeing or even considering the existence of the ECC model.
> ECC realistically adds to the BOM (board, modules) more than LEDs do. So the price goes up with seemingly no benefit for the user.
LEDs are a great opportunity to increase profit margin, so I'm not sure about your price conclusions.
> Citation needed.
It really isn't. It was a hypothetical choice between 2 models, with ECC or LEDs, at the same price. Hypothetical because most boards don't offer the ECC support at all, and certainly not at the same price.
> LEDs are a great opportunity to increase profit margin, so I'm not sure about your price conclusions
You confused manufacturing costs, price of the product, and profit margins. LEDs cost far less to integrate than ECC but command a higher price premium (thus better profit margins) from the regular consumer. Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.
There's a lot of variables that go into RAM errors, including manufacturing quality and condition of the ram, the dimm, the dimm slot, the motherboard generally, the power supply, the wiring, and the temperature of all of those. Google was known for cost cutting in their servers, especially early on; so I wouldn't be surprised if some of that resulted in higher bitflip rate than running in commercially available servers. Things like running bare motherboards, supported only on the edges cause excess strain and can impact resistance and capacitance of traces on the board (and in extreme cases, break the traces).
No it doesn't. You're assuming an even distribution of errors, which is very much not the case.
Google found that the average number of errors is around that range, but they also found that only one third of their servers had any errors in a year.
I didn't say that. I'm saying that the root cause (as in "root cause analysis") is not the bitflip. Designating the bitflip as the root cause is like analyzing your drunk driving accident and concluding that the root cause must be ethanol, rather than your drinking habits.
> The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lower the odds via various forms of redundancy.
Of course, and I'm not actually arguing that adding in ECC is completely worthless to that effect, though it is close to worthless. Luckily, ECC is quite cheap, if not free, so throwing it in there makes sense.
However, suppose ECC would increase the cost by several magnitudes, would it still be worth it? Obviously not. Redundancy alone reduces the probability of spurious failure by several magnitudes, and simply increasing redundancy would be far cheaper than adding in ECC.
> If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
My point is that bitflips either don't really matter, in case data integrity is not mission critical, or they don't actually solve the problem, in case data integrity is mission critical.
If you have solved the problem of data integrity through redundancy, then ECC doesn't make much of a difference anymore. If you haven't solved the problem, then ECC will only prevent a vanishingly small subset of disasters that are awaiting you.
> I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
I don't care how often it happens. I care about the odds of a bitflip causing an actual problem. If a computer crashes, that's okay, it'll reboot. If any data were to be corrupted, it would most likely happen at the disk level and not the DRAM level.
> I guess you define something being a problem differently than I or the ECC ram industry do.
Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-belief.
Yes, that is clear.
> If you have solved the problem of data integrity...
As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.
> Redundancy alone reduces the probability of spurious failure by several magnitudes
ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.
Naive redundancy ignores almost a century of better method form forward error correcting codes. I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.
>Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-belief.
And we're done. If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems. Good luck.
The actual problem is binary. You either solved it, or you didn't. ECC is "free", but it doesn't actually solve the problem. Actually solving the problem requires engineering.
Of course there's a probabilistic element to it, but the problem is to drive the probability of failure to "vanishingly small". The utility of adding or removing a vanishingly small constant to another vanishingly small constant is vanishingly small. This is what ECC does for you.
> ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.
ECC reduces the probability of spurious failure due to bitflips in DRAM by several magnitudes. However, spurious failure can occur for so many more reasons that the bitflip issue becomes a vanishingly small part.
> I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.
As you know, having worked for NASA, this is the right choice under certain circumstances. If there are lives on the line and you have a choice between "not solving a problem" and "a terribly expensive solution", you should go with the latter.
> If you don't think ECC helps a real problem then I see why you don't understand bitflip causing problems.
ECC does not solve the problem of data integrity. If you actually solve the problem of data integrity, you will find that ECC becomes effectively redundant. Do we not fundamentally agree on this? If so, why not?
That's not to say ECC is entirely useless from an administrative standpoint. It makes DRAM bitflips one less thing to worry about. One less thing out of thousands of things. Commensurately, the cost of ECC in a given deployment, like its utility, is vanishingly small.
My old Honda crv however would turn traction control on if your pressure was low - which worked by applying brakes to wheels that were slipping. If you were going up a slippery hill you would soon have no power, sliding backwards nearly off the road in nowhere West Virginia on the way to a ski resort.
ECC memory is just as fast as non-ECC memory, and only cost a little more.
I am totally for ECC and was flabbergasted when it went away. But the article makes sense since I remember Intel pushing hard to keep it out of the consumer space. The freaking QX6800 didn't support ECC and it retailed for over a grand.
The entire discussion is about why ECC is not common and why ECC matters. There is no technical reason for ECC to be uncommon aside purposed market segmentation.
Of course, there is no intrinsic availability of ECC udimms for the retail market currently, however that does mean ECC has no use or benefits for a small extra production cost.
Why something hasn't been done is always a hard question to answer, since to succeed a lot of things have to go right, and by default none of them do. But one thing is that microkernels were more trendy in the 90s - r&d people are mostly doing things like "the cluster is the computer", unikernel, exokernel, rump kernel, embedded (eg tock), remote attestation since then (I'm not up to date on the latest).
For ECC, there's no reason to really expect that adding ECC to everything would really make things too expensive or slow long term. For a long time the reason why desktops did not have ECC was pretty much because intel wanted people who really need ECC to buy Xeons.
Exactly like how we try to recreate the effect of ECC hardware but without its "weight" (cost, complexity). For a problem that's not nearly as visible or life and death.
> too expensive or slow long term
Not "too" expensive long term still means more expensive especially short term, like when the person buys it. For an issue no average consumer is actually frustrated about. Why are we still debating why those consumers don't care about ECC? How is it different from hot swappable RAM for the average Joe? They'll upgrade or expand their RAM more often than they'll get frustrated about the effect of bitflips. So why not have hot swappable memory? Because as much as I'd love hot swappable anything, regular consumers neither care about it nor want to pay for it.
You're making a claim about what people would choose. If you have no related data, and logic could support multiple outcomes, then a claim like that is basically useless.
> You confused manufacturing costs, price of the product, and profit margins.
I'm not sure why you think this.
> Again supporting my statement that even if presented with 2 absolutely identical parts save for ECC vs. LEDs the vast majority of consumers will go for LEDs because they don't care or know about ECC.
Sure, if you don't tell them that it's ECC they won't pick the ECC part.
If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.
When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.
My claims are common sense and supported by the real life: regular people don't know what ECC is, and those who do find the problem's impact is too minor to get palpable benefits from fixing it. Why are you being pedantic if you aren't actually going to bring arguments at the same level you expect from me?
> If you actually do a fair test, and put them side by side while explaining that one protects them from memory errors and the other looks cooler, you can't assume they'll all pick the LED.
Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"? Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...
The benefits of LEDs are hard to miss (light) all the time. The benefits of ECC are hard to observe even in that fraction of a percent of the time. Human cellular "bitflips" happen every hour but they don't visibly affect you so you also consider it's not an issue that demands more attention, like constant solar protection. People aren't keen on paying to solve problems they never suffered from, or even noticed, especially when you tell them they happen so often still with no obvious impact. Unless they have no choice, like OEMs selling ECC RAM only devices.
Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?
> When people never even think of ECC, that is not evidence that they wouldn't care or know about it in a head-to-head competition.
Well then, I guess none of us has any evidence except today people buy LEDs not ECC RAM. Educate people or wait until manufacturing process and design are so susceptible to bitflips that people notice and it will be a different conversation.
Regular people aren't given the choice! The things you're quoting about the real world to support your argument are incompatible with a scenario where someone is actually looking at ECC and LED next to each other. And I'm not being "pedantic" to say that, it's a really core point.
> Isn't this exactly the kind of claim you yourself characterize one paragraph above as "useless" because "you have no related data, and logic could support multiple outcomes"?
A claim of a specific outcome is useless. "you can't assume" is another way of phrasing the lack of knowledge of specific outcomes.
> Sure, if people were more tech-educated then my assumption might be wrong. But people aren't more educated so...
It's the kind of thing that can go on a product page. But first someone has to actually make a consumer-focused sales page for ECC memory, and the ECC has to be plug-and-play without strong compatibility worries.
And just like when LEDs spread over everything, it's something that you can teach people about and create demand for with a bit of advertising.
> Sell me ECC memory when my (actual real life) 10 year old desktop or 5 year old phone never glitched. Sell me ECC RAM when my Matlab calculations come back different every time. See the difference?
That's a clear picture of one person. But "never glitched" is a very dubious claim, and you can't blindly extrapolate that to how everyone would act.
I live in Denver but spend a lot of time skiing around 11k feet, maybe the higher elevation means more radiation.
Aside: I'm surprised you got a TPMS programming tool instead of a set of steelies. Big wheels? Multiple winter vehicles?
Interestingly enough, average background radiation is actually the highest in the US exactly where you are[0] and I've seen comments (on a few websites talking about it) ranging from "the Rockies are loaded in Uranium" to "what about the Rocky Flats nuclear weapons site". Perhaps you're exactly in the right place to be an advocate for ECC :)
Why does it matter? It doesn't idle that high; it only goes that high of you're using it flat out, in which case the extra power usage is justified because it's giving that much more performance over a 100 W TDP CPU. Now I totally get it if you don't want to go Threadripper just for ECC because it's more expensive, but max power draw, which you don't even have to use? I've never seen anyone shop a desktop CPU by TDP, rather than by performance and price.
Oh oh, me! Back in the day I bought a 65W CPU for a system that could handle a 90W. I wanted quiet and figured that would keep fan noise down at a modest performance penalty. It should also last longer, being the same design but running cooler. I ran that from 2005 until a few years ago (it still run fine but is in storage).
Planning to continue this strategy. I suspect it's common among SFF enthusiasts.
IMO, shopping by performance/watt makes sense. Shopping by TDP doesn't. (Especially since there is no comparing the AMD and Intel TDP numbers as they're defined differently; neither is the maximum the processor can draw, and Intel significantly exceeds the specified TDP on normal workloads).
My passive-cooled desktop is also running a slightly trottled down 65W CPU.
So yes, there are people who choose there hardware by TDP.
Get a huge cooler like Noctua d14, and you pc becomes silent. It lasts forever, requires no maintenance, a good investment.
If you are adventurous, watercooling is even better, but its a can of worms I decided I'd rather live without - possibility of leaks and cost make it harder to justify
That's me. When I start to plan for a new system, I select the processor first and read its thermal design guidelines (Intel used to have nice load vs. max temp graphs in their docs) and select every component around it for sustained max load.
This results in a more silent system for idle and peace of mind for loading it for extended duration.
You can passively cool threadrippers if you underclock them enough and have good ventilation in case.
ECC isn't validated by AMD for AM4 Ryzen models, but it's present and supported if the motherboard also supports it. Many motherboards have ECC support (the manual will say for sure), and a handful of models even explicitly advertise it as a feature.
I have a Ryzen 9 3900X on an ASRock B450M Pro4 and 64 GB of ECC DRAM, and ECC functionality is active and working.
Secondly, it probably also means that they do not include tests for this functionality when they perform the final tests against each fully assembled chip. I'd expect that a jtag boundary scan does verify that the bond wires are in place and work, but no functional tests of ECC are run on each processor in the consumer configuration.
The net result is that with a compatible motherboard and memory, ECC almost certainly works (since the memory controller is the same as in the supported model) but AMD does not officially guarantee it. It is much like overclocking. The functionality is present, and it should work, and most likely does, but AMD accepts no responsibility if it does not, since they don't formally test for it.
Found some here -- bottom of the EPYC product line starts at $2849 ...!
I don't know where you live, but around here, (if you buy new?), the vendor MUST take back items up to 15 days after they were delivered, for ANY reason.
So, as long as you synchronize your buying of CPU, RAM, (motherboard), you should be fine.
Any apples-to-apples comparable Intel CPU will have comparable power use. The difficulty is that Intel didn't really have anything like Threadripper — their i9 series was the most comparable (high clocks and moderate core counts), but i9 explicitly did not support ECC memory, nullifying the comparison.
You're looking at 2950X, probably? That's a Zen+ (previous gen) model. 16 core / 32 thread, 3.5 GHz base clock, launched August 2018.
Comparable Intel Xeon timeline is Coffee Lake at the latest, Kaby lake before that. As far as I can tell, no Kaby Lake nor Coffee Lake Xeons even have 16 cores.
The closest Skylake I've found is an (OEM) Xeon Gold 6149: 16/32 core/thread, 3.1 GHz base clock, 205W nominal TDP (and it's a special OEM part, not available for you). The closest buyable part is probably Xeon Gold 6154 with 18/36 core/threads, 3GHz clock, and 200W nominal TDP.
Looking at i9 from around that time, you had Skylake-X and a single Coffe Lake-S (i9-9900K). 9900K only has 8 cores. The Skylake i9-9960X part has 16/32 cores/threads, base clock of 3.1GHz, and a nominal TDP of 165W. That's somewhat comparable to the AMD 2950X, ignoring ECC support.
Another note that might interest you: you could run the Threadripper part at substantially lower power by sacrificing a small amount of performance, if thermals are the most important factor and you are unwilling to trust Ryzen ECC: http://apollo.backplane.com/DFlyMisc/threadripper.txt
Or just buy an Epyc, if you want a low-TDP ECC-definitely-supported part: EPYC 7302P has 16/32 cores, 3GHz base clock, and 155W nominal TDP. EPYC 7282 has 16/32 cores, 2.8 GHz base, and 120W nominal TDP. These are all zen2 (vs 2950X's zen+) and will outperform zen+ on a clock-for-clock basis.
> And though ECC is not disabled in Ryzen CPUs, AFAIK it's not tested in (or advertised for) those, so one won't be able to return/replace a CPU if it doesn't work with ECC memory, AIUI, making it risky.
If your vendor won't accept defective CPU returns, buy somewhere else.
> Though I don't know how common it is for ECC to not be handled properly in an otherwise functioning CPU; are there any statistics or estimates around?
ECC support requires motherboard support; that's the main thing to be aware of shopping for Ryzen ECC setups. If the board doesn't have the traces, there's nothing the CPU can do.
What specific configurations (CPU, MB, RAM) are known to work?
Let's say I have a Ryzen system, how can I check if ECC really works? Like, can I see how many bit flips got corrected in, say, last 24h?
You must check the specifications of the motherboard to see if ECC memory is supported.
As a rule, all ASRock MBs support ECC and also some ASUS MBs support ECC, e.g. all ASUS workstation motherboards.
I have no experience with Windows and Ryzen, but I assume that ECC should work also there.
With Linux, you must use a kernel with all the relevant EDAC options enabled, including CONFIG_EDAC_AMD64.
For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use a kernel 5.10 or later, for ECC support.
On Linux, there are various programs, e.g. edac-utils, to monitor the ECC errors.
To be more certain that the ECC error reporting really works, the easiest way is to change the BIOS settings to overclock the memory, until memory errors appear.
Looking back at my notes, the output of journalctl -b tells should say something like, "Node 0: DRAM ECC enabled."
Then 'edac-ctl --status' should tell you that drivers are loaded.
Then you run 'edac-util -v' to report on what it has seen,
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.They aren't tested on it, so it's possible to get a dud, but it's minuscule chance that isn't worth bothering.
Now, to actual issues you can encounter: motherboards
The problem is that ECC means you need to have, iirc, 8 more data lines between CPU and memory module, which of course mean more physical connections (don't remember how many right now). Those also need to be properly done and tested, and you might encounter a motherboard where it wasn't done. Not sure how common, unfortunately.
Another issue is motherboard firmware. Even though AMD supplies the memory init code, the configuration can be tweaked by motherboard vendor, and they might simply break ECC support accidentally (even by something as simple as making a toggle default to false then forgot to expose it in configuration menu).
Those are the two issues you can encounter.
The difference with AFAIK Threadripper PRO, and EPYC, is that AMD includes ECC in its test and certification programs for it, which kind of enforces support.
PC C:\> wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6
SuperUser has a convenient decoder[1], but modern systems will report "6" here if ECC is working.When Windows detects a memory error, it will record it in the system event log, under the WHEA source. As a side note, this is also how memory errors within the CPU's caches are reported under Windows.
[1] https://superuser.com/questions/893560/how-do-i-tell-if-my-m...
*not officially, and the memory controller provides no report for 'fixed' errors.
Edit: as detaro mentioned in the reply, there is, and here's the source [0] -- that's what they mean by "RAS" on promotional pages [1]. That indeed looks like a nice option.
[0] https://www.amd.com/system/files/documents/updated-3000-fami...
[1] https://www.amd.com/en/products/embedded-epyc-3000-series
One consequence of using a moving average is that if the CPU has been idle for a long time then starts running a high power workload instantaneous power consumption can momentarily exceed the TDP while the average catches up. This is often misleadingly referred to as "turbo mode" by hardware review sites. It's not a mode, there's no state machine at work here, it's just a natural result of using a moving average. The use of EWMA is meant to model the heat capacity of the cooling solution. When the CPU has been idle for a while and the heatsink is cool, the CPU can afford to use more power while the heatsink warms up.
Another factor which confuses things is motherboard firmware disabling power limits without the user's knowledge. Motherboards marketed to enthusiasts often do this to make the boards look better in review benchmarks. This is where a lot of the "Intel is lying" comes from, but it's really the motherboard manufacturers being underhanded.
The situation on the AMD side is of course a bit different. AMD's power and frequency scaling is both more complex and much less documented than Intel's so it's hard to say exactly what the CPU is doing. What is known is that none of the actual power limits programmed into the CPU align with the TDP listed in the spec. In practice the steady state power consumption of AMD CPUs under load is typically about 1.35x the TDP.
Unlike Intel, firmware for AMD motherboards does not mess with the CPU's power limit settings unless the user does so explicitly. Presumably this is because AMD's CPU warranty is voided by changing those settings, while Intel's is not.
https://www.intel.com/content/www/us/en/architecture-and-tec...
And on the flip side, if you're building a desktop PC with a more high-end Intel processor, you will usually have to change a lot of motherboard firmware settings to get the behavior to resemble Intel's own recommendations that their TDP numbers are supposedly based on. Without those changes, lots of consumer retail motherboards default to having most or all of the power limits effectively disabled. So out of the box, a "65W" i7-10700 and a "125W" i7-10700K will both hit 190-200W when all 8 cores/16 threads are loaded.
If a metric can in practice be off by a factor of three in either direction, it's really quite useless and should not be quantified with a scientific unit like Watts.
May be we should use a new term for it, something like iTDP.
If they gave it some other name, it would be only misleading. Calling it TDP is a lie.
Intel mobile processors actually obey TDP better than AMD processors do - Tiger Lake has a hard limit, when you configure a 15W TDP then it really is 15W steady-state once boost expires, while AMD mobile products will pull up to 50% more than configured in steady-state operation. (the gap is larger than desktop)
https://images.anandtech.com/doci/16084/Power%20-%2015W%20Co...
"the brands measure it differently" is sort of theoretically true but not in the sense people think, and in practice it is not true.
On AMD it is literally just a number they pick that goes into the boost algorithm. Robert Hallock did some dumb handwavy shit about how it's measured with some delta-t above ambient with a reference cooler but the fact is that the chip itself basically determines how high it'll boost based on the number they configure, so that is a self-fulfilling prophecy, the delta-t above ambient is dependent on the number they configure the chip to run at.
In practice: what's the difference between a 3600 and a 3600X? One is configured with a TDP of 65W and one is configured with a TDP of 95W, the latter lets you boost higher and therefore it clocks higher. Configure them both to a 65W PPT limit and they will boost to pretty much the same place.
Intel nominally states that it's measured as a worst-case load at base clocks, something like Prime95 that absolutely nukes the processor (and even then many processors do not actually hit it). But really it is also just a number that they pick. The number has shifted over time, previously they used to undershoot a lot, now they tend to match the official TDP. It's not an actual measurement, it's just a "power category" that they classify the processors as, it's informed by real numbers but it's ultimately a human decision which tier they put them in.
So in practice, for both brands, it is just a number they pick. They have different theoretical methods for getting there but ultimately the marketing department looks at where the clocks would put them and pick a power number that they think represents that. It is not, in practice, a pure measurement from either brand, it is just a "category" they use.
Real-world you will always boost above base clocks on both brands at stock TDP, at least on real-world loads. You won't hit full boost on either brand without exceeding TDP, the "AMD measures at full boost" is categorically false despite the fact that it's commonly repeated. AMD PPT lets them boost above the official TDP for an unlimited period of time, they cannot run full boost when limited to official TDP.
You can also use memtest86+ for this, although I don't recall if it requires specific configuration for ECC testing.
There are computers in the Intel NUC form factor, with ECC support (e.g. with Ryzen V2718), e.g from ASRock Industrial.
These days, a quiet, pwm fan with good thermal paste (and maybe some linux CPU throttling) more than achieves my needs for a "silent" pc 99% of the time.
I would love to be told my above assumptions are wrong if they are.
The worst bit is, AMD and Intel define TDP differently-- neither is the maximum power the processor can draw-- though Intel is far more optimistic.
I think some Gigabyte boards are infamous for this in certain circle
OTOH: Gigabyte might have a Threadripper PRO motherboard (WRX80 chipset) coming out in the future
EPYC TDP ranges from "a lot lower" (35W embedded, 120W regular) up to "comparable with" (180-240W, a single 280W model) relative to Threadripper (180-250W last gen; current gen is all 280W). It's definitely not a lot higher on the Epyc side.
> I think of Threadripper as mid-tier, and EPYC as the high-end.
This oversimplifies to the point of not being a useful intuition (or is arguably even incorrect). Threadripper is a SKU with a moderate number of cores at relatively high clocks; (high-power) EPYC SKUs have a lot of very efficient cores running at lower clocks. They both have a niche, but Threadripper has unambiguously better single-core performance due to the ~20% higher clocks. And single-core IPC still matters in many applications (to oversimplify: Amdahl's law; but also, latency-sensitive applications).
In my case loading means maxing out all cores and extended period of time can be anything from five minutes to hours.
Both are optimistic lies, but-- if you look at the documents it looks like currently AMD needs more cooling, but actually dissipates less power in most cases and definitely has higher performance/watt.
Doesn't matter for me since I'm not interested in comparing them.
> Both are optimistic lies, but-- if you look at the documents it looks like currently AMD needs more cooling, but actually dissipates less power in most cases and definitely has higher performance/watt.
I'm aware of the situation, and I always inflate the numbers 10-15% to increase headroom in my systems. The code I'm running is not a most case code. A FPU heavy, "I will abuse all your cores and memory bandwidth" type, heavily optimized scientific software. I can sometimes hear that my system is swearing at me for repeatedly running for tests.
I don't like to add this paragraph but, I'm one of the administrators of one of the biggest HPC clusters in my country. I know how a system can surpass its TDP or how can CPU manufacturers skew this TDP numbers to fit in envelopes. We make these servers blow flames from their exhausts.
Performance/watt metrics and idle consumption would have been a far better way to make this choice.
If you have a choice between A) something that can dissipate 65W peak for 100 units of performance, but would dissipate 4W average under your workload, and B) something that can dissipate 45W peak for 60 units of performance, but would dissipate 4.5W under your workload... I'm not sure why you'd ever pick B.
Also, even though the CPU may draw less, can still the power supply waste more, just because it is beefy? Comparing with a sports car, they have great performance, but also use more gas in ordinary traffic? Can a computer be compared with that?
Community benchmarks, from Tom's Hardware, etc.
The vendor numbers are make believe-- you can't use them for power supply sizing or for thermal path sizing. If you look at the cited TDP numbers today-- it can be misleading-- e.g. often Intel 45W TDP parts use more power at peak than AMD 65W parts.
On modern systems, almost none of the idle consumption is the processor. The power supply's idle use and motherboard functions dominate.
> Also, even though the CPU may draw less, can still the power supply waste more, just because it is beefy?
Yes, having to select a larger power supply can result in more idle consumption, though this is more of a problem on the very low end.
In a sandwich style case you're usually limited to low profile coolers like Noctua L9i/L9a since vertical height is pretty limited.
If you want a 45W TDP from the 3700X, you can just pop into Ryzen Master and ask for a 45W TDP. Boom, you're running in that envelope.
I think shopping based on TDP is not the best, because it's not comparable between manufacturers and because it's something you can effectively "choose".
I don't really see the reason in paying for a 100w TDP premium if I'm just going to scale it down to 65w.
As a petty "Take that", I dropped the max frequency from 2.0 GHz to 1.0 GHz. I ran a couple benchmarks to prove the cap was working, and then just kept it at 1.0 for a few months, to prove my point.
It made a bigger difference on my ARM SBC, where I tried capping the 1,000 MHz chip to 200 or 400 MHz. That chip was already CPU-bound for many tasks and could barely even run Firefox. Amdahl's Law kicked in - Halving the frequency made _everything_ twice as slow, because almost everything was waiting on the CPU.
And the relationship between power and performance isn't linear as processor voltages climb trying to squeeze out the last bit of performance.
So if you want to take a 105W CPU and ask it to operate in a 65W envelope, you're not giving up even 1/3rd of peak performance, and much less than that of typical performance.
Yup, they're out there.
> I don't really see the reason in paying for a 100w TDP premium if I'm just going to scale it down to 65w.
You might want the core count or peak performance for the very short term. When I was looking, running 65W parts in the 45W envelope was only about a 7% penalty, so you get a bunch more performance/watt.
AMD Ryzen 9 5950X: 20.6W for a single core at 5050MHz, 49W for the whole package. (And it’s generally the package figure that you care about.)
AMD Ryzen 9 5900X: 17.9W/54W at 4875MHz.
AMD Ryzen 7 5800X: 17.3W/37W at 4825MHz.
AMD Ryzen 5 5600X: 11.8W/28W at 4650MHz (though the highest core reading is 13W, at three cores loaded).
You’re both correct: by simply restricting that power envelope by 40%, you shed a lot less multi-threaded performance than people realise, and no single-threaded performance.
Look at the 5950X figures, and you observe that at about 120W, it can run 6 cores at 4,650MHz (27,900 core–MHz), or 16 cores at 3,775MHz (60,400 core–MHz).
Expressed one way: by dropping the frequency by 20%, power per watt increased by around 2.7×.
Expressed another way: let’s skip a 65W envelope—put this particular 105W chip in a 40W envelope and you lose only 20% of your six-cores performance. Seriously. But I’m not sure what the curve would look like if you load all 16 cores at a 40W envelope, what speed they’d be going at.
(But do remember that “TDP” is a bit of a mess as a concept, and that we’re depending on non-core power consumption being generally fairly consistent regardless of load.)
On AMD, it's a utility you run. I believe you may require a reboot to apply it. On some Intel platforms, it's been settings in the BIOS.
> It sounds interesting if I can run a beefy rig as a power efficient device, for always-on scenarios, and then boost it when I need.
This is what the processor is doing internally anyways. It throttles voltage and frequency and gates cores based on demanded usage. Changing the TDP doesn't change the performance under a light-to-moderate workload scenario at all.
Ryzen Master lets you change some of the tuning for the choices it makes about when and how aggressively to boost, though, too.