EPYC 7002 CPUs may hang after 1042 days of uptime

EPYC 7002 CPUs may hang after 1042 days of uptime (old.reddit.com)

159 points by gfv 3 years ago | 104 comments

I feel like some of the comments here are missing the point. Yes, it's only likely to affect a small number of users, but so was the Intel FDIV bug. Both are defective products.

Back then Intel was pressured into a recall; today we seem too willing to put up with being sold broken stuff.

Waterluvian 3 years ago | |

It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

arp242 3 years ago | | |

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.

gchamonlive 3 years ago | | |

It means anyone launching AMD-powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given that this type of CPU could have been bought, booted, or rebooted at any time in the past three years.

Seems comparably problematic to me.

0xr0kk3r 3 years ago | | |

To be fair, it was possible to tell which operations would be off in the FDIV bug and by how much. It was 100% deterministic. The problem was that checking all the operands in software before performing the computation, in order to make adjustments, completely defeated the purpose of having an FPU.

KptMarchewa 3 years ago | | |

Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...

gfv 3 years ago | |

There are always bugs in silicon, just like there are bugs in software. They mostly show up under "a highly specific and detailed set of internal timing conditions". There are 40 documented errata on EPYC 7002s alone; there are 35 in the 13th-gen Intel CPUs, including, curiously, RPL038, "Processor Exiting Package C6 or C8 May Hang". Mobile ARM chip manufacturers are notoriously bad at documenting their bugs, so who knows how many they have.

This one is interesting because its preconditions are so trivial, and it will affect many more people than usual.

PragmaticPulp 3 years ago | |

> Back then Intel was pressured into a recall; today we seem too willing to put up with being sold broken stuff.

This bug only applies to servers that haven’t been rebooted for 3 years and have the CC6 sleep state enabled. It can be worked around by disabling CC6 sleep state or rebooting once every 3 years.

If you think operators of these servers can’t be bothered to update and reboot their machines once in 3 years or change a single BIOS setting, what makes you think they’d be interested in tearing down their servers, physically replacing the CPU, and reassembling all of them with the associated downtime and inevitable accidental damage to some units? Nothing about that makes sense from a business perspective.
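For anyone who wants to know how close a given box is to the mark, here is a minimal sketch in Python. It assumes a Linux host (where the first field of `/proc/uptime` is seconds since boot) and takes the 1042-day figure from the thread title as an approximate threshold. The function names are made up for illustration.

```python
from pathlib import Path

THRESHOLD_DAYS = 1042      # figure from the thread title; treat it as approximate
SECONDS_PER_DAY = 86_400


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot; the first field of /proc/uptime on Linux."""
    return float(Path(path).read_text().split()[0])


def days_until_threshold(uptime_seconds, threshold_days=THRESHOLD_DAYS):
    """Days left before uptime reaches the threshold (negative once past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY
```

On a Linux machine, `days_until_threshold(read_uptime_seconds())` gives the remaining margin, which could feed whatever alerting is already in place well before the reboot becomes urgent.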

xp84 3 years ago | | |

I’m picturing a long 50’ aisle filled with racks and a guy with a huge box marked “replacement CPUs” and a screwdriver.

Good lord, can you imagine how long just a few of those would take in a data center?

dspillett 3 years ago | |

A key difference between then and now is how much easier it is to distribute software/firmware workarounds or fixes. From an end user's perspective, replacing the CPU might be seen as far easier than updating their software. A software fix would affect performance, so of course it isn't as simple as that, but this difference is part of the dynamic.

Also, as a direct user of the CPU, if the FDIV bug impacted you at all, it would affect you often, rather than once every three years, which is the impact frequency of this fault.

Another matter that affected the FDIV bug is that the Pentium line was the first time a CPU had been aggressively marketed directly at the general public in quite the way it was. Prior to that, only manufacturers and techies would have known about it, and they were used to errata for hardware components. The public more generally had the impression that hardware (at least undamaged hardware) was reliable and only software had bugs, and the FDIV bug invalidated that view of reality, causing a bit of a panic.

jeroenhd 3 years ago | |

These types of bugs have been in hardware forever. Nobody is going to replace hundreds of EPYC servers even if they could get a free replacement from AMD.

There are definitely cases where hardware should be exchanged with fixed chips, particularly the small business/consumer/hobbyist range where exchanging CPUs is worth the time and effort. The RDRAND problem with Ryzen chips was much worse because it actually happened all the time and there is still no microcode fix available for some motherboards (though AMD already makes the fix available so it's more of an issue about a lack of motherboard support than broken hardware).

znpy 3 years ago | |

> today we seem too willing to put up with being sold broken stuff.

I remember reading that when hard disks first came onto the mass market they were so expensive that having some bad sectors was not such a big deal... and so hard disks would usually come with a sheet of paper listing the known broken sectors (detected at the QA stage, I guess).

Maybe someone older than me (I guess somebody in their 50s or 60s) could confirm that.

bitwize 3 years ago | | |

I'm not that old, but I remember seeing bad sector lists as stickers on some hard disks.

I'm not sure if that ever went away, though... I think the IDE firmware in more modern hard disks knew how to redirect bad sectors to good sectors, so the end user never even noticed.

sidewndr46 3 years ago | |

I'm way too young to remember it clearly but from what I was told it was nothing of the sort. Intel announced that they had identified a bug and would review on a case by case basis to see who was affected and would determine if you were worthy of getting a CPU that was fixed.

Again, this is secondhand but from people who worked directly in the industry at the time.

atmavatar 3 years ago | |

Don't divide, Intel inside!

Neil44 3 years ago |

It seems a C6 state is an individual core sleeping. The intersection of people who don't reboot for 3 years and people who have sleep states enabled must be pretty small. It's an interesting bug though!

vegardx 3 years ago | |

I had a very similar issue with some AMD-based servers (Bulldozer, I think) about ten years ago. There was a bug where Xen-based virtual machines could set a C-state on the cores they were assigned, but for whatever reason weren't able to wake them up. It was fun trying to figure out what the heck was going on.

icybox 3 years ago | | |

I already have C-states disabled because of an old Linux kernel bug where the kernel hung on the Zen 3 architecture. So not much to see here :)

lUserAMD 3 years ago | |

EPYCs have many cores, and most applications (including those with long uptime requirements) use only a subset of the cores continuously. So it is totally normal for some of the cores to go into the deep C6 sleep state during phases of lower load. It will cause server operators headaches when those cores eventually don't come back. Reboots help; disabling C6 in the (already running) OS also helps.

Please note that we are not talking about a core sleeping for three years. We are talking about a core going into deep sleep when the system has been up for three years or longer.
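As a rough illustration of disabling a deep idle state from a running Linux system, the sketch below uses the kernel's cpuidle sysfs files (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/name` and `.../disable`). Which state name corresponds to CC6 depends on the platform and idle driver, so the caller has to supply it. The helper name `disable_idle_states` and its parameters are invented for this example, writing requires root, and the setting does not survive a reboot.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def disable_idle_states(names, root=CPU_ROOT, dry_run=True):
    """Disable every cpuidle state whose name is in `names`, on all CPUs.

    Returns the `disable` files that were written (or would be, if dry_run).
    """
    wanted = set(names)
    touched = []
    # cpu[0-9]* skips sibling directories such as cpuidle/ and cpufreq/.
    for state_dir in sorted(Path(root).glob("cpu[0-9]*/cpuidle/state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        if name in wanted:
            target = state_dir / "disable"
            if not dry_run:
                target.write_text("1")  # 1 = state disabled, 0 = enabled
            touched.append(target)
    return touched
```

Running it first with the default `dry_run=True` and printing the result shows which files would be touched. The state name to pass (for example `"C2"` on some ACPI-driven systems) is an assumption to verify against the `name` and `desc` files on the actual machine.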

neilv 3 years ago |

Reminds me of the Intel Atom C2000 series brickings, circa 2017.

https://www.anandtech.com/show/11110/semi-critical-intel-ato...

https://www.servethehome.com/intel-atom-c2000-series-bug-qui...

nh2 3 years ago |

I filed a kernel bug, 'System thrashes with "AMD-Vi: Completion-Wait loop timed out" after 247 days of uptime', for the AMD Ryzen 7 3700X:

https://bugzilla.kernel.org/show_bug.cgi?id=217257

Might this be potentially related?

eqvinox 3 years ago |

MSR-poking Tool for Zen1 Ryzen CPUs to disable C6: https://github.com/r4m0n/ZenStates-Linux/blob/master/zenstat...

Not sure if this is applicable to EPYC CPUs, probably not. But I would expect that it's possible to disable C6 in some similar way on EPYC CPUs without rebooting the system. (If you are actually at risk of running into this issue, you likely don't want to reboot the system…)

tedunangst 3 years ago |

The good news is now I know why my server crashed last month, and it wasn't some other defect.

SpaghettiCthulu 3 years ago | |

You've had a Ryzen 7000 series CPU running for nearly 3 years already?

mattpallissard 3 years ago |

This happened with a higher end Cisco switch (the model escapes me) we used in our core many moons ago. Stopped passing traffic completely after a number of days.

At least Cisco told us about it themselves. We just fail-over rebooted until they fixed it.

msla 3 years ago |

Previously:

https://news.ycombinator.com/item?id=28340101 Watch Windows 95 crash live as it exceeds 49.7 days uptime [video]
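The 49.7-day figure in that title matches the time a 32-bit counter of milliseconds takes to wrap. The linked title does not spell out the cause, so take this as background arithmetic rather than something the thread states. A small Python sketch, with `wrap_days` as a made-up helper:

```python
SECONDS_PER_DAY = 86_400


def wrap_days(bits, ticks_per_second):
    """Days until an unsigned counter of the given width wraps around."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY


print(round(wrap_days(32, 1000), 1))  # 49.7
```

The same helper can be pointed at other counter widths and tick rates when trying to guess whether an oddly specific uptime limit is an overflow.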

bushbaba 3 years ago |

In general, it is good practice to have machines hard-restart every now and then. Otherwise you run into some weird edge cases and rely too much on things being up and running 24x7x365.

dale_glass 3 years ago |

A machine staying up for almost 3 years is irresponsible in this day and age.

Yeah, I remember people having uptime competitions on Slashdot and the like some decades back, but you only need to look at the ssh logs of a five-minute-old machine to realize this is a terrible idea in modern times.