EPYC 7002 CPUs may hang after 1042 days of uptime (old.reddit.com)
Back then Intel was pressured into a recall; today we seem too willing to put up with being sold broken stuff.
One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.
I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.
Seems comparably problematic to me.
This one is interesting because its preconditions are so trivial, and it will affect many more people than usual.
This bug only applies to servers that haven’t been rebooted for 3 years and have the CC6 sleep state enabled. It can be worked around by disabling CC6 sleep state or rebooting once every 3 years.
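For anyone who wants to know how close a given box is, here is a minimal sketch that turns an uptime reading into days remaining. It assumes the 1042-day figure from the thread title is the relevant threshold and treats it as approximate; the function names are hypothetical, and `/proc/uptime` is the usual Linux source for uptime in seconds.

```python
from pathlib import Path

# Figure taken from the thread title; treat it as approximate.
HANG_THRESHOLD_DAYS = 1042
SECONDS_PER_DAY = 86_400


def days_until_threshold(uptime_seconds: float,
                         threshold_days: float = HANG_THRESHOLD_DAYS) -> float:
    """Return days left before uptime reaches the threshold (negative if past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def read_uptime_seconds(proc_uptime: str = "/proc/uptime") -> float:
    """Read uptime in seconds from the first field of /proc/uptime (Linux only)."""
    return float(Path(proc_uptime).read_text().split()[0])
```

On a Linux host, `days_until_threshold(read_uptime_seconds())` gives the margin; a monitoring check could alert well before it reaches zero so the reboot can be scheduled.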
If you think operators of these servers can’t be bothered to update and reboot their machines once in 3 years or change a single BIOS setting, what makes you think they’d be interested in tearing down their servers, physically replacing the CPU, and reassembling all of them with the associated downtime and inevitable accidental damage to some units? Nothing about that makes sense from a business perspective.
Good lord, can you imagine how long just a few of those would take in a data center?
Also, as a direct user of the CPU, if the FDIV bug affected you, it affected you often, rather than once every three years, which is the impact frequency of this fault.
Another matter that affected the fdiv bug is that the Pentium line was the first time a CPU had been aggressively marketed directly at the general public in quite the way it was. Prior to that only manufacturers and techies would have known about it and they were used to errata for hardware components. The public more generally had an impression that hardware (at least undamaged hardware) was reliable and only software had bugs, and the fdiv bug invalidated that view of reality causing a bit of a panic.
There are definitely cases where hardware should be exchanged with fixed chips, particularly the small business/consumer/hobbyist range where exchanging CPUs is worth the time and effort. The RDRAND problem with Ryzen chips was much worse because it actually happened all the time and there is still no microcode fix available for some motherboards (though AMD already makes the fix available so it's more of an issue about a lack of motherboard support than broken hardware).
I remember reading that when hard disks first came onto the mass market they were so expensive that having some bad sectors was not such a big deal... and so a hard disk would usually come with a sheet of paper listing the known bad sectors (detected at the QA stage, I guess).
Maybe someone older than me (I guess somebody in their 50s or 60s) could confirm that.
I'm not sure if that ever went away, though... I think the IDE firmware in more modern hard disks knew how to redirect bad sectors to good sectors, so the end user never even noticed.
Again, this is secondhand but from people who worked directly in the industry at the time.
Please note that we are not talking about a core sleeping for three years. We are talking about a core going into deep sleep when the system has been up for three years or longer.
https://www.anandtech.com/show/11110/semi-critical-intel-ato...
https://www.servethehome.com/intel-atom-c2000-series-bug-qui...
for AMD Ryzen 7 3700X
https://bugzilla.kernel.org/show_bug.cgi?id=217257
Might this be potentially related?
Not sure if this is applicable to EPYC CPUs, probably not. But I would expect that it's possible to disable C6 in some similar way on EPYC CPUs without rebooting the system. (If you are actually at risk of running into this issue, you likely don't want to reboot the system…)
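As a rough illustration of the "without rebooting" idea: Linux exposes per-CPU idle states under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/`, each with a `name` file and a `disable` file (writing `1` to `disable` as root turns that state off at runtime). The read-only sketch below just lists the states for one CPU. Which of the listed states corresponds to CC6 on a given EPYC system is an assumption you would have to verify yourself, and `idle_states` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict


def idle_states(cpu_dir: Path) -> Dict[str, bool]:
    """Map each cpuidle state name to whether it is currently disabled.

    `cpu_dir` is one CPU's sysfs directory, e.g. /sys/devices/system/cpu/cpu0.
    """
    states: Dict[str, bool] = {}
    for state in sorted((cpu_dir / "cpuidle").glob("state*")):
        name = (state / "name").read_text().strip()
        # The kernel reports "0" for enabled; anything else means disabled.
        disabled = (state / "disable").read_text().strip() != "0"
        states[name] = disabled
    return states
```

Calling `idle_states(Path("/sys/devices/system/cpu/cpu0"))` shows what the kernel thinks is available; actually disabling a state would have to be repeated for every CPU and does not persist across reboots.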
At least Cisco told us about it themselves. We just fail-over rebooted until they fixed it.
https://news.ycombinator.com/item?id=28340101 Watch Windows 95 crash live as it exceeds 49.7 days uptime [video]
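The 49.7-day figure is what you get when a 32-bit counter of milliseconds wraps around, which is the usual explanation given for that Windows 95 hang. A quick sketch of the arithmetic (the helper name is made up, and nothing here claims the EPYC hang has the same cause):

```python
SECONDS_PER_DAY = 86_400


def overflow_days(bits: int, ticks_per_second: float) -> float:
    """Days until an unsigned counter of `bits` bits wraps at the given tick rate."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


# 32-bit counter ticking once per millisecond
win95_days = overflow_days(32, 1000)  # about 49.71 days
```

The same helper works for any counter width and tick rate, which makes it easy to sanity-check an uptime limit quoted in an erratum or bug report.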
Yeah, I remember people having uptime competitions on Slashdot and the like some decades back, but you only need to look at the ssh logs of a five-minute-old machine to realize this is a terrible idea in modern times.
Just because it would be dangerous for your nodejs web_app.exe running on Ubuntu behind Apache, fully exposed on the internet, doesn't mean it's dangerous everywhere.
There are a billion other ways to use computers, including air-gapped systems.
So, don't try to justify an obvious flaw
Yeah, you can do stuff to maximize uptime but if it needs to stay up that badly you have to consider the case of the hardware needing to be turned off at some point.
> So, don't try to justify an obvious flaw
I'm not, it's a bug and should be fixed. But I think if anything is powered for 3 years straight it's a bit concerning.
Otherwise you're liable to find that somebody started something by hand 2 years ago, and at a critical moment nobody quite remembers what the command was.
1840 – The Oxford Electric Bell
1871 – Souter Lighthouse in South Shields, UK
1896 – The Isle of Man’s Manx Electric Railway
1902 – The Centennial Bulb
Apparently, "The Centennial Bulb has seen just two interruptions: for a week in 1937 when the Firehouse was refurbished, and in May 2013 when it was off for nine and a half hours due to a failed power supply."[1] https://www.youtube.com/watch?v=LZTaXjt2Ggk
[2] https://www.drax.com/electrification/4-of-the-longest-runnin...
BUT this doesn't mean you need to have downtime, in the same way a train unit in a railway system going through maintenance doesn't mean your railway system has downtime.
Redundancy is a must-have feature for reliable systems, and that means your system must be able to cope with random hardware failure or the reboot of a server unit.
Both planned and unplanned maintenance of components are normal, important business, which in a well-designed reliable system should not lead to downtime.
Similarly, testing failure cases is important and should be done.
So either you don't run a high-reliability system (and likely never run into this bug), or you run a proper reliable system (and it's not a big deal), or you run a badly designed or operated system that pretends to be highly reliable but isn't really... which is irresponsible (if you are aware of it).
I mean, the centennial bulb barely glows, that's why it still works. The hotter the filament gets the faster it evaporates, so a light bulb that barely makes any light can stay working forever.
You don't need to reboot a machine to update ssh.
You only need to reboot the machine to update the kernel; for everything else, you just have to restart the corresponding user-space processes (and even PID1 can re-exec itself). Most kernel vulnerabilities are not remotely exploitable, so as long as you can trust your user-space processes (and keep them updated), it should be safe enough.
Yeah, you technically can replace on-disk files while services are running.
In practice this can cause trouble if an application wants to read an updated file at the wrong time, and library dependencies can require restarting a lot of stuff.
For ages people would install an update containing a security fix in glibc or libz or something, and keep on running the vulnerable version of the services that use them.
At that point you might as well reboot.
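To make the "still running the vulnerable version" case concrete: on Linux, a process that still maps a library file that was replaced on disk usually shows that mapping in `/proc/<pid>/maps` with a ` (deleted)` suffix. A minimal sketch that pulls those paths out of the maps text, assuming the package manager unlinks the old file rather than overwriting it in place; `deleted_mappings` is a hypothetical name.

```python
from typing import Set

DELETED_SUFFIX = " (deleted)"


def deleted_mappings(maps_text: str) -> Set[str]:
    """Return file paths that a process still maps although they were deleted."""
    stale: Set[str] = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if path.startswith("/") and path.endswith(DELETED_SUFFIX):
            stale.add(path[: -len(DELETED_SUFFIX)])
    return stale
```

Feeding it the contents of `/proc/<pid>/maps` for each service tells you which ones need a restart after an update; if the answer is "most of them", the reboot starts to look like the simpler option.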
Modern Fedora has a very Windows-like mechanism where you reboot to update. You reboot, the system installs updates, then reboots again.
(they replaced 40 million of those things..)
https://arstechnica.com/information-technology/2011/06/rsa-f...
I've worked in places where expensive Lab equipment is running off outdated PCs/servers because updates aren't available and they will absolutely stay on for as long as possible.
We're not all silicon valley, things can be expensive and difficult to replace...
That does not require a reboot, `systemctl daemon-reexec` is enough.
The problem is, if we can't expect software to run essentially forever, to update without 'restarts', and so forth, how are we ever going to achieve neural chip implants, artificial organs, synthetic agents mining ore in outer space, and so on? Software is not a gear mechanism or a rack and pinion; there is absolutely no reason to restart an 'operating system' or to ever lose state. However, we have become accustomed to it, and we commit these sorts of crimes daily: restarts and refreshes.
But if you need a single system to stay up for 3 years straight that's probably not good. There's too much going on in a modern high tech server for that to be a good idea. Everything has a CPU in it (including disks, video cards, network cards, etc). And any of that could make your system unusable by hitting some rare condition.
> The problem is, if we can't expect software to run essentially forever, to update without 'restarts', and so forth, how are we ever going to achieve neural chip implants, artificial organs, synthetic agents mining ore in outer space, and so on?
I would hope such things to be purpose-made and to be made in a way that the user can survive a reboot/firmware update. Eg, your neural implant should be built in such a way that it's not going to be life threatening if the battery runs out. The system has to be designed with that accounted for.
Maybe there's a secondary, minimal implementation acting as a backup and keeping critical functions working while the fully featured one is being updated. Hopefully everything is implemented in a failsafe way so that if it completely stops working you're not in a worse state than before you got it.
Any plan where there's a crucial component that must not stop even for a second isn't a very good plan.
Our bodies, just think of our hearts or lungs, don't stop for even a second for 80 something years, and even that 80 is most probably arbitrary with very few changes in cellular control (instead of cancer, cooperate; instead of scar, regenerate [1]). No current software artifact can boast of such performance. That's the main issue: our technology does not establish a hierarchy of competence [2], where each layer is independently able to solve problems, such as the cell-tissue-organ-organism continuum. We must start digitizing the material, assemble assemblers that can assemble themselves [3].
[1] Dr. Michael Levin: Xenobots, Limb Regeneration, and The Power of Cellular Communication, https://www.youtube.com/watch?v=H_TyON2xWeQ
[2] Michael Levin, What do bodies think about?, https://www.youtube.com/watch?v=CVr1OkDqnmo "Nested Cognition, not Merely Structure" starts at 4:32
[3] Neil Gershenfeld, How to Make Almost Anything, The Digital Fabrication Revolution, http://cba.mit.edu/docs/papers/12.09.FA.pdf
At a generic system level, for example, upgrading NixOS will pull new packages and put them next to the current ones, then reexec where possible. Nginx can replace its master process (SIGUSR2). Telephony software can often reexec and keep connections open. Etc.
Outside of desktops it's not that uncommon to do seamless live reloads of the whole system.
Also out of superstition, I avoid hibernate -- when I walk away, it's either on and locked or shut down. (I also did this on Windows; a mixed state just seemed off-puttingly and worryingly complex to me.)
Given what you said, and because I hear hibernation is notoriously buggy on Linux, both superstitions have rewarded me. :D
That's a pretty broad generalisation. Which distros do you mean?
On a related topic, Ubuntu has an optional package that can be enabled to automatically restart the various systemd components that need it after their dependencies have been upgraded. From memory, that's specifically so people don't have to reboot unless it's really needed.
I don't remember the name of the package offhand, but someone else here might... :)
I don't know if you're young or don't know much about history but what you describe is a fairly recent way of looking at things, it's not the only one and I guarantee you it will become "out of fashion".
And it makes a lot of sense because if uptime is that important, then no matter how fancy the hardware it can't do anything about disasters or losing internet connectivity.
I have ~1000 7002 cores in my home DC (8 dual socket R7525s with 48-64 cores each) that run kubernetes but are connected to a battery backup and use kexec to perform upgrades. So, while I am very bought into the cattle not pets philosophy, it's rare that any of these machines need to be turned off and I could see them being on for three years continuously without problem otherwise.
Pretty much why Pawsey has an Annual High Voltage inspection shutdown [1]
> Otherwise you're liable to find things like [..]
TBH that's not really been an issue of note at any of the big iron farms I've been around since the 1980s .. generally there's a disciplined approach to maintaining 24/7/365 operation (that includes scheduled downtime for equipment checks) part of which is process documentation and justification and soft means of freezing | migrating processes+data etc.
This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me due to a similar bug.
A motherboard update from MSI applied something from AMD and that fixed the issue.
I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...
As a systems seller you get most of the markup but also most of the responsibility, so handwaving 'sorry AMD fucked up' won't do it. You now have an installed base that might crash every 1042 days, which for unattended systems is long but not that long. Worse, if you have hardware redundancy, there's still a chance they all booted around the same time and so will crash around the same time.
Customers will be proactive and follow the intelligent periodic reboot schedule you propose for a time (see the 787 overflow bugs stories), while asking for a fix. The fix needs to still be OK with all the specs you sold. If one of these specs depends on sleep states, you'll have to find a solution around it and deploy it fleetwide. If a microcode update fixes it, yay. If the problem can't be winked away with a software patch, now the blast radius is bigger and you're still supposed to do as much as possible to use the least energy possible in most idle states...
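The "they all booted around the same time" risk is easy to check for if you have boot timestamps for the fleet. A minimal sketch, assuming the 1042-day figure from the thread title and a hypothetical one-day clustering window; the function names and the node names in the usage note are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Figure from the thread title; treat it as approximate.
THRESHOLD = timedelta(days=1042)


def hang_deadlines(boot_times: Dict[str, datetime]) -> Dict[str, datetime]:
    """Estimate when each node would reach the uptime threshold."""
    return {node: booted + THRESHOLD for node, booted in boot_times.items()}


def clustered_nodes(boot_times: Dict[str, datetime],
                    window: timedelta = timedelta(days=1)) -> List[List[str]]:
    """Group nodes that booted within `window` of the previously booted node."""
    ordered = sorted(boot_times.items(), key=lambda item: item[1])
    groups: List[List[str]] = []
    current: List[str] = []
    for node, booted in ordered:
        if current and booted - boot_times[current[-1]] > window:
            groups.append(current)
            current = []
        current.append(node)
    if current:
        groups.append(current)
    # Only groups with more than one node are a shared-deadline risk.
    return [group for group in groups if len(group) > 1]
```

For example, two nodes `a` and `b` booted six hours apart would come back as one group, which is the cue to reboot one of them early so the redundant pair no longer shares a deadline.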
The cardiac pacemaker (as in the tissue that sets the heart rate) is redundant. There's a primary and a secondary, and both are made of many cells which can take some damage and the entire system will still work.
I mean you could use it in a workstation, but unless you need 4 video cards locally it's probably overkill for most uses.
And a workstation should have no problem rebooting once in a while.
So no, it’s not going to start randomly hitting people.
> Seems comparably problematic to me.
Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years, and it has a clear workaround.
They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.
Not sure about others
Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.
Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?
It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.
Exceptions definitely exist, but the workarounds are both pretty straightforward and you can pick whichever is less impactful.