CPU Bugs (2018) (danluu.com)

98 points by for_xyz 5 years ago | 20 comments

The 8088 processor in the first IBM PC had a bug that gave me some grief.

(The code below is likely to have bugs of its own - I wrote it from memory as an illustration of the CPU bug - and thanks to 'tlb' for catching an error in my first draft. I also left out the question of what data segment the various MOV instructions use for their memory references, as it isn't relevant to this CPU bug.)

If you needed to work in a different stack from the one you were currently running on, you might do something like this:

  mov saveSP, sp
  mov sp, mySP
  ...
  mov sp, saveSP

This saves the original SP (Stack Pointer) register, loads it with your private value, and then restores SP when you are done.

Suppose you wanted to switch not only to your own stack pointer but also your own stack segment. With 16-bit registers you could only address 64KB at a time, and you would need to change a segment register to access memory outside that range.

So you would save, change, and restore both the SS (Stack Segment) and SP registers:

  mov saveSS, ss
  mov saveSP, sp
  mov ss, mySS
  mov sp, mySP
  ...
  mov ss, saveSS
  mov sp, saveSP

Now imagine that an interrupt fired between the change to SS and the matching change to SP. The interrupt handler would run on the new stack segment with the old stack pointer, so its pushes would land at an arbitrary offset in that segment, corrupting memory and probably crashing.
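To make the failure concrete, here is a small Python sketch of the address arithmetic, assuming the usual 8086 real-mode rule (physical address = segment * 16 + offset) and a 6-byte interrupt frame (FLAGS, CS, IP). The register values are hypothetical, picked only for illustration.

```python
def physical(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

def interrupt_frame(ss, sp):
    """Physical addresses of the 6 bytes (FLAGS, CS, IP) an interrupt pushes below SS:SP."""
    return [physical(ss, (sp - i) & 0xFFFF) for i in range(1, 7)]

# Hypothetical values, not taken from any real program.
old_ss, old_sp = 0x2000, 0x0100
my_ss, my_sp = 0x3000, 0x0800

# Interrupt arrives after "mov ss, mySS" but before "mov sp, mySP":
torn = interrupt_frame(my_ss, old_sp)
# Where the frame would have gone if both registers had been switched:
intended = interrupt_frame(my_ss, my_sp)

print("torn frame:    ", hex(min(torn)), "-", hex(max(torn)))
print("intended frame:", hex(min(intended)), "-", hex(max(intended)))
```

With these values the torn frame overwrites six bytes of the new segment that were never meant to be stack space, which is why the resulting crashes look random.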

Not to worry! Intel had your back. The documentation promised that after a MOV SS or POP SS, interrupts would automatically be disabled until the next instruction (the matching MOV SP or POP SP) completed.

But they kinda forgot to implement that feature. So if you followed the docs, you would have these very rare and intermittent crash bugs.

Word got around fairly soon, and the fix was simple enough: disable interrupts yourself around the paired instructions.

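  mov saveSS, ss      ; remember the caller's stack segment
  mov saveSP, sp      ; ... and stack pointer
  cli                 ; mask interrupts while SS:SP is inconsistent
  mov ss, mySS
  mov sp, mySP
  sti                 ; SS:SP now both point at the private stack
  ...
  cli                 ; mask interrupts again for the switch back
  mov ss, saveSS
  mov sp, saveSP
  sti                 ; back on the original stack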
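As a rough check of the reasoning, here is a toy Python model of the three cases: the documented behavior, the buggy chip, and the cli/sti workaround. It is only a sketch. The instruction names and the `mov_ss_shadow` flag are made up for the model, and it covers maskable interrupts only (no NMI).

```python
def unsafe_windows(program, mov_ss_shadow):
    """Return indices of instructions after which a maskable interrupt
    could fire while SS and SP belong to different stacks."""
    interrupts_enabled = True
    ss_owner = sp_owner = "old"
    unsafe = []
    for i, (op, arg) in enumerate(program):
        shadowed = False
        if op == "cli":
            interrupts_enabled = False
        elif op == "sti":
            interrupts_enabled = True
        elif op == "mov_ss":
            ss_owner = arg
            # Documented behavior: no interrupt until the next instruction completes.
            shadowed = mov_ss_shadow
        elif op == "mov_sp":
            sp_owner = arg
        if interrupts_enabled and not shadowed and ss_owner != sp_owner:
            unsafe.append(i)
    return unsafe

per_docs = [("mov_ss", "mine"), ("mov_sp", "mine"), ("work", None),
            ("mov_ss", "old"), ("mov_sp", "old")]

workaround = [("cli", None), ("mov_ss", "mine"), ("mov_sp", "mine"), ("sti", None),
              ("work", None),
              ("cli", None), ("mov_ss", "old"), ("mov_sp", "old"), ("sti", None)]

print(unsafe_windows(per_docs, mov_ss_shadow=True))     # CPU behaves as documented
print(unsafe_windows(per_docs, mov_ss_shadow=False))    # early 8088 with the bug
print(unsafe_windows(workaround, mov_ss_shadow=False))  # buggy CPU plus cli/sti
```

In this model the buggy chip leaves a one-instruction window after each MOV SS, and wrapping the pair in cli/sti closes both windows.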

This still left you unprotected against NMI (Non-Maskable Interrupt), but by the time most of us built NMI switches for our IBM PCs, we'd also upgraded to newer CPUs with this bug fixed. It was only the earliest 8088s (and perhaps 8086s) that had the bug.

tlb 5 years ago

Why does the pop at the end of:

  push sp
  mov sp, myPrivateSP
  ...
  pop sp

work? Isn't it popping from the private stack, while it was pushed on the regular stack?

Stratoscope 5 years ago

Oh, good catch! I was doing this from memory, and definitely have a bug there.

Updated now, hopefully this will be a more plausible example. Let me know if you spot something else! :-)

anonymousiam 5 years ago

"As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently."

A most prescient remark in 2014.

Here's where they are more recently:

https://www.zdnet.com/article/intel-fixed-236-bugs-in-2019-a...

https://www.techradar.com/news/latest-intel-cpus-have-imposs...

Flow 5 years ago

When this news broke I thought Intel had lost their minds.

Did they really intend to just "skip" validation or did they try to automate it further, to decrease time to produce a new chip?

hulitu 5 years ago

Testing is expensive. That's why it has a great potential for savings.

userbinator 5 years ago

It's funny to hear that the bug increases are an effect of Intel trying to compete with ARM SoCs in mobile devices, because the errata those have are much worse --- and indeed a lot of embedded stuff is like that because the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.

amelius 5 years ago

> the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.

How does that work for Apple's M1?

saagarjha 5 years ago

https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad95...

bombcar 5 years ago

Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect error-free CPU is not as incentivized.

jeffbee 5 years ago

The updates are often fatal, though. These include things like the Opteron "Barcelona" TLB bug and the first-generation EPYC "Naples" frequency scaling bug. The fix for the former knocked 20% off the performance of that generation of parts, and the fix for the latter meant that you had to run at the base clock frequency at all times, getting neither turbo boosts nor power savings. If you apply all of the speculative execution workarounds to an older Intel part like Xeon E5 v3 you will lose something like a quarter of the performance you paid for originally.

amelius 5 years ago

> Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect error-free CPU is not as incentivized.

This ignores the fact that there can be security exploits.

formerly_proven 5 years ago

Are there ARM CPUs that have upgradeable microcode?