July 2024 Update on Instability Reports on Intel Core 13th/14th Gen Desktop CPUs

July 2024 Update on Instability Reports on Intel Core 13th/14th Gen Desktop CPUs (community.intel.com)

327 points by acrispino 1 year ago | 208 comments

phire 1 year ago |

I find it hard to believe that it actually is a microcode issue.

Mostly because Intel has way too much motivation to pass it off as a microcode issue, as they can fix a microcode issue for free, by pushing out a patch. If it's an actual hardware issue, then Intel will be forced to actually recall all the faulty CPUs, which could cost them billions.

The other reason is that it took them way too long to give details. If it's as simple as buggy microcode requesting an out-of-spec voltage from the motherboard, they should have been able to diagnose the problem extremely quickly and fix it in just a few weeks. They would have detected the issue as soon as they put voltage logging on the motherboard's VRM. And according to some sources, Intel have apparently been shipping non-faulty CPUs for months now (since April, from memory), and those don't have an updated microcode.

This long delay and silence feels like they spent months of R&D trying to create a workaround: a new voltage spec that provides the lowest voltage possible. Low enough to work around a hardware fault on as many units as possible, without too large of a performance regression, or creating new errors on other CPUs because of undervolting.

I suspect that this microcode update will only "fix" the crashes for some CPUs. My prediction is that in another month Intel will claim there are actually two completely independent issues, and reluctantly issue a recall for anything not fixed by the microcode.

RedShift1 1 year ago | |

As I understand it, there are multiple voltages inside the CPU, so just monitoring the motherboard VRM won't cut it.

That said I too am very skeptical. I just issued a moratorium on the purchase of anything Intel 13th/14th gen in our company and waiting for some actual proof that the issue is fully resolved.

phire 1 year ago | | |

It's complicated.

On Raptor Lake, there are a few integrated voltage regulators which provide new voltages for specialised uses (like the E-cores' L2 cache, parts of DDR memory IO, PCI-E IO), but the current draw on those regulators is pretty low. The bulk of the power comes directly from motherboard VRMs on one of several rails with no internal regulation. Most of the power draw is grouped onto just two rails: VccGT for the GPU, and VccCore (also known as VccIA in other generations), which powers all the P-cores, all the E-cores, the ring bus and the last-level cache.

Which means all cores share the same voltage, and it's trivial to monitor externally.

I guess it's possible the bug could be with only one of the integrated voltage regulators, but those seem to only power various IO devices, and I struggle to see how they could trigger this type of instability.

gwbas1c 1 year ago | |

It's most likely both a hardware issue and a microcode issue.

Making CPUs is kind of like sorting eggs. When they're made, they all have slightly different characteristics and get placed into bins (i.e., "binned") based on how they meet the specs.

To oversimplify, the *cough* "better" chips are sold at higher prices because they can run at higher clock speeds and/or handle higher voltages. If there's a speck of dust on the die, a feature gets turned off and the chip is sold for a lower price.
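To make the egg-sorting idea concrete, here is a toy sketch of binning. Everything in it is an assumption for illustration: the `Die` fields, the bin names and the thresholds are made up and do not reflect Intel's real test criteria or SKUs.

```python
from dataclasses import dataclass


@dataclass
class Die:
    max_stable_ghz: float  # highest clock that passed validation (made-up numbers)
    working_cores: int     # cores without defects
    igpu_ok: bool          # whether the integrated GPU passed its tests


# Hypothetical bins, best first: (name, min clock, min cores, needs working iGPU).
BINS = [
    ("top", 6.0, 24, True),
    ("high", 5.8, 24, True),
    ("high-no-igpu", 5.6, 24, False),
    ("mid", 5.4, 16, False),
]


def assign_bin(die: Die) -> str:
    """Return the first (best) bin whose requirements the die meets."""
    for name, min_ghz, min_cores, needs_igpu in BINS:
        if (
            die.max_stable_ghz >= min_ghz
            and die.working_cores >= min_cores
            and (die.igpu_ok or not needs_igpu)
        ):
            return name
    return "scrap"
```

In this sketch a die with a dead iGPU but otherwise great silicon falls through to the bin that has the iGPU fused off, which is the "feature gets turned off" case.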

In this case, this is most likely an edge case that would not be a defect if shipping microcode already handled it. (Although it is appropriate to ask whether handling it would have put affected chips into a lower-price bin.)

nequo 1 year ago | | |

> If there's a speck of dust on the die, a feature gets turned off and the chip is sold for a lower price.

Do you mean that if a 13900KS CPU has a manufacturing defect, it gets downgraded and sold as 13900F or something else according to the nature of the defect?

jfindley 1 year ago | |

The months of R&D to create a workaround could simply be because the subset of motherboards which trigger this issue are doing something borderline/unexpected with their voltage management, and finding a workaround for that behaviour in CPU microcode is non-trivial. Not all motherboard models appear to trigger the fault, which suggests that motherboard behaviour is at least a contributing factor to the problem.

ploxiln 1 year ago | | |

I think this issue was sort of cracked-open and popularized recently by this particular video from Level1Techs: https://www.youtube.com/watch?v=QzHcrbT5D_Y

Towards the middle of the video it brings up some very interesting evidence, from online game server farms that use 13900 and 14900 variants for their high single-core performance for the cost, but with server-grade motherboards and chipsets that do not do any overclocking, and would be considered "conservative". But these environments show a very high statistical failure rate for these particular CPU models. This suggests that some high percentage of CPUs produced are affected, and it's long run-time over which the problem can develop, not just enthusiast/gamer motherboards pushing high power levels.

starspangled 1 year ago | |

All modern CPUs come out of the factory with many many bugs. The errata you see published are only the ones that they find after shipping (if you're lucky, they might not even publish all errata). Many bugs are fixed in testing and qualification before shipping.

That's how CPU design goes. The way that is done is by pushing as much to firmware as possible, adding chicken switches and fallback paths, and all sorts of ways to intercept regular operation and replace it with some trap to microcode or flush or degraded operation.

Applying fixes and workarounds might cost quite a bit of performance (think of Spectre mitigations disabling some kinds of branch predictors for an obvious, very big one). And in some cases you even see in published errata that they leave some theoretical correctness bugs unfixed entirely. Where is the line before accepting returns? Very blurry and unclear.

Almost certainly, huge parts of their voltage regulation (which goes along with frequency, thermal, and logic throttling) will be highly configurable. Quite likely it's run by entirely programmable microcontrollers on chip. Things that are baked into silicon might be voltage/droop sensors, temperature sensors, etc., and those could behave unexpectedly, although even then there might be redundancy or ways to compensate for small errors.

I don't see that they "passed it off" as a microcode issue; they just said that a microcode patch could fix it. As you see, it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue". Most things can be fixed with firmware/microcode patches, by design. And many things are. For example, if some voltage sensor circuit on the chip behaved a bit differently than expected in the design but they could correct it by adding some offsets to a table, then the "issue" is that silicon deviates from the model / design and that cannot be changed, but a firmware update would be a perfectly good fix, to the point that they might never bother to redo the sensor even if they were doing a new spin of the masks.

On the voltage issue, they did not say it was requesting an out-of-spec voltage; they said it was incorrect. This is not necessarily detectable out of context. Dynamic voltage and frequency scaling and all the analog issues that go with it are fiendishly complicated. The voltage requested from a regulator is not what gets seen at any given component of the chip: loads, switching, capacitance, frequency, temperature, etc., can all conspire to change these things. And modern CPUs run as close to absolute minimum voltage/timing guard bands as possible to improve efficiency, and they boost up to as high voltages as they can to increase performance. A small bug or error in some characterization data in this very complicated algorithm of many variables and large multi-dimensional tables could easily cause voltage/timing to go out of spec and cause instability. And it does not necessarily leave some nice log you can debug, because you can't measure voltage from all billion components in the chip on a continuous basis.
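A heavily simplified sketch of the kind of lookup described above, to show how one bad characterization entry changes the request. This is not Intel's algorithm: the curve points, the guard band, the cap and the function name `request_voltage_mv` are all hypothetical, and a real implementation would have many more inputs (temperature, load, per-core data and so on).

```python
from bisect import bisect_left

# Hypothetical V/F points: (frequency in MHz, minimum stable voltage in mV).
VF_CURVE = [(800, 700), (3000, 950), (5000, 1250), (6000, 1450)]


def request_voltage_mv(freq_mhz, curve=VF_CURVE, guard_band_mv=50, vmax_mv=1500):
    """Interpolate the curve linearly, add a guard band, clamp to a maximum."""
    freqs = [f for f, _ in curve]
    if freq_mhz <= freqs[0]:
        base = curve[0][1]
    elif freq_mhz >= freqs[-1]:
        base = curve[-1][1]
    else:
        i = bisect_left(freqs, freq_mhz)
        (f0, v0), (f1, v1) = curve[i - 1], curve[i]
        base = v0 + (v1 - v0) * (freq_mhz - f0) / (f1 - f0)
    return min(base + guard_band_mv, vmax_mv)


# Same curve with a single entry mischaracterized by +100 mV.
BAD_CURVE = [(800, 700), (3000, 950), (5000, 1350), (6000, 1450)]

good = request_voltage_mv(5000)                   # 1300 mV with the made-up data
bad = request_voltage_mv(5000, curve=BAD_CURVE)   # 1400 mV with the bad entry
```

Note that the bad request is still under the made-up cap, so nothing in this toy model would flag it as out of spec. It is just wrong for that frequency, which is the distinction between "incorrect" and "out of spec".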

And some bugs just take a while to find and fix. I'm not a tester per se but I found a logic bug in a CPU (not Intel but commercial CPU) that was quickly reproducible and resulted in a very hard lockup of a unit in the core, but it still took weeks to find it. Imagine some ephemeral analog bug lurking in a dusty corner of their operating envelope.

Then you actually have to develop the fix, then you have to run that fix through quite a rigorous testing process and get reasonable confidence that it solves the problem, before you would even make this announcement to say you've solved it. Add N more weeks for that.

So, not to say a dishonest or bad motivation from Intel is out of the question. But it seems impossible to make such speculations from the information we have. This announcement would be quite believable to me.

ChoGGi 1 year ago | | |

I agree with most of what you said, so cherry picking one thingy to reply to isn't my intention, but

"And some bugs just take a while to find and fix."

I think it's less that it took a while to find the bug/etc, more so that they've been pretty much radio silent for six months. AMD had the issue with burning 7 series CPUs, and they were quick to at least put out a statement that they'll make customers whole again.

sqeaky 1 year ago | | |

> As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue"

They claimed:

> a microcode algorithm resulting in incorrect voltage requests to the processor.

worthless-trash 1 year ago | |

I believe that the waters may be muddied enough that they won't have to do a full recall, and will only act if you 'provide evidence' the system is still crashing.

burnte 1 year ago | |

> I find it hard to believe that it actually is a microcode issue.

They learned a lot from the Pentium disaster. Even if it's a hardware issue, they can at least address it with microcode, which is just as good.

yencabulator 1 year ago | | |

Except normally the result of a microcode workaround is that the chip no longer performs at its claimed/previously-measured level. Not "as good" by any standard.

For example, Intel CPU + Spectre mitigation is not "as good" as a CPU that didn't have the vulnerability in the first place.

HeliumHydride 1 year ago |

https://scholar.harvard.edu/files/mickens/files/theslowwinte...

"Unfortunately for John, the branches made a pact with Satan and quantum mechanics [...] In exchange for their last remaining bits of entropy, the branches cast evil spells on future generations of processors. Those evil spells had names like “scaling-induced voltage leaks” and “increasing levels of waste heat” [...] the branches, those vanquished foes from long ago, would have the last laugh."

"John was terrified by the collapse of the parallelism bubble, and he quickly discarded his plans for a 743-core processor that was dubbed The Hydra of Destiny and whose abstract Platonic ideal was briefly the third-best chess player in Gary, Indiana. Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas that might save his dreams of infinite scaling. He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs."

mattnewton 1 year ago | |

I haven't read this piece before but I just knew it was going to be written by Mickens about halfway through your comment.

throwup238 1 year ago | | |

The "mickens" in the URL on the first line was a dead giveaway :-)

yieldcrv 1 year ago | |

> According to my dad, flying in airplanes used to be fun... Everybody was attractive ....

this is how I feel about electric car supercharging stations at the moment. There is definitely a privilege aspect, which some attractive people are beneficiaries of in a predictable way, as well as other expensive maintenance for their health and attraction.

so I could see myself saying the same thing to my children

tux3 1 year ago |

Remains to be seen how the microcode patch affects performance, and how these CPUs that have been affected by over-voltage to the point of instability will have aged in 6 months, or a few years from now.

More voltage generally improves stability, because there is more slack to close timing. Instability with high voltage suggests dangerous levels. A software patch can lower the voltage from this point on, but it can't take back any accumulated fatigue.

tpurves 1 year ago |

I think it's telling that they are delaying the microcode patch until after all the reviewers publish their Zen 5 reviews and the comparisons of those chips against current Raptor Lake performance.

zenonu 1 year ago | |

Why even publish a comparison? Raptor Lake processors aren't a functioning product to benchmark against.

AnthonyMouse 1 year ago | | |

Because the benchmarks will still exist on the sites after the microcode is released and a lot of the sites won't bother to go back and update them with the accurate performance level.

tankenmate 1 year ago | | |

Because if publishers don't publish then they don't make money.

userbinator 1 year ago |

Reminds me of Sudden Northwood Death Syndrome, 2002.

Looks like history may be repeating itself, or at least rhyming somewhat.

Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.

Now, with CPU manufacturers attempting to squeeze all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) ignorance that overlooked reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.

Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to contribute to the evidence that high voltage is what kills CPUs, and neither heat nor frequency.

Edit: I just looked up the VCore maximum for the 13th/14th gen processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.

magicalhippo 1 year ago |

There was recently[1] some talk about how the 13th/14th gen mobile chips also had similar issues, though Intel insisted it's something else.

Will be interesting to see how that pans out.

[1]: https://news.ycombinator.com/item?id=41026123

TazeTSchnitzel 1 year ago |

After watching https://youtube.com/watch?v=gTeubeCIwRw and some related content, I personally don't believe it's an issue fixable with microcode. I guess we'll see.

jpk 1 year ago | |

Because HN doesn't provide link previews, I'd recommend adding some information about the content to your comment. Otherwise we have to click through to YouTube for the comment to make any sense.

That said, the video is the GamersNexus one where they talk about an unverified claim that this is a fabrication process issue caused by oxidation between atomic deposition layers. If that's the case, then yeah, microcode can only do so much. But like Steve says in the video, the oxidation theory has yet to be proven and they're just reporting what they have so far ahead of the Zen 5 reviews coming soon.

mananaysiempre 1 year ago | | |

GN mentioned shipping a few samples to a lab (number dependent on the price quote from said lab), so I hope we’ll have some closure regarding this hypothesis.

mjevans 1 year ago | | |

Hopefully Intel ships them the current pre-release microcode revision, and allows them to test and publish benchmarks with it for review comparison.

wnevets 1 year ago |

Are the CPUs that received elevated operating voltage permanently damaged?

Pet_Ant 1 year ago | |

This is the most pressing question. If it was just a microcode issue, a cool-off and power cycle ought to at least reset things, but according to Wendell from Level1Techs, that doesn't seem to always be the case.

kevingadd 1 year ago | | |

The problem is that running at too high of a voltage for sustained periods can cause physical degradation of the chip in some cases. Hopefully not here!

layer8 1 year ago | |

Not instantly it seems, but there have been reports of degradation over time. It will be a case-by-case thing.

userbinator 1 year ago | |

Possible electromigration damage, yes.

Covzire 1 year ago |

Just want to say, I'm incredibly happy with my 7800X3D. It runs at ~70C max, like Intel chips used to, with a $35 air cooler, and it's on average the fastest chip for gaming workloads right now.

amiga-workbench 1 year ago | |

I'm also very happy with my 5800X3D, it was wonderful value back when AM5 had just released and DDR5/Motherboards still cost an arm and a leg.

The energy efficiency is much appreciated in the UK with our absurd price of electricity.

SushiHippie 1 year ago | | |

Same, in my BIOS I can activate an "ECO Mode", which lets me decide if I want to run my 7950X at the full 170W TDP, 105W TDP or 60W TDP.

I benchmarked it, the difference between 170 and 105 is basically zero, and the difference to 60W is just a few percent of a performance hit, but way worth it, as it's ~0.3€/kWh over here.
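For a rough sense of scale, here is the arithmetic as a small sketch. It assumes the CPU sits at the configured power limit 24/7 all year, which is an upper bound rather than a measurement (real draw depends on load and the limits are not exact package power), and it uses the 0.30 €/kWh figure from above. The function name is made up.

```python
def yearly_cost_eur(watts, price_eur_per_kwh=0.30, hours_per_day=24.0):
    """Electricity cost of drawing `watts` continuously for 365 days."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * price_eur_per_kwh


for limit_w in (170, 105, 60):
    print(f"{limit_w:>3} W: {yearly_cost_eur(limit_w):7.2f} EUR/year")
```

Under those assumptions that works out to roughly 447, 276 and 158 EUR per year, so the gap between the 170W and 60W settings is on the order of 290 EUR a year for a machine that is fully loaded around the clock, and proportionally less for lighter use.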

NBJack 1 year ago |

I was concerned this would happen to them, given how much power was being pushed through their chips to keep them competitive. I get the impression their innovation has either truly slowed down, or AMD thought enough 'moves' ahead with their tech/marketing/patents to paint them into a corner.

I don't think Intel is done though, at least not yet.

brynet 1 year ago |

Curious why Intel announced this on their community forums, rather than somewhere more official.

guywithahat 1 year ago | |

That’s probably where people are most likely to understand it. A lot of companies do this, especially while they’re still learning things.

wmf 1 year ago | | |

These days people are more likely to see the announcement on YouTube, TikTok, or Twitter.

samtheprogram 1 year ago | |

Optics / stock price

beart 1 year ago | |

Based on what I know about corporations, it's entirely plausible that the folks posting the information don't actually have access to the communication channels you are referring to. I don't even know how I would issue an official communication at my own company if the need ever came up... so you go with what you have.

langsoul-com 1 year ago | |

Note how they mentioned it's still going to be tested with various partners before being released.

I.e., we think this might solve it, but if it doesn't we can roll back with the least amount of PR attention.

christkv 1 year ago |

The amount of current their chips pull on full boost is pretty crazy. It would definitely not surprise me if some could get damaged by extensive boosting.

cdchn 1 year ago |

I built a system last fall with an i9-13900K and have been having the weirdest crashing problems with certain games that I never had problems with before. NEVER been able to track it down, no thermal issues, no overclocking, all updated drivers and BIOS. Maybe this is finally the answer I've been looking for.

EricE 1 year ago | |

It was for me. Check for BIOS updates - most motherboard vendors have them. Look for and enable something labeled Intel Baseline Profile and then check. That cured it for me.

For Asus: https://www.pcgamer.com/hardware/motherboards/asus-adds-inte...

cdchn 1 year ago | | |

I'll try that, thanks. Although the current cohort of games I play seems more stable now. If I ever go back to EVE Online then it'd be more of an issue - that thing crashed constantly.

uticus 1 year ago |

Dumb question: let’s say I am in charge of procurement for a significant amount of machines, do I not have the option of ordering machines from three generations back? Are older (proven reliable) processors just not available because they’re no longer made, like my 1989 Camry?

wmf 1 year ago | |

Yeah, 12th gen is probably still available.

firebaze 1 year ago |

Nice that Intel acknowledges there are problems with that CPU generation. If I read this right, the CPUs have been supplied with a too-high voltage across the board, with some tolerating the higher voltages for longer, others not so much.

Curious to see how this develops in terms of fixing defective silicon.

nubinetwork 1 year ago |

They already tried BIOS updates when they pushed out the "Intel defaults" a couple months ago...

tedunangst 1 year ago | |

Except they didn't. https://www.pcworld.com/article/2326812/intel-is-not-recomme...

wmf 1 year ago | |

Firmware and microcode aren't the same thing.

jeffbee 1 year ago | | |

Very true and that's why it is odd that microcode has been mentioned here. Surely they mean PCU software (Pcode), or code for whatever they are calling the PCU these days.

nicman23 1 year ago | | |

firmware can include microcode though

PedroBatista 1 year ago |

Good for Intel to finally "figure it out", but I'm not 100% sure microcode is 100% of the problem. As in everything complex enough, the "problem" can actually be many compounded problems; MB vendors' "special" tuning comes to mind.

But this is already a mess that's very hard to clean up, since I feel many of these CPUs will die in a year or two because of these problems today, but by then nobody will remember this and an RMA will be "difficult" to say the least.

johnklos 1 year ago | |

You're right - at least partly. If the issue is that Intel was too aggressive with voltages, they can use microcode updates 1) as an excuse to rejigger the power levels and voltages the BIOS uses as part of the update, and 2) to have the processor itself be more conservative with the voltages and clocking it calculates itself.

Anything Intel announces, in my experience, is half true, so I'm interested to see what's actually true and what Intel will just forget to mention or will outright hide.

Havoc 1 year ago |

> Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages.

That’s great news for Intel, if that’s correct. If not, that’ll be a PR bloodbath.

salamo 1 year ago |

Is there any info on how to diagnose this problem? Having just put together a computer with the 14900KF, I really don't want to swap it out if not necessary.

ChoGGi 1 year ago |

Hmm, mid August is after the new Ryzens are out, I wonder how bad of a performance hit this microcode update will bring?

And will it actually fix the issue?

https://www.youtube.com/watch?v=QzHcrbT5D_Y

ChrisArchitect 1 year ago |

(updated from other post about mobile crashes)

Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up

https://news.ycombinator.com/item?id=40962736

Intel is selling defective 13-14th Gen CPUs

https://news.ycombinator.com/item?id=40946644

Intel's woes with Core i9 CPUs crashing look worse than we thought

https://news.ycombinator.com/item?id=40954500

Warframe devs report 80% of game crashes happen on Intel's Core i9 chips

https://news.ycombinator.com/item?id=40961637

silisili 1 year ago | |

That one is mobile, this one is desktop, which they claim are different causes.

tedunangst 1 year ago | |

Not a dupe.

whalesalad 1 year ago |

If I didn’t just recently invest in 128 GB of DDR4 I’d jump ship to AMD/AM5. My 13900K has been (knock on wood) solid though - with 24/7 uptime since July 2023.

thangngoc89 1 year ago | |

I guess you’re lucky. I own 2 machines for small scale CNN training, one 13900K and one 14900K. I have to throttle CPU performance to 90% for stable running. This costs me about 1 hour per 100 hours of training.

whalesalad 1 year ago | | |

Are you using any motherboard overclocking stuff? A lot of mobos are pushing these chips pretty hard right out of the box.

I have mine at a factory setting that Intel would suggest, not the ASUS MultiCore Enhancement crap. Noctua NH-D15 cooler. It’s really been a stable setup.

J_Shelby_J 1 year ago | |

I evaluated DDR4 vs DDR5 a year ago, and DDR5 wasn’t worth it. I was chasing FPS, and the cost to hit the same speed in DDR5 was just too high, so I stayed on DDR4 and I’m glad I did. I’m on a 13700K and I’m also very stable. However, with the stock XMP profile for my RAM I was very much not stable and getting errors and BSODs within minutes on an OCCT burn-in test. All I had to do was roll back the memory clock speed a few hundred MHz.

eigenform 1 year ago |

by "microcode" i assume they meant "pcode" for the PCU? (but they decided not to make that distinction here for whatever reason?)

Night_Thastus 1 year ago |

"Elevated operating voltage" my foot.

We've already seen examples of this happening on non-OC'd server-style motherboards that perfectly adhere to the Intel spec. This isn't like ASUS going 'hur dur 20% more voltage' and frying chips. If that's all it was it would be obvious.

Lowering voltage may help mitigate the problem, but it sure as shit isn't the cause.

sirn 1 year ago | |

It's worth noting that W680 boards are not server boards, they're workstation boards, and oftentimes they're overclockable (or even overclocked by default). Wendell actually showed the other day that the ASUS W680 board was feeding 253W into a 35W (106W boost) 13700T CPU by default[1].

Supermicro and ASRock Rack do sell W680 boards as servers (because it took Intel a really long time to release C266), but while they stick strictly to the spec, some boards are really not meant for K CPUs. For example, the Supermicro MBI-311A-1T2N is only certified for non-TVB E/T CPUs, and trying to run a K CPU on these can result in the board plumbing 1.55V into the CPU during single-core load (where 1.4V would already be on the higher side)[2].

In this particular case, the "non-OC'd server-style motherboard" doesn't really mean anything (even more so in the context of this announcement).

[1]: https://x.com/tekwendell/status/1814329015773086069

[2]: https://x.com/Buildzoid1/status/1814520745810100666

dwattttt 1 year ago | |

They also admit a microcode algorithm produces incorrect requests for voltages. It doesn't sound like they're trying to shift the blame; ASUS doesn't write that microcode.

paulmd 1 year ago | |

Specifically I think the concerns are around idle voltage and overshoot at this point, which is indeed something configured by OEMs.

edit: BZ just put out a video talking about running Minecraft servers destroying CPUs reliably, topping out at 83C, normally in the 50s, running 3600 speeds. Which is a clear issue with low-thread loads.

https://m.youtube.com/watch?v=yYfBxmBfq7k

acrispino 1 year ago |

An Intel employee is posting on reddit: https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_1...

A recent YouTube video by GamersNexus speculated the cause of instability might be a manufacturing issue. The employee's response follows.

Questions about manufacturing or Via Oxidation as reported by Tech outlets:

Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.

Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.

For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed

hsbauauvhabzb 1 year ago | |

So they were producing defective CPUs, identified & addressed the issue but didn’t issue a recall, defect notice or public statement relating to the issue?

Good to know.

Dylan16807 1 year ago | | |

It sounds like their analysis is that the oxidation issue is comfortably below the level of "defective".

No product will ever be perfect. You don't need to do a recall for a sufficiently rare problem.

And in case anyone skims, I will be extra clear, this is based on the claim that the oxidation is separate from the real problem here.

wslh 1 year ago | | |

It is the Pentium FDIV drama all over again! [1] It is even in chapter 4 of Andrew Grove's book!

[1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug

thelastparadise 1 year ago | | |

Dude's gonna be canned so hard.

loufe 1 year ago |

Intel cannot afford to be anything but outstanding in terms of customer experience right now. They are getting assaulted on all fronts and need to do a lot to improve their image to stay competitive.

scrlk 1 year ago | |

Intel should take a page out of HP's book when it came to dealing with a bug in the HP-35 (first pocket scientific calculator):

> The HP-35 had numerical algorithms that exceeded the precision of most mainframe computers at the time. During development, Dave Cochran, who was in charge of the algorithms, tried to use a Burroughs B5500 to validate the results of the HP-35 but instead found too little precision in the former to continue. IBM mainframes also didn't measure up. This forced time-consuming manual comparisons of results to mathematical tables. A few bugs got through this process. For example: 2.02 ln e^x resulted in 2 rather than 2.02. When the bug was discovered, HP had already sold 25,000 units which was a huge volume for the company. In a meeting, Dave Packard asked what they were going to do about the units already in the field and someone in the crowd said "Don't tell?" At this Packard's pencil snapped and he said: "Who said that? We're going to tell everyone and offer them a replacement. It would be better to never make a dime of profit than to have a product out there with a problem". It turns out that less than a quarter of the units were returned. Most people preferred to keep their buggy calculator and the notice from HP offering the replacement.

https://www.hpmuseum.org/hp35.htm

basementcat 1 year ago | | |

I wonder if Mr. Packard's answer would have been different if a recall would have bankrupted the company or necessitated layoff of a substantial percentage of staff.

Joel_Mckay 1 year ago | |

Their acquisition of Altera seemed to harm both companies irreparably.

Any company can reach a state where the Process people take over, and the Product people end up at other firms.

Intel could have grown a pair, and spun the 32 core RISC-V DSP SoC + gpu for mobile... but there is little business incentive to do so.

Like any rotting whale, they will be stinking up the place for a long time yet. =)

beacon294 1 year ago | | |

Could you elaborate on the process people versus product people?

xyst 1 year ago |

Wonder what Linus has to say on this. Dude knows how to rip into crappy Intel products

weberer 1 year ago | |

Torvalds or the YouTube guy?

happosai 1 year ago | | |

Yes

fefe23 1 year ago |

So on one hand they are saying it's voltage (i.e. something external, not their fault, bad mainboard manufacturers!).

On the other hand they are saying they will fix it in microcode. How is that even possible?

Are they saying that their CPUs are signaling the mainboards to give them too much voltage?

Can someone make sense of this? It reminds me of Steve Jobs' You Are Holding It Wrong moment.