Why isn't AMD's MI300X competitive?

Why isn't AMD's MI300X competitive? (newsletter.semianalysis.com)

46 points by colonCapitalDee 65 days ago | 36 comments

oneofthose 63 days ago |

> AMD’s software experience is riddled with bugs [...] AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.

This has anecdotally been true since forever. Back in the day, OpenCL implementations passed conformance tests, but performance was poor. They could not turn hardware capabilities into performance for compute users. Drivers were buggy. Documentation was poor compared to NVIDIA's docs and forum. Offerings were inconsistent (look up SYCL from Codeplay), and ownership of the developer experience on AMD was unclear. The notion that it might not have improved, or is only now improving, is puzzling. It can't be for lack of recognizing the problem. Intuitively it does not seem so difficult. I'm curious what the reasons are.

dragandj 63 days ago | |

FWIW, back in 2015 OpenCL 2.0 performance was quite good on then-current AMD GPUs (IMO), but the problem was that 1. You had to implement everything yourself, from scratch, since AMD's GPU BLAS was barely compilable, and 2. They abandoned OpenCL that year and switched to HIP (or whatever their copy of CUDA was called), which didn't even compile (in practice) for quite some time, and 3. Even with HIP, you were on your own when it came to BLAS and other standard library implementations, because AMD provided nothing of the sort for a long time.

All in all, it's not that the drivers' performance was poor per se, but AMD did nothing about providing a software ecosystem, which meant its hardware wasn't realistically usable unless your pockets were deep enough that you could do AMD's job and fund the re-development of the whole ecosystem from scratch.

In other words, it was MUCH better ROI to just use Nvidia, pay a little bit more for the hardware, and save millions on software :)

sorenjan 63 days ago | | |

CUDA also compiles to PTX, which makes it much easier to distribute and therefore also easier for users to actually use. That doesn't matter much when you're writing code for specific hardware like the MI300X, but it's part of the developer story.

hgoel 63 days ago | |

I think it's useful to consider that NVIDIA bet on CUDA early; they've supported it since 2006. AMD has to do a lot less work, but it's still going to take a while to get all of the software into a competitive state.

Though, on the other hand, I'm not very convinced AMD is even seriously trying, with how much of a mess ROCm has continued to be. GCN was an excellent GPU compute architecture, but they never seemed to manage to make much of that.

I had been willing to put up with the software support struggles too, but the way ROCm support for the Radeon VII and 5000 series had been handled really put me off.

shrubble 63 days ago | |

Even before ATI was acquired by AMD they had driver support problems.

When I was working for a Unix commercial graphics software company, the CTO told me how bad the information he received under ATI’s NDA was: different revisions of the same chipset had contradictory register settings, so the driver had to identify the revision before writing a value to the write-only configuration registers. The same chipset might need a 0 or a 1; writing the wrong values could crash the driver.

WhyNotHugo 63 days ago | |

> AMD’s software experience is riddled with bugs [...] AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.

This was even true ~2005 in gaming circles. AMD drivers were buggy, so even when their cards were more performant at the same price point, folks opted for NVIDIA for reliability.

lmm 62 days ago | |

Cultural change in large organisations is hard to impossible. Like most hardware manufacturers, AMD culturally does not value software and never has. I've said before that it's not at all strange that AMD's drivers suck; the remarkable thing is that Nvidia has somehow managed to build a good software engineering culture that releases good drivers and libraries (at least relatively speaking).

andy_ppp 63 days ago |

Please just get everything in PyTorch to work, and work well (and across all graphics cards too). This is the starting point and it doesn't matter how you do it. But the fact you cannot even do some very basic stuff on AMD is going to mean you are left unused by researchers, so getting further up the stack is going to be almost impossible.

roenxi 63 days ago | |

Does PyTorch not work on AMD cards? I remain very nervous about returning to the AMD ecosystem. On paper AMD has been a compelling choice for GPGPU work for years, up until it turns out the hardware can't actually do what it claims. But the PyTorch problem seemed to be largely solved years ago. The issues weren't at the application layer; they were crippling firmware bugs that AMD didn't seem interested in getting a handle on. PyTorch ran fine until the computer kernel panicked or whatever, but that isn't a PyTorch problem.

joelthelion 63 days ago | |

The problem is "just". "Just" getting PyTorch to work, and to work well, is a huge undertaking.

andy_ppp 63 days ago | | |

"Just", in this case, means “at minimum” or “first and foremost, no excuses”. I obviously understand this is a huge undertaking. Nobody said attempting to be competitive with NVIDIA in AI would be a walk in the park.

blitzar 63 days ago | | |

For a trillion dollars, they should be able to figure it out.

Havoc 63 days ago |

Mirrors the geohot rants about AMD at the time, though as others point out, this (2024) is ancient news in the AI world, and I'm not quite sure what value it adds to the current discussions.

tripledry 63 days ago | |

Has this changed? If I want to go hands-on with development using PyTorch or whatever is used now, would you recommend an AMD card?

Genuine question, I have not followed this topic closely for years :)

c0balt 63 days ago | | |

Short answer, no.

There are better learning resources and a better ecosystem available around Nvidia cards & software (CUDA).

Long answer, it depends. It will add more challenges and require significantly more effort (even outside the GPU programming itself, debugging the toolchain etc. is a somewhat separate skill). The smaller/less mature ecosystem also means you will have fewer examples to look at for reference.

lhl 63 days ago | | |

RDNA is a whole different (and much more poorly supported) animal than CDNA. As someone with extensive experience in both, if you're asking the question, then, no.

(If you're just looking to learn, use the free Kaggle/Google Colab T4s/TPUs to get started.)

Havoc 63 days ago | | |

Still rocking a 3090, so I can't speak from experience, but the general vibe around simple at-home inference seems to be that it has improved (especially since both Vulkan and ROCm are now viable paths on newer cards).

>development using pytorch

I would probably still play it safe with Nvidia for anything more adventurous than token generation, even if it has improved.

ZiiS 63 days ago |

Correction: Why wasn't it competitive 2 years ago; basically half the AI summer ago.

pstuart 65 days ago |

If AMD's betting the company on their AI compute, they had best follow the advice in the article because the only way to compete with NVIDIA is to meet/exceed not just the performance but also the DevX.

dingdingdang 63 days ago | |

These days it's for sure the dev environment that is lacking: hardware is okay (potentially great?!), software abysmal. To run a local LLM in a stable manner implies using Vulkan; any attempt at ROCm is totally hamstrung by haphazard hardware support, along with an online presence poisoned by people primarily discussing work-arounds rather than work when it comes to AMD as a platform. Argh.

HarHarVeryFunny 63 days ago | | |

Is there any benefit to Vulkan over ROCm on a card where ROCm is fully supported?

KeplerBoy 63 days ago | |

You can't have good performance without good DevX. There's a reason why we get a new Python DSL for NVIDIA GPUs every week.

fancyfredbot 63 days ago |

Please amend the title; this is a December 2024 article, and the conclusions are misleading in 2026.

geremiiah 63 days ago |

I wonder if hiring is a big factor here. I presume all the really good systems+parallel programmers would rather gain more experience on NVIDIA hardware than AMD, so given the choice, they'd go with NVIDIA. Does AMD do enough to win them over?

_aavaa_ 65 days ago |

[2024]

pilililo2 63 days ago |

This is from more than 2 years ago, why post this now?

DiabloD3 65 days ago |

I love how they just butcher that article.

I remember when it came out a little over a year ago, and it's just as wrong today as it was then.

threepts 63 days ago |

NVIDIA has such a big moat around its CUDA architecture that I don't think AMD will ever be able to outcompete them in AI compute unless they somehow find 2-3 Nobel Prize-level breakthroughs today.

WhyNotHugo 63 days ago | |

I don’t think building a separate moat is ever going to work. They don’t just have to catch up and surpass in performance, but also justify customers moving out of one silo and into another vendor lock-in.

Sweepi 63 days ago |

please add [2024]

any current articles on that topic?

arka2147483647 63 days ago |

The important part of Hardware is Software

After all, if the Software does not work, it's just a Paperweight

wongarsu 63 days ago | |

And yet hardware companies with good software are the exception, not the norm. Is it just the cultural mismatch between hardware and software development life cycles and planning philosophies, or is there more to it?

wewewedxfgdf 63 days ago |

AMD just doesn't seem to be that good at software.

agunapal 63 days ago |

Nvidia had the first-mover advantage. Nvidia spent so many years perfecting CUDA to work well with PyTorch. Before ROCm, there was only CUDA. There were so many developers building their use cases on top of PyTorch+CUDA and bringing all that feedback to PyTorch; this made CUDA battle-ready and stable. AMD can get there, especially now with demand for compute, but as someone already said here, the biggest focus needs to be on PyTorch.