Why isn't AMD's MI300X competitive? (newsletter.semianalysis.com)
This has anecdotally been true since forever. Back in the day, OpenCL implementations were passing conformance tests, but performance was poor. They could not turn hardware capabilities into performance for compute users. Drivers were buggy. Documentation was poor compared to NVIDIA's docs and forum. Offerings were inconsistent (look up SYCL from Codeplay), and it was unclear who owned the experience of developing for AMD. The notion that it might not have improved, or is only now improving, is puzzling. It can't be for lack of recognizing the problem. Intuitively, it does not seem so difficult. I'm curious what the reasons are.
All in all, it's not that driver performance was poor per se, but AMD did nothing to provide a software ecosystem, which meant its hardware wasn't realistically usable unless your pockets were deep enough to do AMD's job and fund the redevelopment of the whole ecosystem from scratch.
In other words, it was MUCH better ROI to just use Nvidia, pay a little bit more for the hardware, and save millions on software :)
Though, on the other hand, I'm not very convinced AMD is even seriously trying, with how much of a mess ROCm has continued to be. GCN was an excellent GPU compute architecture, but they never seemed to manage to make much of that.
I had been willing to put up with the software support struggles too, but the way ROCm support for the Radeon VII and 5000 series had been handled really put me off.
When I was working for a Unix commercial graphics software company, the CTO told me how bad the information he received under ATI’s NDA was: different revisions of the same chipset had contradictory register settings, so the driver had to identify the revision before writing a value to the write-only configuration registers. The same chipset might need a 0 or a 1; writing the wrong values could crash the driver.
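To make the register story above concrete, here is a minimal sketch of the "identify the revision first" pattern. Everything in it is hypothetical: the revision IDs, the values, and the `FakeChip` class are invented for illustration and do not come from any real ATI documentation or driver.

```python
# Hypothetical mapping from chip revision to the value its config register expects.
# The revision IDs and values are made up for illustration.
CONFIG_VALUE_BY_REVISION = {
    0xA1: 0,
    0xA2: 1,
}


class FakeChip:
    """Stand-in for hardware whose config register can be written but not read back."""

    def __init__(self, revision):
        self.revision = revision
        self._config = None  # invisible to the driver once written

    def read_revision(self):
        return self.revision

    def write_config(self, value):
        self._config = value


def configure(chip):
    """Look up the revision before touching the write-only register."""
    revision = chip.read_revision()
    if revision not in CONFIG_VALUE_BY_REVISION:
        # The register cannot be read back to verify, so refuse to guess.
        raise ValueError(f"unknown chip revision {revision:#x}; refusing to write config")
    value = CONFIG_VALUE_BY_REVISION[revision]
    chip.write_config(value)
    return value
```

Since a write-only register cannot be checked after the fact, the sketch fails loudly on an unknown revision instead of writing a default.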
This was even true ~2005 in gaming circles. AMD drivers were buggy, so even when their cards were more performant at the same price point, folks opted for NVIDIA for reliability.
Genuine question, I have not followed this topic closely for years :)
There are better learning resources and a better ecosystem available around Nvidia cards & software (CUDA).
Long answer: it depends. It will add more challenges and require significantly more effort (even outside the GPU programming itself, debugging the toolchain etc. is a somewhat separate skill). The smaller, less mature ecosystem also means you will have fewer examples to look at for reference.
(If you're just looking to learn, use the free Kaggle/Google Colab T4s/TPUs to get started.)
>development using pytorch
Would probably still play it safe with Nvidia for anything more adventurous than token generation, even if it has improved.
I remember when it came out a little over a year ago, and it's just as wrong today as it was then.
Any current articles on that topic?
After all, if the software does not work, it's just a paperweight.