CUDA Is Still a Giant Moat for Nvidia (weightythoughts.com)

84 points by j-wang 2 years ago | 111 comments

Anyone following geohot's current tinygrad struggles has seen this proven right in front of them. AMD GPUs are practically unusable for any serious ML work, and he had to learn it the hard way, having dropped 100K into AMD GPUs on the assumption that the drivers would work; he even personally offered to fix them if they didn't.

ghxst 2 years ago | |

Not just geohot: AMD has a really amazing opportunity here to work with many other talented engineers who would be more than willing to put resources into this if they had good documentation and tools to work with. It's really puzzling to me that AMD isn't taking this opportunity more seriously.

xiphias2 2 years ago | |

A shortened version of what was happening: the firmware/hardware is so bad that, instead of fixing it, the AMD team just added some restarts for when the AMD card locks up (which happens all the time with computations), and even those restarts don't work; the whole computer has to be restarted.

This serious bug has been open since May, and AMD doesn't seem to be responding as seriously as it should.

KennyBlanken 2 years ago | |

Is this the same geohot that 9+ months ago declared he was "done with AMD"?

Isn't geohot infamous for stealing other people's work?

PEBCAK?

That said, ROCm only officially supports a fraction of its product line, and an odd smattering throughout at that. It's a joke compared to CUDA which will run on damn near anything. And AMD has a long, long history of dogshit drivers (at least on Windows.)

AMD just doesn't seem to give enough of a shit to invest money into securing top talent for this, and NVIDIA will continue to stomp them.

ryandvm 2 years ago | | |

Yeah, the same guy that was going to single-handedly fix Twitter's search for Elon and resigned after 4 weeks saying there was nothing he could do.

justinclift 2 years ago | | |

> Isn't geohot infamous for stealing other people's work?

Do you mean the Sony PlayStation hacking, where they took legal action against him, or do you mean other stuff?

xiphias2 2 years ago | | |

That same bug is still open, not fixed. Azure announced access to AMD GPU cloud with NDA, but the cards are unusable for compute work as they lock up randomly.

Buttons840 2 years ago | |

I saw him on Twitch today in passing, the title was about "ripping <something> out of AMD drivers" or similar, so it seems he's still at it.

i67vw3 2 years ago | |

AMD is a deeply unserious company. They could have made a boatload of money for shareholders like Nvidia did, but the AMD management looks very bad to me.

Shareholders of AMD should look into it and do some firings of top Executives/CEO until morale improves.

DanielHB 2 years ago | | |

I think the problem AMD has is that they just don't have enough engineers and can't hire more, because Nvidia (and to a lesser extent Apple, AWS, Google and Microsoft) just gobble up all the people who have any experience with this sort of thing.

A long time ago AMD decided to focus 100% on budget consumer graphics (including consoles), and that was the right decision at the time. However, being in a low-margin business, it seems they don't have the people (or the budget for last-minute hiring) to pump out the R&D for a generic neural network platform without moving people away from their consumer graphics division.

jgord 2 years ago |

I don't understand this: aren't almost all ML NN models built in PyTorch, and aren't these compiled / JIT'd into a lower-level format? And can we not have various backends/drivers for that, such as CUDA / ROCm / VNNI?

The article is unsatisfying because it doesn't explain WHY CUDA reigns supreme.

One hypothesis put forward is that the main alternative, ROCm, is just not very complete and not very fast. That's a good argument.

Another hypothesis that is not considered: CUDA reigns supreme because NVIDIA GPUs reign supreme.

But people don't write CUDA code... they write PyTorch code?!
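To make the question above concrete, here is a toy sketch of per-backend operator dispatch. It is not PyTorch code: the registry, the `register` and `dispatch` helpers, and the backend names `backend_a` and `backend_b` are all hypothetical, and real frameworks are far more involved. The only point it illustrates is that the same high-level model code runs on a backend only if every operator it touches has a kernel registered for that backend.

```python
# Toy sketch of per-backend operator dispatch.
# All names are hypothetical; none of this is a real framework API.
import math
from typing import Callable, Dict, List, Tuple

# Maps (operator name, backend name) -> kernel implementation.
KERNELS: Dict[Tuple[str, str], Callable[..., List[float]]] = {}


def register(op: str, backend: str):
    """Decorator that records a kernel for one (op, backend) pair."""
    def wrap(fn):
        KERNELS[(op, backend)] = fn
        return fn
    return wrap


def _add(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


def _softmax(a: List[float]) -> List[float]:
    peak = max(a)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


# backend_a has kernels for both operators.
register("add", "backend_a")(_add)
register("softmax", "backend_a")(_softmax)

# backend_b only has "add"; "softmax" was never ported.
register("add", "backend_b")(_add)


def dispatch(op: str, backend: str, *args):
    """Look up the kernel for (op, backend) and call it."""
    try:
        kernel = KERNELS[(op, backend)]
    except KeyError:
        raise NotImplementedError(
            f"{op!r} has no kernel for backend {backend!r}"
        ) from None
    return kernel(*args)


def model(backend: str, a: List[float], b: List[float]) -> List[float]:
    """The 'user code': identical regardless of backend."""
    return dispatch("softmax", backend, dispatch("add", backend, a, b))
```

Calling `model("backend_a", ...)` succeeds, while `model("backend_b", ...)` raises `NotImplementedError` at the first unported operator, even though the user-level code never changed.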

fancyfredbot 2 years ago |

I am not sure CUDA is the moat, but yes, software is the moat.

To first order, nobody writes any CUDA, and even if you do, you are probably bad at it. The language is slightly easier to use than OpenCL, but writing really performant code is still a nightmare (a pipeline of asynchronous memory copies from global to shared memory is not easy to program, but this is a requirement for full performance on tensor cores).

So no, the moat really isn't the language. It's not even the libraries; it's the integration of the libraries into third-party software like PyTorch, JAX, etc. This is the truly massive advantage NVIDIA has, and they got it by being early and by being installed in an awful lot of machines.
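The "pipeline of asynchronous memory copies" mentioned in this comment is essentially double buffering: start fetching the next tile of data while computing on the current one. Below is a host-side analogy in plain Python, assuming a single background worker stands in for the copy engine. It is not CUDA and does not model shared memory, synchronization barriers, or tensor cores; `load_tile`, `compute`, and `pipelined_sum_of_squares` are hypothetical names used only for illustration.

```python
# Host-side analogy of double buffering: fetch tile i+1 while computing tile i.
# This is NOT CUDA; the function names are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def load_tile(data: Sequence[float], start: int, size: int) -> List[float]:
    """Stand-in for an asynchronous copy of one tile into fast memory."""
    return list(data[start:start + size])


def compute(tile: List[float]) -> float:
    """Stand-in for the math done on a tile once it is resident."""
    return sum(x * x for x in tile)


def pipelined_sum_of_squares(data: Sequence[float], tile_size: int) -> float:
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = None  # copy that is currently in flight, if any
        for start in range(0, len(data), tile_size):
            # Issue the copy for this tile before consuming the previous one,
            # so the copy and the computation can overlap.
            next_pending = copier.submit(load_tile, data, start, tile_size)
            if pending is not None:
                total += compute(pending.result())
            pending = next_pending
        # Drain the last tile still in flight.
        if pending is not None:
            total += compute(pending.result())
    return total
```

Even in this simplified form, the loop has a prologue (first copy with nothing to compute), a steady state, and an epilogue (last compute with nothing to copy). On a GPU the same structure also has to be coordinated across many threads, which is a large part of why it is hard to get right.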

sbierwagen 2 years ago |

If I was an AMD shareholder I'd seriously be considering a vote to remove CEO Lisa Su. They make nearly identical products to NVIDIA, yet that other company is worth literally ten times as much, because PyTorch actually works on their cards. Why isn't she prioritizing firmware that doesn't crash?

david-gpu 2 years ago | |

> Why isn't she prioritizing firmware that doesn't crash?

I used to work in the GPU industry and this sort of view is both pervasive and misguided.

GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.

Because of this, and in spite of the amount of time and resources spent on validation and verification, the hardware often contains flaws. It is the responsibility of the drivers to work around these flaws in various ways. When a flaw hasn't been discovered and worked around yet, you perceive it as the GPU being unstable or crashing.

There is no fast simple solution to this. You need a finely tuned corporate machine from beginning to end. Better hiring processes, better management, better design processes, better verification processes, better software development practices, better marketing and sales, better customer relations. Everything.

imtringued 2 years ago | | |

>GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.

This is like saying combustion engines are immensely complex machines when your car suddenly loses power on the highway for no apparent reason and then when you restart the engine it works for another five minutes again. When you drive on normal roads it works flawlessly. It must be the engine, right? After all, it is the most complicated aspect!

Except in reality it is far more likely for it to be a problem in the electronics driving the fuel pump or spark plug.

AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult-to-diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm.

croes 2 years ago | |

You want to fire someone who helped get AMD on top of Intel?

Pretty bad idea, especially in the midst of the AI hype.

curt15 2 years ago | |

AMD has a CPU division too, and Zen basically resurrected AMD against Intel.

jejeyyy77 2 years ago | |

"Why isn't she prioritizing firmware that doesn't crash?"

why can't xyz company build apps/websites/products that don't have bugs??

mrbishalsaha 2 years ago |

Well deserved in my mind. Nvidia has been pushing the use of AI chips for far too long. They literally did everything possible to make it happen.

I believe LLMs will be commoditised while the compute power will be the next big thing.

chii 2 years ago | |

> Well deserved in my mind.

not if this moat could be leveraged into a monopoly on AI chips, to the detriment of society.

I want to see competition in this space.

Unfortunately, the market rally of nvidia stock is suggesting that most investors are expecting this monopoly to eventuate.

Therefore, it is in the interest of society to ensure that such a software moat is not established. Look what happened to the web browser when Microsoft held a monopoly on it, and look at what is happening with Chrome, the Apple App Store, etc.

anon291 2 years ago | | |

> Look what happened to the web browser when microsoft held a monopoly on it, and look at what is happening with chrome, apple appstore, etc.

Realistically, what happened is that after a few decades of development, competitors arose and took the market. In the meantime, Microsoft became rich. Who cares?

andsoitis 2 years ago | | |

If the prize is big enough, others will rise.

aurareturn 2 years ago | |

>I believe LLMs will be commoditised while the compute power will be the next big thing.

Can you talk more about this? Would love to understand.

bsder 2 years ago |

CUDA is a moat because AMD and Intel are run by morons^W^W^W run by people who can't swallow the fact that software is more important than hardware.

Intel should be shoveling out 16GB Arc graphics cards for free to every graduate program in the country who can fill out a web form. In a couple years, they'd displace NVIDIA.

AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the software ecosystems of their consumer and professional cards.

Yes, there has been progress. However, when you look at the amount of money that AMD and Intel throw at software vs how much NVIDIA throws at software, it's an instant facepalm moment.

NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.

beryilma 2 years ago |

I don't know much about CUDA and NVIDIA, but it has always surprised me how hardware companies are so bad at producing good software tooling for their hardware.

Many microcontroller companies have terrible software support: no free C/C++ compilers, clunky IDEs, too much reliance on 3rd party software providers, no decent code libraries...

Even if they have software support, the code is bad and bloated. Look at ST's HAL libraries, for example. Thankfully, an open source or free tool often comes to the rescue, usually through the efforts of dedicated individual programmers. But billion-dollar companies relying on such 3rd party tooling seems insane to me.

Havoc 2 years ago |

Blows my mind that AMD isn’t throwing everything they’ve got at fixing this.

frozenport 2 years ago |

+1 on the "AMD are morons" train.

AMD recently got rid of one of the CUDA compatibility layers instead of extending it.

modeless 2 years ago | |

Chasing compatibility is a waste of time and ultimately counterproductive. The important software is open source, they can just add direct support for their stuff. What they need to do is fix the stability of their drivers, make their stuff work on every GPU they sell or have sold in the past few years (as CUDA always has), and pay employees to integrate support into all the popular open source projects while fixing every bug that gets reported.

And they need to release high-RAM versions of their next gaming GPUs. More than anything else that will incentivize people to switch. If they're selling 36 GB while Nvidia is still selling 24 GB, people will do what it takes to move over.

jjmarr 2 years ago | | |

> What they need to do is fix the stability of their drivers, make their stuff work on every GPU they sell or have sold in the past few years (as CUDA always has), and pay employees to integrate support into all the popular open source projects while fixing every bug that gets reported.

This takes a ton of employees, which is hard for a company with a fraction of the software employees of Nvidia. (On that note, there are 1185 engineering job postings on the AMD site right now... https://careers.amd.com/careers-home/jobs?categories=Enginee...)

pests 2 years ago | |

They didn't get rid of it, they dropped development and released it as open source.

steelbrain 2 years ago | | |

> they dropped development and released it as open source.

"They" (being AMD) didn't. The person they contracted put in a clause that allowed him to open source the work (years AFTER AMD stopped paying him).

Alifatisk 2 years ago |

Reading all the comments here, everyone seems to agree that AMD is betting on the wrong thing. Yet they continue down the same path.

- Abandoning ZLUDA was maybe not the best choice

- Not accepting the fact that software is just as important as hardware is wrong

- Pushing more VRAM into their cards would attract more people

- Fixing hardware issues (especially the restarts on every failure) should be a high priority

TMWNN 2 years ago |

How does Apple's Metal compare to/compete with CUDA? I know Ollama and LM Studio support Metal.

shmerl 2 years ago |

Lock-in should be broken. CUDA is one of the worst things about this whole ecosystem. Looks like AMD came close to breaking it, but they abandoned developing the translation layer.

anon291 2 years ago | |

Thinking that a CUDA translation layer will take away Nvidia's advantage is like expecting that writing a C compiler will spontaneously result in Unix.

shmerl 2 years ago | | |

It would take away a huge chunk of their advantage, no doubt about it. Let Nvidia compete on merit instead of lock-in. Then you can say their advantage lies in being better. But Nvidia is very lock-in oriented, which undermines the claim that they are so much better than everyone.

aurareturn 2 years ago |

  Chip War has a great section on how the Soviet Union tried a “just copy/steal” strategy in semiconductors and fell hopelessly behind because of it. It’s a great theoretical idea to just copy/steal and fast-follow, but semiconductors, AI, and other “harder technologies” require building human and intellectual capital that will get better with time. From there, you need to have the prior generation to keep up with ever-increasing complexity and difficulty as these things get more advanced.

I disagree with your section on Huawei and China. China isn't just trying to copy/steal AI. In terms of models, China is a bit behind in LLMs but arguably further ahead in self-driving cars. China is throwing everything at semiconductor manufacturing instead, because that's where their bottleneck truly is, not CUDA. Had Huawei had access to TSMC's 5nm and 3nm, they might already be equal to Nvidia in raw GPU prowess. After all, HiSilicon's Kirin already matched/exceeded Qualcomm before the Trump ban. Their 5G chips/implementation were well ahead of anyone else. In software, it's easier for China to adopt a CUDA alternative because China is usually really good at unifying under one vision, especially when they have to.