AMD may get across the CUDA moat (hpcwire.com)

551 points by danzheng 2 years ago | 302 comments

omneity 2 years ago |

I was able to use ROCm recently with PyTorch and after pulling out some hair it worked quite well. The Radeon GPU I had on hand was a bit old and underpowered (RDNA2) and it only supported matmul on fp64, but for the job I needed done I saw a 200x increase in it/s over CPU despite the need to cast everywhere, and that made me super happy.

Best of all is that I simply set the device to `torch.device('cuda')` rather than openCL, which does wonders for compatibility and to keep code simple.
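To illustrate the point about `torch.device('cuda')`, here is a minimal sketch that picks a device name the same way on NVIDIA and AMD builds of PyTorch. It assumes only `torch.cuda.is_available()` and `torch.version.hip` (a version string on ROCm builds, `None` otherwise). The helper names `pick_device_name` and `backend_label` are made up for this example, and the code falls back gracefully when PyTorch is not installed.

```python
# Sketch: select a device the same way on NVIDIA (CUDA) and AMD (ROCm) builds.
try:
    import torch
except ImportError:  # PyTorch not installed; the sketch still runs
    torch = None


def pick_device_name():
    """Return "cuda" if a GPU backend is usable, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"  # the same name is used on ROCm builds of PyTorch
    return "cpu"


def backend_label():
    """Describe which GPU stack this PyTorch build targets (hypothetical helper)."""
    if torch is None:
        return "no-pytorch"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    if getattr(torch.version, "hip", None):
        return "rocm"
    if getattr(torch.version, "cuda", None):
        return "cuda"
    return "cpu-only"


print(pick_device_name(), backend_label())
```

With PyTorch installed, `torch.device(pick_device_name())` should then work unchanged on either vendor's card.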

Protip: Use the official ROCm PyTorch base Docker image [0]. The AMD setup is so finicky and dependent on specific versions of SDK/drivers/libraries, and it will be much harder to make work if you try to install them separately.

[0]: https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/p...

mikepurvis 2 years ago | |

Sigh. It's great that these container images exist to give people an easy on-ramp, but they definitely don't work for every use case (especially once you're in embedded where space matters and you might not be online to pull multi-gb updates from some registry).

So it's important that vendors don't feel let off the hook to provide sane packaging just because there's an option to use a kitchen-sink container image they rebuild every day from source.

xahrepap 2 years ago | | |

I know it's still different than what you're looking for, so you probably already know this, but many projects like this have the Dockerfile on github which shows exactly how they set up the image. For example:

https://github.com/RadeonOpenCompute/ROCm-docker/blob/master...

They also have some for Fedora. Looks like for this you need to install their repo:

    curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - \
    && printf "deb [arch=amd64] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" | tee /etc/apt/sources.list.d/rocm.list \
    && printf "deb [arch=amd64] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" | tee /etc/apt/sources.list.d/amdgpu.list

then install Python, a couple other dependencies (build-essential, etc) and then the package in question: rocm-dev

So they are doing the packaging. There might even be documentation elsewhere for that type of setup.

fwsgonzo 2 years ago | | |

I feel the same way, especially about build systems. OpenSSL and v8 are among a large list of things that have horrid build systems. Only way to build them sanely is to use some rando's CMake fork, then it Just Works. Literally a two-liner in your build system to add them to your project with a sane CMake script.

amelius 2 years ago | | |

> So it's important that vendors don't feel let off the hook to provide sane packaging just because there's an option to use a kitchen-sink container image they rebuild every day.

Sadly if e.g. 95% of their users can use the container, then it could make economical sense to do it that way.

mathisfun123 2 years ago | | |

> especially once you're in embedded

is this a real problem? exactly which embedded platform has a device that ROCm supports?

ngcc_hk 2 years ago | | |

Better to come if the tide shift so we can have compatible layer. The key is the tide. Obviously would n try to sue … it would be a sign that finally we have real competition. Gar is where innovation do.

X86 cannot do 64 bit let us do this and that so the market can use only our cpu. Repeat with me x86-64 is impossible.

Not sure Apple is in this otherwise the real great competition come.

wyldfire 2 years ago | |

> Best of all is that I simply set the device to `torch.device('cuda')` rather than openCL, which does wonders for compatibility

Man oh man where did we go wrong that cuda is the more compatible option over OpenCL?

KeplerBoy 2 years ago | | |

It must be a misnomer on PyTorch's side. Clearly it's neither CUDA nor OpenCL.

AMD should just get its shit together. This is ridiculous. Not the name, but the fact that you can only do FP64 on a GPU. Everybody is moving to FP16 and AMD is stuck on doubles?

NavinF 2 years ago | | |

This has always been the case. OpenCL is a shit show

RockRobotRock 2 years ago | |

Have you gotten it to work with Whisper by any chance?

kkielhofner 2 years ago | | |

Whisper is actually a great example of why Nvidia has such a stronghold on ML/AI and why it’s so difficult to compete.

There’s getting something to “work”, which is often enough of a challenge with ROCm. Then there’s getting it to work well (next challenge).

Then there’s getting it to work as well as Nvidia/CUDA.

With Whisper, as one example, you should be running it with ctranslate2[0]. Of all the platforms on their supported list you won’t find ROCm.

When you really start to look around you’ll find that ROCm is (at best) still very much in the “get it to work (sometimes)” stage. In most cases it’s still a long way away from getting it to work well, and even further away from making it actually competitive with Nvidia for serious use cases and applications.

People get excited about the progress ROCm has made getting basic things to work with PyTorch and this is good - progress is progress. But when you save 20% on the hardware and the equivalent Nvidia product is often somewhere between 5-10x as performant (at a fraction of the development time) because of vastly superior software support, you realize pretty quickly that Nvidia is actually a bargain compared to AMD.

I’m desperately rooting for Nvidia to have some actual competition but after six years of ROCm and my own repeated failed attempts to have it make any sense overall I’m only more and more skeptical that real competition in the space will come from AMD.

[0] - https://github.com/OpenNMT/CTranslate2

pedrovhb 2 years ago | | |

I've had luck with an RX5700XT and whisper.cpp built with clblast. Works like a charm, not entirely a scarring experience getting it to work (easier than most other stuff which was surprising to me).

One arcane detail is that whereas for PyTorch I have to set the env var HSA_OVERRIDE_GFX_VERSION to 10.3.0, getting it to run with whisper.cpp and llama.cpp requires setting it to 10.1.0. Good luck and may it cost you less hair than it did me.
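A small sketch of how one might keep those per-tool overrides straight when launching things from a script. The values are the ones reported above for an RX 5700 XT and may not apply to other cards; `GFX_OVERRIDES` and `env_for` are hypothetical names used only for illustration.

```python
import os

# Override values reported above for an RX 5700 XT; other cards may differ.
GFX_OVERRIDES = {
    "pytorch": "10.3.0",
    "whisper.cpp": "10.1.0",
    "llama.cpp": "10.1.0",
}


def env_for(tool, base_env=None):
    """Return a copy of the environment with the override set for `tool`."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = GFX_OVERRIDES[tool]
    return env


# Pass the result as `env=` to subprocess.run([...], env=...) when launching.
print(env_for("pytorch", {})["HSA_OVERRIDE_GFX_VERSION"])
```

Building a fresh environment per launch avoids leaving the wrong value exported in a shell that later runs the other tool.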

incognition 2 years ago | |

Fp64??

latchkey 2 years ago | | |

https://en.wikipedia.org/wiki/Double-precision_floating-poin...

NVIDIA fp32 (H100) has 2x more TFLOPS than AMD's fp32 (MI250) and AI doesn't need fp64 precision.

fransje26 2 years ago | | |

Hardware limitation.

javchz 2 years ago |

CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.

Running Nvidia in Linux isn't as much fun. Fedora and Debian can be incredibly reliable systems, but when you add an Nvidia card, I feel like I am back in Windows Vista with kernel crashes from time to time.

IronWolve 2 years ago |

Yup, thank the hobbyists. PyTorch is allowing other hardware. Stable Diffusion is working on M-series chips, Intel Arc, and AMD.

Now what I'd like to see is real benchmarks for compute power. Might even get a few startups to compete in this new area.

withwarmup 2 years ago |

CUDA is the result of years of NVIDIA supporting the ecosystem. Some people like to complain because they bought hardware that was cheaper but can't use it for what they want. When you buy NVIDIA, you aren't buying only the hardware but also the insane amount of work they have put into the ecosystem. The same goes for Intel: MKL and scikit-learn-intelex aren't free to develop.

AMD has the hardware but the support for HPC is non-existent outside of the joke that is BLIS and AOCL.

I really wish for more competitors to enter the market in HPC, but AMD has a shitload of work to do.

arcanus 2 years ago | |

> AMD has the hardware but the support for HPC is non-existent outside of the joke that is BLIS and AOCL.

You are probably two years behind the state of the art. The world's largest supercomputer, OLCF's Frontier, runs AMD CPUs and GPUs. It's emphatically using ROCm, not just BLIS and AOCL. See for example: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html

That's hardly non-existent support for HPC.

65a 2 years ago | | |

Agreed... the main gap is support on consumer and workstation cards, which is where nVidia made headway, but that is starting to erode super recently. ROCm works pretty well for me; I have had a lot more problems with specific packagers than with the ROCm layer.

aiunboxed 2 years ago | |

Exactly. NVIDIA's core focus on AI, way before it was cool, has led to them being in this advantageous position. For AMD, just being a price-friendly competitor to Intel and Nvidia was the motto.

runiq 2 years ago | |

Yeah, that's a pretty shortsighted take on things. Do you really believe that Nvidia hasn't taken steps to make sure their moat is as wide as possible?

Blammar 2 years ago | | |

The thing about owning the CUDA spec is that Nvidia can add new features quickly without having to argue with other hardware vendors. I find that a positive thing overall.

Also, I choose to pay the ~$120 Windows tax once (per box), everything works very well, and I don't have the driver issues that some fraction of other users seem to have with Linux and Nvidia cards. Seems like a good use of my time.

pama 2 years ago |

There is only limited empirical evidence of AMD closing the gap that NVidia has created in the science or ML software. Even when considering pytorch only, the engineering effort to maintain specialized ROCm along with CUDA solutions is not trivial (think flashattention, or any customization that optimizes your own model). If your GPUs only need a simple ML workflow all times for a few years nonstop, maybe there exist corner cases where the finances make sense. It is hard for AMD now to close the gap across the scientific/industrial software base of CUDA. NVidia feels like a software company for the hardware they produce; luckily they make the money from hardware thus cannot lock the software libraries.

(Edited “no” to limited empirical evidence after a fellow user mentioned El Capitan.)

fotcorn 2 years ago | |

ROCm has HIP (1) which is a compatibility layer to run CUDA code on AMD GPUs. In theory, you only have to adjust #includes, and everything should just work, but as usual, reality is different.
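To give a feel for why the port is mostly mechanical "in theory", here is a toy sketch of the kind of renaming involved. It is not AMD's hipify tooling, only an illustration: the table holds a handful of CUDA runtime names and their HIP counterparts, and `toy_hipify` is a hypothetical helper.

```python
import re

# A handful of CUDA runtime names and their HIP counterparts (partial list).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice is not cut short by cudaMemcpy.
_PATTERN = re.compile("|".join(
    re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True)))


def toy_hipify(source):
    """Rename the known CUDA identifiers in `source` to their HIP spellings."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d, n);\n"
    "cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);\n"
    "cudaFree(d);\n"
)
print(toy_hipify(cuda_src))
```

Anything outside such a table (inline PTX, vendor libraries without an equivalent, architecture-specific tuning) is where "reality is different" tends to show up.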

Newer backends for AI frameworks like OpenXLA and OpenAI Triton directly generate GPU native code using MLIR and LLVM, they do not use CUDA apart from some glue code to actually load the code onto the GPU and get the data there. Both already support ROCm, but from what I've read the support is not as mature yet compared to NVIDIA.

1: https://github.com/ROCm-Developer-Tools/HIP

Certhas 2 years ago | |

The fact that El Capitan is AMD says that at least for Science/HPC there definitely is evidence of a closing gap.

pama 2 years ago | | |

Thanks. You are actually right that this new supercomputer might move the needle once it is in production mode. I will wait and see how it goes.

falconroar 2 years ago | |

I don't understand why developers of PyTorch and similar don't use OpenCL. Open standard, runs everywhere, similar performance - what's the downside??

pama 2 years ago | | |

I don’t know for sure why the early pytorch team picked it, but my guess is due to simplicity and performance. NVidia optimizes CUDA better than OpenCL and provides tons of useful performance tuning tools. It is hard to match the CUDA performance with OpenCL even on the same NVidia GPU hardware, and making performant code compatible across different GPUs with OpenCL is also hard. I know examples of scientific codes that became simpler and faster (on nvidia hardware) by going from openCL to CUDA but haven’t yet heard of examples the other way around.

Roark66 2 years ago |

I think the article claiming "PyTorch has dropped the drawbridge on the CUDA moat" is way overoptimistic. Yes, PyTorch is widely used by researchers and by users to quickly iterate over various ways to use the models, but when it comes to inference there are huge gains to be had by going a different route. Llama.cpp has shown 10x speedups on my hardware (32GB of GPU RAM + 32GB of CPU RAM), for example, for models like falcon-40b-instruct. For much smaller models on the CPU (under 10B) I saw up to 3x speedup just by switching to onnc and openvino.

Apple has shown us in practice the benefits of CPU/GPU memory sharing. Will AMD be able to follow in their footsteps? The article claims AMD has a design with up to 192GB of shared RAM. Apple is already shipping a design with the same amount of RAM (if you can afford it). I wish them (AMD) success, but I believe they need to aim higher than just matching Apple in some unspecified future.

bigcat12345678 2 years ago |

Cuda is the foundation

NVIDIA's moat is the years of work built by the OSS community, big corporations, and research institutes

They spend all time building for cuda, a lot of implicit designs are derived from cuda's characteristic

That will be the main challenge

mikepurvis 2 years ago | |

It depends on the domain. Increasingly people's interfaces to this stuff are the higher level libraries like tensorflow, pytorch, numpy/cupy, and to a lesser degree accelerated processing libraries such as opencv, PCL, suitesparse, ceres-solver, and friends.

If you can add hardware support to a major library and improve on the packaging and deployment front while also undercutting on price, that's the moat gone overnight. CUDA itself only matters in terms of lock-in if you're calling CUDA's own functions.

bigcat12345678 2 years ago | | |

What I meant is that all this stuff has 15 years of implicit accumulation of knowledge, tips, and even hacks built into the software.

No matter what you depend on, you'll have a slew of larger or minor obstacles or annoyances.

That collectively is the moat itself.

As you said, it's already clear that replacing CUDA itself is not that daunting.

pixelesque 2 years ago |

Does AMD have a solution to forward device compatibility (like PTX for NVidia)?

Last time I looked into ROCm (two years ago?), you seemed to have to compile stuff explicitly for the architecture you were using, so if a new card came out, you couldn't use it without a recompile.

mnau 2 years ago | |

Not natively, but AdaptiveCpp (previously hipSYCL, then Open SYCL) has a single-source, single-compiler-pass flow, where they basically store LLVM IR as an intermediate representation.

https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/...

The performance penalty was within a few percent, at least according to the paper (figures 9 and 10): https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...

einpoklum 2 years ago | |

I don't know what they do with ROCm, but with OpenCL, the answer is: Certainly. It's called SPIR:

https://www.khronos.org/spir/

nabla9 2 years ago |

> Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Nvidia has put a huge amount of work into making code run smoothly and fast. AMD has to work hard to catch up. ROCm code is slower, has more bugs, doesn't have enough features, and has compatibility issues between cards.

RcouF1uZ4gsC 2 years ago |

I am not so sure.

Everyone knows that CUDA is a core competency of Nvidia and they have stuck to it for years and years refining it, fixing bugs, and making the experience smoother on Nvidia hardware.

On the other hand, AMD has not had the same level of commitment. They used to sing the praises of OpenCL. And then there is ROCm. Tomorrow, it might be something else.

Thus, Nvidia CUDA will get a lot more attention and tuning from even the portability layers because they know that their investment in it will reap dividends even years from now, whereas their investment in AMD might be obsolete in a few years.

In addition, even if there is theoretical support, getting specific driver support and working around driver bugs is likely to be more of a pain with AMD.

AnthonyMouse 2 years ago | |

This is what people complain about, but at the same time there aren't enough cards, so the people with AMD cards want to use them. So they fix the bugs, or report them to AMD so they can fix them, and it gets better. Then more people use them and submit patches and bug reports, and it gets better.

At some point the old complaints are no longer valid.

hot_gril 2 years ago |

People complain about Nvidia being anticompetitive with CUDA, but I don't really see it. They saw a gap in the standards for on-GPU compute and put tons of effort into a proprietary alternative. They tied CUDA to their own hardware, which sorta makes technical sense given the optimizations involved, but it's their choice anyway. They still support the open standards, but many prefer CUDA and will pay the Nvidia premium for it because it's actually nicer. They also don't have CPU marketshare to tie things to.

Good for them. We can hope the open side catches up either by improving their standards, or adding more layers like this article describes.

zirgs 2 years ago | |

CUDA was released in 2007 and the development of it started even earlier - possibly even in the 90s. Back then nobody else cared about GPU compute. OpenCL came out 2 years after that.

killerstorm 2 years ago | | |

Not true. People got interested in general-purpose GPU compute (GPGPU) in early 2000s when video cards with programmable shaders became available. https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

People made a programming language & a compiler/runtime for GPGPU in 2004: https://en.wikipedia.org/wiki/BrookGPU

binarymax 2 years ago |

And the question for most that remains once AMD catches up: will the duopoly result in lower prices to a reasonable level for hobbyists or bootstrapped startups, or will AMD just gouge like NVidia?

superkuh 2 years ago |

>There is also a version of PyTorch that uses AMD ROCm, an open-source software stack for AMD GPU programming. Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Unfortunately, since the AMD firmware doesn't reliably do what it's supposed to, those ROCm calls often don't either. That's if your AMD card is even still supported by ROCm: the AMD RX 580 I bought in 2021 (the great GPU shortage) had its ROCm support dropped in 2022 (4 years support total).

The only reliable interface in my experience has been via opencl.

htrp 2 years ago | |

has opencl actually improved enough to be competitive?

orangepurple 2 years ago | | |

I thought ONNX is supposed to be the ultimate common denominator for machine learning model cross platform compatibility

zucker42 2 years ago | |

Do you mean OpenCL using Rusticl or something else? And what DL framework, if any?

superkuh 2 years ago | | |

I should clarify that I mean for human person uses. Not commercial or institutional. But, clBLAST via llama.cpp for LLM currently. Or far in the past just pure opencl for things with AMD cards.

65a 2 years ago | |

ROCm works fine on my 2016 Vega Frontier edition, for what it's worth.

the__alchemist 2 years ago |

When coding using Vulkan, for graphics or compute (The latter is the relevant one here), you need to have CPU code (Written in C++, Rust etc), then serialize it as bytes, then have shaders which run on the graphics card. This 3-step process creates friction, much in the same way as backend/serialization/frontend does in web dev. Duplication of work, type checking not going across the bridge, the shader language being limited etc.

My understanding is CUDA's main strength is avoiding this. Do you agree? Is that why it's such a big deal? Ie, why this article was written, since you could always do compute shaders on AMD etc using Vulkan.

mark_l_watson 2 years ago |

NVidia hardware/CUDA stack is great, but I also love to see competition from AMD, George Hotz’s Tiny Corp, etc.

Off topic, but I am also looking with great interest at Apple Silicon SOCs with large internal RAM. The internal bandwidth also keeps getting better which is important for running trained LLMs.

Back on topic: I don’t own any current Intel computers but using Colab and services like Lambda Labs GPU VPSs is simple and flexible. A few people here mentioned if AMD can’t handle 100% of their workload they will stick with Intel and NVidia - understandable position, but there are workarounds.

physicsguy 2 years ago |

Don’t agree at all. PyTorch is one library - yes, it’s important that it supports AMD GPUs but it’s not enough.

The ROCm libraries just aren’t good enough currently. The documentation is poor. AMD need to heavily invest in their software ecosystem around it, because library authors need decent support to adopt it. If you need to be a Facebook sized organisation to write an AMD and CUDA compatible library then the barrier to entry is too high.

weebull 2 years ago | |

Disagree that the ROCm libraries are poor. Their integration with everything else is poor because everything else is so highly Nvidia-centric, and AMD can't just write to the same API because it's copyrighted by Nvidia (see Oracle's Java case).

The adoption of CUDA has been such a coup for Nvidia, it's going to take some time to dismantle it.

physicsguy 2 years ago | | |

I don’t use high level frameworks like PyTorch because my work is in computational physics so I do actually use the lower level libraries. The documentation doesn’t even come close although it has got better. But they’re just not at feature parity, and that’s not on anyone but AMD currently. They need to invest more in the core libraries.

Just look at cuFFT vs rocFFT for e.g… they aren’t even close to being at feature parity - things like multi GPU is totally missing and callbacks are still “experimental”. These are pretty basic features - bear in mind that when people ported from CPU codes CUDA had to support these because they existed in FFTW (transforms over multiple CPUs rather than GPUs though via MPI).

alecco 2 years ago |

Regurgitated months-old content. blogspam

ris 2 years ago |

I don't understand the author's argument (if there is one) - pytorch has existed for ages. AMD's Instinct MI* range has existed for years now. If these are the key ingredients why has it not already happened?

fluxem 2 years ago |

I call it the 90% problem. If AMD works for 90% of my projects, I would still buy NVIDIA, which works for 100%, even though I’m paying a premium

hot_gril 2 years ago | |

I'm lazy, so it's 99% for me. I don't even mess with AMD CPUs; I know they're not exactly the same instruction set as Intel, and more importantly they work with a different (and less mainstream) set of mobos, so I don't want em. If AMD manages to pull more customers their way, that's great, it just means lower Intel premium for me.

bornfreddy 2 years ago | | |

That's an interesting take. AMD mobos are no "less mainstream" than Intel ones are... When you choose a CPU you are also choosing a compatible mobo chipset. The companies that make motherboards are mostly the same, so there should be no big difference between those.

Also, while the CPU instruction sets are not exactly equal, the same is true for Intel processors of different generations too. And it doesn't matter one bit... Unless there is a bug in CPU you will never notice the difference, because it is taken care of at the compiler / kernel level.

Intel does have some advantages (and disadvantages too) over AMD, just not those.

anon291 2 years ago | | |

I have no idea what you're talking about. AMD and Intel match on the ISA in any case you'd typically see. Moreover, Intel is currently using AMD's instruction set. x86_64 was designed by AMD and used to be called AMD64.

Flameancer 2 years ago | | |

What mainstream board company is Intel-only? Maybe a decade ago on AM3(+), but on AM4/AM5 I haven’t seen a mainboard partner not offer the same board SKU that works with Intel and AMD.

65a 2 years ago | | |

As an owner of some Sapphire Rapids parts, let me just direct you to: https://edc.intel.com/content/www/us/en/design/products-and-...

hot_gril 2 years ago | | |

Forgot to also mention iGPU and other on-chip accelerators being different and Intel usually having the edge there.

nologic01 2 years ago |

If the AI hype persists the CUDA moat will be less relevant in ~2 yrs.

Historically HPC was simply not sufficiently interesting (in commercial sense) for people to throw serious resources in the direction of making it a mass market capability.

NVIDIA first capitalized on the niche crypto industry (which faded) and was then well positioned to jump into the AI hype. The question is how much of the hype will become real business.

The critical factor for the post-CUDA world is not any circumstantial moat but who will be making money servicing stable, long term computing needs. I.e., who will be buying this hardware not with speculative hot money but with cashflow from clients that regularly use and pay for a HPC-type application.

These actors will be the long term buyers of commercially relevant HPC and they will have quite a bit of influence on this market.

ddtaylor 2 years ago |

It's worth noting that AMD also has a ROCm port of Tensorflow.

ginko 2 years ago | |

When I try to install rocm-ml-sdk on Arch linux it'll tell me the total installed size would be about 18GB.

What can possibly explain this much bloat for what should essentially be a library on top of a graphics driver as well as some tools (compiler, profiler etc.)? A couple hundred MB I could understand if they come with graphical apps and demos, but not this..

tomsmeding 2 years ago | | |

A regular TensorFlow installation, just the Python library, is a 184 MB wheel that unpacks to about 1.2 GB of stuff. I have no clue what mess goes in there, but it's a lot.

Still, if you're right that this package seems to take 18 GB disk size, something weird is going on.

sharonzhou 2 years ago |

ROCm is great. We were able to run and finetune LLMs on AMD Instincts with parity to NVIDIA A100s - and built an SDK that’s as easy to use as HuggingFace or easier (Lamini). Or at the very least, our designer is able to finetune/train the latest LLMs on them like Llama 2 - 70B and Mistral 7B with ease. The ROCm library isn’t as easy to use as CUDA because, as another poster said, the ecosystem was built around CUDA. For example, it’s even called “.cuda()” in PyTorch to put a model on a GPU, when in reality you’d use it for an AMD GPU too.

atemerev 2 years ago |

Nope. PyTorch is not enough, you have to do some C++ occasionally (as the code there can be optimized radically, as we see in llama.cpp and the like). ROCm is unusable compared to CUDA (4x more code for the same problem).

I don't understand why everyone neglects good, usable and performant lower-level APIs. ROCm is fast, low-level, but much much harder to use than CUDA, and the market seems to agree.

voz_ 2 years ago |

The amount of random wrong stuff about pytorch in this thread is pretty funny.

whywhywhywhy 2 years ago |

Anyone who has to work in this ecosystem surely thinks this is a naive take

freedomben 2 years ago | |

For someone who doesn't work in this ecosystem, can you elaborate? What's the real situation currently?

whywhywhywhy 2 years ago | | |

Nvidia CUDA was first to market and easier to work with than OpenCL, which was the only competition for the first decade and was then abandoned. Because of this, all the people serious about this are using Nvidia hardware, and therefore all the code is written for Nvidia hardware.

Only way I could see AMD making inroads is if they were willing to provide power at the level Nvidia puts in a data center at consumer prices, with relaxed licensing, to justify retooling the entire ML chain to work on a different architecture.

Geohot has documented his troubles trying to go all in on AMD and he's back on Nvidia now I believe.

benreesman 2 years ago |

I know a lot of people don’t like George. I dislike plenty of people who are doing the right thing (including by some measures sama and siebel while they were pushing YC forward).

But not admitting the tinygrad project is the best Rebel Alliance on this is just a matter of letting vibe overcome results.

frnkng 2 years ago |

As a former ETH miner I learned the hard way that saving a few bucks on hardware may not be worth operational issues.

I had a miner running with Nvidia cards and a miner running with AMD cards. One of them had massive maintenance demand and the other did not. I will not state which brand was better imho.

Currently I estimate that running miners and running gpu servers has similar operational requirements and finally at scale similar financial considerations.

So, whatever is cheapest to operate in terms of time expenditure, hw cost, energy use,… will be used the most.

P.S.: I ran the mining operation not to earn money but mainly out of curiosity. And it was a small-scale business powered by a PV system and an attached heat pump.

latchkey 2 years ago | |

I ran 150,000+ AMD cards for mining ETH. Once I fully automated all the vbios installs and individual card tuning, it ran beautifully. Took a lot of work to get there though!

Fact is that every single GPU chip is a snowflake. No two operate the same.

rottencupcakes 2 years ago | | |

Have you ever written about this enterprise? This sounds super unique and I would be very interested in hearing about how it was run and how it turned out.

pjmlp 2 years ago |

Unless they get their act together regarding CUDA polyglot tooling, I seriously doubt it.

ElectronBadger 2 years ago |

On my PC workstation (Debian Testing) I have absolutely no problems running NVIDIA PNY Quadro P2200, which I'm going to upgrade with PNY Quadro RTX 4000 soon. I'd love to make a switch for AMD Radeon, but the very short (and shrinking) list of ROCm supported cards makes this move highly improbable for the not-so-nearest future.

upbeat_general 2 years ago |

This article doesn’t address the real challenge [in my mind].

Framework support is one thing, but what about the million standalone CUDA kernels that have been written, especially common in research. Nobody wants to spend time re-writing/porting those, especially when they probably don’t understand the low-level details in the first place.

Not to mention, what is the plan for comprehensive framework support? I’ve experienced the pain of porting models to different hardware architectures where various ops are unsupported. Is it realistic to get full coverage of e.g., PyTorch?

bdowling 2 years ago | |

Someone could reimplement CUDA for AMD hardware. That would be legal because copying APIs for compatibility purposes is not copyright infringement. (See Google LLC v. Oracle America, Inc., 593 U.S. ___ (2021)).

AMD is unlikely to do this, however, because it would commodify their own products under their competitor’s API.

A third party could do it though. It may make sense as an open source project.

blueboo 2 years ago | |

Research kernels mostly turn to ash upon publication anyway. The wheel turns and the next post-doc gives ROCm a try and we move on

hankman86 2 years ago |

I suspect that AMD will use their improved compatibility with the leading ML stack for data center deals. Presumably by offering steep discounts over NVIDIA’s GPUs. This might help them to break into the market.

Individual ML practitioners will probably not be tempted to switch to AMD cards anytime soon. Whatever the price difference is: it will hardly offset the time that is subsequently sunk into working around remaining issues resulting from a non-CUDA (and less mature) stack underneath PyTorch.

falconroar 2 years ago |

Is there any reason OpenCL is not the standard in implementations like PyTorch? Similar performance, open standard, runs everywhere - what's the downside?

LoganDark 2 years ago | |

IIRC, ease of implementation (for the GPU kernels), and cross-compatibility (the same bytecode can be loaded by multiple models of GPU).

ealloc 2 years ago | | |

How is CUDA-C that much easier than OpenCL? Having ported back and forth myself, the base C-like languages are virtually identical. Just sub "__syncthreads();" for "barrier(CLK_LOCAL_MEM_FENCE);" and so on. To me the main problem is that Nvidia hobbles OpenCL on their GPUs by not updating their CL compiler to OpenCL 2.0, so some special features are missing, such as many atomics.

jacobgorm 2 years ago | | |

The ease of implementation using CUDA means that your code becomes effed for life, because it is no longer valid C/C++, unless you totally litter it with #ifdefs to special case for CUDA. In my own proprietary AI inference pipeline I've ended up code-generating to a bunch of different backends (OpenCL, SPIR-V, Metal, CUDA, HLSL, CPU w. OpenMP), giving no special treatment to CUDA, and the resulting code is much cleaner and builds with standard open source toolchains.
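A minimal sketch of the code-generation idea described above, assuming a single kernel template that is filled in per backend. The `BACKENDS` table, the `TEMPLATE` string, and the `generate` function are hypothetical names made up for illustration and are not taken from the commenter's pipeline. Only CUDA and OpenCL are shown, because backends such as Metal or HLSL declare kernel arguments differently and would need their own template shape.

```python
# Hypothetical mini code generator: one kernel body, several backends.
# The keyword tables below are illustrative, not a complete mapping.
BACKENDS = {
    "cuda": {
        "kernel": 'extern "C" __global__ void',
        "ptr": "float*",
        "const_ptr": "const float*",
        "index": "blockIdx.x * blockDim.x + threadIdx.x",
    },
    "opencl": {
        "kernel": "__kernel void",
        "ptr": "__global float*",
        "const_ptr": "__global const float*",
        "index": "get_global_id(0)",
    },
}

# The loop body is shared; only qualifiers and the index expression differ.
TEMPLATE = """{kernel} saxpy({const_ptr} x, {ptr} y, float a, int n) {{
    int i = {index};
    if (i < n) {{
        y[i] = a * x[i] + y[i];
    }}
}}
"""


def generate(backend: str) -> str:
    """Return kernel source text for the named backend."""
    return TEMPLATE.format(**BACKENDS[backend])


if __name__ == "__main__":
    for name in BACKENDS:
        print(f"// ---- {name} ----")
        print(generate(name))
```

The generated strings would still have to be handed to each vendor's compiler or runtime; the sketch only shows that the backend-specific parts can be confined to a small table.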

JonChesterfield 2 years ago | |

Downsides are it can't express a bunch of stuff cuda or openmp can plus the nvidia opencl implementation is worse than their cuda one. So opencl is great if you want a lower performance way of writing a subset of the programs you want to write.

tails4e 2 years ago |

AMD playing catch-up is a good thing. Their SW solution is intended to run on any HW, and with HIP being basically line-for-line compatible with CUDA, porting is very easy. They did it with FSR, and they are doing it with ROCm. Hopefully it takes off, as it's a more open ecosystem for the industry. Necessity is the mother of invention and all that.
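To illustrate what "line for line compatible" means in practice, here is a toy sketch of the renaming step that AMD's hipify tools automate. The `CUDA_TO_HIP` table is a small hand-picked subset of real CUDA and HIP runtime names, while the `hipify` function itself is a hypothetical stand-in written for this example, not the actual tool. Kernel-side keywords such as `__global__` and `threadIdx` are the same in HIP, so only host API names are rewritten.

```python
import re

# Hand-picked subset of CUDA-to-HIP renames; the real tools cover far more.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Word boundaries ensure only whole identifiers from the table are replaced;
# names that are not in the table are left untouched.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def hipify(source: str) -> str:
    """Rename the known CUDA runtime identifiers in source to HIP ones."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


if __name__ == "__main__":
    cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""
    print(hipify(cuda_src))
```

Code that leans on CUDA-only libraries or inline PTX would need more than renaming, which is where the remaining porting effort tends to go.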

tormeh 2 years ago |

For LLM inference, a shoutout to MLC LLM, which runs LLM models on basically any API that's widely available: https://github.com/mlc-ai/mlc-llm

einpoklum 2 years ago |

TL;DR:

1. Since PyTorch has grown very popular, and there's an AMD backend for that, one can switch GPU vendors when doing Generative AI work.

2. Like NVIDIA's Grace+Hopper CPU-GPU combo, AMD is/will be offering "Instinct MI300A", which improves performance over having the GPU across a PCIe bus from a regular CPU.
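Point 1 can be made concrete with a short sketch. ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace and the `"cuda"` device string, and set `torch.version.hip`, so model code usually does not need to change. The helper names `describe_backend` and `pick_device` are hypothetical, and the `try`/`except` around the import is only there so the snippet also runs on a machine without PyTorch installed.

```python
# Sketch only: torch is treated as optional so this runs without PyTorch too.
try:
    import torch
except ImportError:
    torch = None


def describe_backend() -> str:
    """Return 'rocm', 'cuda', or 'cpu' for the current PyTorch install."""
    if torch is None or not torch.cuda.is_available():
        return "cpu"
    # ROCm builds set torch.version.hip; CUDA builds leave it as None.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


def pick_device() -> str:
    """Return the device string to pass to torch.device()."""
    # ROCm builds reuse the "cuda" device string, so callers need no branch.
    return "cpu" if describe_backend() == "cpu" else "cuda"


if __name__ == "__main__":
    print(describe_backend(), pick_device())
```

Whether every operator a given model needs is actually implemented on the ROCm side is a separate question that this check does not answer.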

ur-whale 2 years ago |

> AMD May Get Across the CUDA Moat

I really wish they would, and properly, as in: fully open solution to match CUDA.

CUDA is a cancer on the industry.

mschuetz 2 years ago | |

What's wrong with CUDA? I avoided it for years because it's proprietory but about one year ago I started using it because all the alternatives (OpenGL/Vulkan compute, OpenCL, WebGPU, ...) couldn't quite do what I wanted, and it turned out to be a game changer. Nothing comes close to it. Now I'm hooked because there simply isn't an alternative that's as easy to use, yet powerful and fast.

I wish there was an open alternative, but NVIDIA did several things right that others, especially Khronos, do not: The UX is top-notch. It makes the common cases easy yet still fast, and from there you can optimize to your heart's content. Khronos, however, usually completely over-engineers things and makes the common case hard and cumbersome with massive entry barriers.

ur-whale 2 years ago | | |

> What's wrong with CUDA?

Read on

> it's proprietory

Yes indeed, proprietary

> Now I'm hooked

There you go.

> I wish there was an open alternative

So does the rest of the industry.

Specifically, it forces you to run your stuff on NVidia hardware and gives you exactly zero guarantee of future support.

Good luck trying to reproduce whatever research you are currently conducting in 10 years' time.

Vendor lock-in + no forward compatibility guarantee = surefire recipe for getting milked to the bone by NVidia.

raggi 2 years ago |

Can we just get wgsl compute good enough and over the line instead, and do away with these moats?

mschuetz 2 years ago | |

Not happening. WGSL wants to support the lowest common denominator, so it'll always mainly be a 5-year-old mobile-phone API. Also if you want to beat CUDA, you'll need some functionality that's completely missing in compute shaders, especially WGSL. Like pointers and pointer casting (and that GLSL buffer reference extension is the worst emulation of that feature I've ever seen).

raggi 2 years ago | | |

The language extensions feature is designed to provide these kinds of facilities, is it not?

jeffreygoesto 2 years ago |

I am hoping for SYCL and SPIR-V to gain traction...

jiggawatts 2 years ago |

Can I buy an MI300 or even rent one in a cloud?

arcanus 2 years ago | |

Soon. The card is coming in Q4. The early shipments are likely all going to LLNL's El Capitan Exascale computer: https://www.tomshardware.com/news/amds-instinct-mi300-moves-...

spandextwins 2 years ago |

That’s like saying Ford is gonna catch Tesla.

cantaloupe 2 years ago | |

Do you see that as an inevitability or an impossibility?

tpmx 2 years ago | |

No, not really. They have similar enough silicon, they "just" need some software to make it work.

Zetobal 2 years ago |

They are just too late even if they catch up. Until they make a leap like they did with ryzen nothing will happen.

Havoc 2 years ago | |

>They are just too late even if they catch up.

Late certainly, too late I don't think so.

If you can field a competitively priced consumer card that can run llama fast then you're already halfway there because then the ecosystem takes off. Especially since nvidia is being really stingy with their vram amounts.

H100 & datacenter is a separate battle certainly, but on mindshare I think some deft moves from AMD will get them there quite fast once they pull their finger out their A and actually try sorting out the driver stack.

dylan604 2 years ago | | |

>If you can field a competitively priced consumer card

if this unicorn were to show up, what's to say that all the non-consumers won't just scarf up these equally performant yet lower priced cards causing the supply-demand situation we're in now? the only difference would be a sudden supply of the expensive Nvidia cards that nobody wants because of their price.