AMD may get across the CUDA moat (hpcwire.com)
But not admitting the tinygrad project is the best Rebel Alliance on this is just a matter of letting vibe overcome results.
I had a miner running with Nvidia cards and a miner running with AMD cards. One of them had massive maintenance demand and the other did not. I will not state which brand was better imho.
Currently I estimate that running miners and running gpu servers has similar operational requirements and finally at scale similar financial considerations.
So, whatever is cheapest to operate in terms of time expenditure, hw cost, energy use,… will be used the most.
P.S.: I ran the mining operation not to earn money but mainly out of curiosity. It was a small-scale business powered by a PV system and an attached heat pump.
Fact is that every single GPU chip is a snowflake. No two operate the same.
Framework support is one thing, but what about the million standalone CUDA kernels that have been written, especially common in research. Nobody wants to spend time re-writing/porting those, especially when they probably don’t understand the low-level details in the first place.
Not to mention, what is the plan for comprehensive framework support? I’ve experienced the pain of porting models to different hardware architectures where various ops are unsupported. Is it realistic to get full coverage of e.g., PyTorch?
AMD is unlikely to do this, however, because it would commodify their own products under their competitor’s API.
A third party could do it though. It may make sense as an open source project.
Individual ML practitioners will probably not be tempted to switch to AMD cards anytime soon. Whatever the price difference is: it will hardly offset the time that is subsequently sunk into working around remaining issues resulting from a non-CUDA (and less mature) stack underneath PyTorch.
1. Since PyTorch has grown very popular, and there's an AMD backend for that, one can switch GPU vendors when doing Generative AI work.
2. Like NVIDIA's Grace+Hopper CPU-GPU combo, AMD is/will be offering "Instinct MI300A", which improves performance over having the GPU across a PCIe bus from a regular CPU.
I really wish they would, and properly, as in: fully open solution to match CUDA.
CUDA is a cancer on the industry.
I wish there was an open alternative, but NVIDIA did several things right that others, especially Khronos, do not: The UX is top-notch. It makes the common cases easy yet still fast, and from there you can optimize to your heart's content. Khronos, however, usually completely over-engineers things and makes the common case hard and cumbersome with massive entry barriers.
Read on
> it's proprietory
Yes indeed, proprietary
> Now I'm hooked
There you go.
> I wish there was an open alternative
So does the rest of the industry.
Specifically, it forces you to run your stuff on NVidia hardware and gives you exactly zero guarantee of future support.
Good luck trying to reproduce whatever research you are currently conducting in 10 years' time.
Vendor lock-in + no forward compatibility guarantee = surefire recipe for getting milked to the bone by NVidia.
Late certainly, too late I don't think so.
If you can field a competitively priced consumer card that can run llama fast then you're already halfway there because then the ecosystem takes off. Especially since nvidia is being really stingy with their vram amounts.
H100 & datacenter is a separate battle certainly, but on mindshare I think some deft moves from AMD will get them there quite fast once they pull their finger out their A and actually try sorting out the driver stack.
if this unicorn were to show up, what's to say that all the non-consumers won't just scarf up these equally performant yet lower priced cards causing the supply-demand situation we're in now? the only difference would be a sudden supply of the expensive Nvidia cards that nobody wants because of their price.
yes, yes it absolutely does. establishing market dominance as everyone wants to use CUDA but almost nobody wants to write their kernel twice.
As I said, I avoided it for years because of the reasons you mentioned. Turns out I could not avoid it any longer because it's the only (meaningful) option that could do what I needed, has serious support, and great UX. And NVIDIA is hardly to blame because they simply made sure to build a good product. It can't stop AMD, Intel or Khronos from creating a competitive alternative, but so far they haven't.
And regarding support, so far NVIDIA has shown excellent continuous support for CUDA, whereas OpenCL and OpenGL are the ones that went down. And I've chosen CUDA over ROCm precisely for support reasons, because AMD has always treated it as some kind of side gig with an uncertain future.
I've put a bunch of comments here on HN about the stuff I can talk about.
It no longer exists after PoS.
Got the chips directly from AMD. Since these are 4-5 year old chips, they were not going to ever be used. It is more ROI efficient with ETH mining to use older cards than newer ones.
Had a couple of OEMs manufacture the cards specially for us with 8 GB, heatsinks instead of fans (lower power usage) and no display ports (lower cost).
They will be recycled as there isn't much use for them now.
I'm also no longer with the company.
One way to do that may be to produce a card on an older process node (or the existing one when a new one comes out) that has a lot of VRAM. There is less demand for the older node so they can produce more of them and thereby sell them for a lower price without running out.
A unicorn like that showed up a couple hours ago. Someone posted a guide for getting llama to run on a 7900xtx
https://old.reddit.com/r/LocalLLaMA/comments/170tghx/guide_i...
It's still slow and janky but this really isn't that far away.
I don't buy that AMD can't make this happen if they actually tried.
Go on fiverr, get them to compile a list of top 100 people in the DIY LLM space, send them all free 7900XTXs. Doesn't matter if half of it is wrong, just send it. Next take 1.2m USD, post a dozen 100k bounties against llama.cpp that are AMD specific - support & optimise the gear. Rinse and repeat with every other hobbyist LLM/stable diffusion project. A lot of these are zero profit open source / passion / hobby projects. If 6 figure bounties show up it'll absolutely raise pulses. Next do all the big youtubers in the space - carefully on that one so that it doesn't come across as an attempted pay-off...but you want them to know that you want this space to grow and are willing to put your money where your mouth is.
That'll cost AMD what 2m 3m? To move the needle on a multi billion market? That's the cheapest marketing you've ever seen.
As I said the datacenter & enterprise market is another beast entirely full of moats and strategy, but I don't see why a suitably motivated senior AMD exec can't tackle the enthusiast market single handedly with a couple of emails, a cheque book and a tshirt that has the nike slogan on it.
>what's to say that all the non-consumers won't just scarf up these equally performant yet lower priced cards
It doesn't matter. They're in the business of selling cards. To consumers, to datacenters, to your grandmother. From a profit driven capitalist company the details don't matter as long as there is traction & volume. The above - opening up even the possibility of a new market - is gold in that perspective. And from a consumer perspective anything that breaks the nvidia cuda monopoly is a win.
The bigger problem is on the training/research support. E.g., there's no official support for AMD GPUs for bitsandbytes, and no support at all for FlashAttention/FA2 (nothing that 100K in hardware/grants to Dettmers or Dao's labs wouldn't fix I suspect).
The real elephant, though, is that AMD still has a disconnect: its lack of support for consumer cards and home/academic devs in general has been disastrous (while Nvidia supports CUDA on basically every single GPU they've made since 2010). Just last week there was this mind-blowing thread where it turns out an AMD employee is paying out of pocket for AMD GPUs to support build/CI for drivers on Debian. I mean, WTF, that's stupidity that's beyond embarrassing and gets into negligence territory IMO: https://news.ycombinator.com/item?id=37665784
I hope he's at least getting an employee discount! I guess AMD is not a fan of the 20% concept either
That said, I made it work, which was an insane amount of work, and it mined really well.
Best of all is that I simply set the device to `torch.device('cuda')` rather than OpenCL, which does wonders for compatibility and keeps the code simple.
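A minimal sketch of what that looks like, assuming some PyTorch build (ROCm or CUDA) may or may not be installed. The helper name `describe_backend` is made up for illustration. On ROCm builds the GPU is still addressed as `cuda`, and `torch.version.hip` is set instead of `torch.version.cuda`, so that is the only place the two vendors differ here.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def describe_backend():
    """Return (device_name, backend_label) for whatever PyTorch build is present.

    Hypothetical helper, not part of PyTorch.
    """
    if torch is None:
        return "cpu", "no PyTorch installed"
    if not torch.cuda.is_available():
        return "cpu", "no GPU visible"
    # ROCm builds reuse the "cuda" device name; torch.version.hip tells them apart.
    if getattr(torch.version, "hip", None):
        return "cuda", "ROCm/HIP"
    return "cuda", "CUDA"


device_name, backend = describe_backend()
print(device_name, backend)
```

The rest of a script can then use `torch.device(device_name)` unchanged on either vendor's card.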
Protip: Use the official ROCM Pytorch base docker image [0]. The AMD setup is so finicky and dependent on specific versions of sdk/drivers/libraries and it will be much harder to make work if you try to install them separately.
[0]: https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/p...
So it's important that vendors don't feel let off the hook to provide sane packaging just because there's an option to use a kitchen-sink container image they rebuild every day from source.
https://github.com/RadeonOpenCompute/ROCm-docker/blob/master...
They also have some for Fedora. Looks like for this you need to install their repo:
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - \
&& printf "deb [arch=amd64] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" | tee /etc/apt/sources.list.d/rocm.list \
&& printf "deb [arch=amd64] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" | tee /etc/apt/sources.list.d/amdgpu.list
then install Python, a couple other dependencies (build-essential, etc) and then the package in question: rocm-dev
So they are doing the packaging. There might even be documentation elsewhere for that type of setup.
Sadly if e.g. 95% of their users can use the container, then it could make economical sense to do it that way.
is this a real problem? exactly which embedded platform has a device that ROCm supports?
"x86 cannot do 64-bit, so let us do this and that so the market can only use our CPU." Repeat after me: x86-64 is impossible.
Not sure whether Apple is in this; otherwise the real competition would come.
Man oh man where did we go wrong that cuda is the more compatible option over OpenCL?
AMD should just get its shit together. This is ridiculous. Not the name, but the fact that you can only do FP64 on a GPU. Everybody is moving to FP16 and AMD is stuck on doubles?
There’s getting something to “work”, which is often enough of a challenge with ROCm. Then there’s getting it to work well (next challenge).
Then there’s getting it to work as well as Nvidia/CUDA.
With Whisper, as one example, you should be running it with ctranslate2[0]. Of all the platforms on their supported list you won’t find ROCm.
When you really start to look around you’ll find that ROCm is (at best) still very much in the “get it to work (sometimes)” stage. In most cases it’s still a long way away from getting it to work well, and even further away from making it actually competitive with Nvidia for serious use cases and applications.
People get excited about the progress ROCm has made getting basic things to work with PyTorch and this is good - progress is progress. But saving 20% on the hardware when the equivalent Nvidia product is often somewhere between 5-10x as performant (at a fraction of the development time) because of vastly superior software support you realize pretty quickly Nvidia is actually a bargain compared to AMD.
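To put rough numbers on that argument, here is a back-of-the-envelope sketch. It takes the 20% hardware discount and the 5-10x throughput gap claimed above at face value (they are that commenter's figures, not benchmarks) and ignores development time and power.

```python
def cost_per_unit_of_work(price, relative_throughput):
    """Hardware price divided by how much work the card gets done."""
    return price / relative_throughput


NVIDIA_PRICE = 1.0  # normalised
AMD_PRICE = 0.8     # 20% cheaper, per the comment above


def amd_cost_ratio(nvidia_speedup):
    """How many times more AMD costs per unit of work than Nvidia."""
    amd = cost_per_unit_of_work(AMD_PRICE, 1.0)
    nvidia = cost_per_unit_of_work(NVIDIA_PRICE, nvidia_speedup)
    return amd / nvidia


for speedup in (5, 10):
    print(f"Nvidia {speedup}x faster: AMD costs about {amd_cost_ratio(speedup):.0f}x more per unit of work")
```

Under those assumptions the 20% discount is swamped: the cheaper card ends up roughly 4x to 8x more expensive per unit of work.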
I’m desperately rooting for Nvidia to have some actual competition but after six years of ROCm and my own repeated failed attempts to have it make any sense overall I’m only more and more skeptical that real competition in the space will come from AMD.
One arcane detail is that whereas for PyTorch I have to set the env var HSA_OVERRIDE_GFX_VERSION to 10.3.0, getting it to run with whisper.cpp and llama.cpp requires setting it to 10.1.0. Good luck and may it cost you less hair than it did me.
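One way to keep those per-tool overrides straight is to build the environment in a small launcher instead of exporting the variable globally. This is only a sketch: the version values are the ones reported above for that particular card, other GPUs may need different ones, and `env_for` is a made-up helper name.

```python
import os

# Values reported above for one particular card; treat them as an example only.
GFX_OVERRIDES = {
    "pytorch": "10.3.0",
    "whisper.cpp": "10.1.0",
    "llama.cpp": "10.1.0",
}


def env_for(tool, base_env=None):
    """Return a copy of the environment with HSA_OVERRIDE_GFX_VERSION set for `tool`."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = GFX_OVERRIDES[tool]
    return env


print(env_for("pytorch", base_env={})["HSA_OVERRIDE_GFX_VERSION"])  # prints 10.3.0
```

The returned dict can be passed as `env=` to `subprocess.run(...)` when launching each tool, so the two settings never collide in one shell.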
NVIDIA fp32 (H100) has 2x more TFLOPS than AMD's fp32 (MI250) and AI doesn't need fp64 precision.
Running Nvidia in Linux isn't as much fun. Fedora and Debian can be incredibly reliable systems, but when you add an Nvidia card, I feel like I am back in Windows Vista with kernel crashes from time to time.
Now what I'd like to see is real benchmarks for compute power. Might even get a few startups to compete in this new area.
AMD has the hardware but the support for HPC is non-existent outside of the joke that is BLIS and AOCL.
I really wish for more competitors to enter the market in HPC, but AMD has a shitload of work to do.
You are probably two years behind the state of the art. The world's largest supercomputer, OLCF's Frontier, runs AMD CPUs and GPUs. It's emphatically using ROCm, not just BLIS and AOCL. See for example: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
That's hardly non-existent support for HPC.
Also, I choose to pay the ~$120 Windows tax once (per box), everything works very well, and I don't have the driver issues that some fraction of other users seem to have with Linux and Nvidia cards. Seems like a good use of my time.
(Edited “no” to limited empirical evidence after a fellow user mentioned El Capitan.)
Newer backends for AI frameworks like OpenXLA and OpenAI Triton directly generate GPU native code using MLIR and LLVM, they do not use CUDA apart from some glue code to actually load the code onto the GPU and get the data there. Both already support ROCm, but from what I've read the support is not as mature yet compared to NVIDIA.
Apple has shown us in practice the benefits of CPU/GPU memory sharing; will AMD be able to follow in their footsteps? The article claims AMD has a design with up to 192 GB of shared RAM. Apple is already shipping a design with the same amount of RAM (if you can afford it). I wish them success, but I believe they need to aim higher than just matching Apple in some unspecified future.
NVIDIA's moat is the years of work built up by the OSS community, big corporations, and research institutes
They spent all that time building for CUDA, and a lot of implicit design decisions are derived from CUDA's characteristics
That will be the main challenge
If you can add hardware support to a major library and improve on the packaging and deployment front while also undercutting on price, that's the moat gone overnight. CUDA itself only matters in terms of lock-in if you're calling CUDA's own functions.
No matter what you depend on, you'll have a slew of larger or minor obstacles or annoyances
That collectively is the moat itself
As you said, already it's clear that replacing cuda itself is not that daunting
Last time I looked into ROCm (two years ago?), you seemed to have to compile stuff explicitly for the architecture you were using, so if a new card came out, you couldn't use it without a recompile.
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/...
Performance penalty was within a few percent, at least according to the paper (figures 9 and 10) https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...
Nvidia has put a huge amount of work into making code run smoothly and fast. AMD has to work hard to catch up. ROCm code is slower, has more bugs, don't have enough features and they have compatibility issues between cards.
Everyone knows that CUDA is a core competency of Nvidia and they have stuck to it for years and years refining it, fixing bugs, and making the experience smoother on Nvidia hardware.
On the other hand, AMD has not had the same level of commitment. They used to sing the praises of OpenCL. And then there is ROCm. Tomorrow, it might be something else.
Thus, Nvidia CUDA will get a lot more attention and tuning from even the portability layers because they know that their investment in it will reap dividends even years from now, whereas their investment in AMD might be obsolete in a few years.
In addition, even if there is theoretical support, getting specific driver support and working around driver bugs is likely to be more of a pain with AMD.
At some point the old complaints are no longer valid.
Good for them. We can hope the open side catches up either by improving their standards, or adding more layers like this article describes.
People made a programming language & a compiler/runtime for GPGPU in 2004: https://en.wikipedia.org/wiki/BrookGPU
Unfortunately, since the AMD firmware doesn't reliably do what it's supposed to, those ROCm calls often don't either. That's if your AMD card is even still supported by ROCm: the AMD RX 580 I bought in 2021 (the great GPU shortage) had its ROCm support dropped in 2022 (4 years support total).
The only reliable interface in my experience has been via opencl.
My understanding is CUDA's main strength is avoiding this. Do you agree? Is that why it's such a big deal? Ie, why this article was written, since you could always do compute shaders on AMD etc using Vulkan.
Off topic, but I am also looking with great interest at Apple Silicon SOCs with large internal RAM. The internal bandwidth also keeps getting better which is important for running trained LLMs.
Back on topic: I don’t own any current Intel computers but using Colab and services like Lambda Labs GPU VPSs is simple and flexible. A few people here mentioned if AMD can’t handle 100% of their workload they will stick with Intel and NVidia - understandable position, but there are workarounds.
The ROCm libraries just aren’t good enough currently. The documentation is poor. AMD need to heavily invest in their software ecosystem around it, because library authors need decent support to adopt it. If you need to be a Facebook sized organisation to write an AMD and CUDA compatible library then the barrier to entry is too high.
The adoption of CUDA has been such a coup for Nvidia, it's going to take some time to dismantle it.
Just look at cuFFT vs rocFFT, for example. They aren’t even close to feature parity: multi-GPU support is totally missing and callbacks are still “experimental”. These are pretty basic features. Bear in mind that when people ported from CPU codes, CUDA had to support these because they existed in FFTW (transforms over multiple CPUs rather than GPUs though, via MPI).
Also, while the CPU instruction sets are not exactly equal, the same is true for Intel processors of different generations too. And it doesn't matter one bit... Unless there is a bug in CPU you will never notice the difference, because it is taken care of at the compiler / kernel level.
Intel does have some advantages (and disadvantages too) over AMD, just not those.
Historically HPC was simply not sufficiently interesting (in commercial sense) for people to throw serious resources in the direction of making it a mass market capability.
NVIDIA first capitalized on the niche crypto industry (which faded) and was then well positioned to jump into the AI hype. The question is how much of the hype will become real business.
The critical factor for the post-CUDA world is not any circumstantial moat but who will be making money servicing stable, long term computing needs. I.e., who will be buying this hardware not with speculative hot money but with cashflow from clients that regularly use and pay for a HPC-type application.
These actors will be the long term buyers of commercially relevant HPC and they will have quite a bit of influence on this market.
What can possibly explain this much bloat for what should essentially be a library on top of a graphics driver as well as some tools (compiler, profiler etc.)? A couple hundred MB I could understand if they come with graphical apps and demos, but not this..
Still, if you're right that this package seems to take 18 GB disk size, something weird is going on.
I don't understand why everyone neglects good, usable and performant lower-level APIs. ROCm is fast, low-level, but much much harder to use than CUDA, and the market seems to agree.
The only way I could see AMD making inroads is if they were willing to provide power of the level Nvidia puts in a data center at consumer prices and relaxed licensing to justify retooling the entire ML chain to work on a different architecture.
Geohot has documented his troubles trying to go all in on AMD and he's back on Nvidia now I believe.
Turns out it was a conflict between nvidia drivers and my (10 year old) Intel integrated GPU. But once I switched to an AMD card, everything works flawlessly.
Ubuntu based systems barely worked at all. Incredibly unstable and would occasionally corrupt the output and barf colors and fragments of the desktop all over my screens.
AMD on arch has been an absolute delight. It just. Works. It's more stable than nvidia on windows.
For a lot of reasons-- but mainly Linux drivers-- I've totally sworn off nvidia cards. AMD just works better for me.
But the other problem that really bugs me is the "AMD reset bug" that you trip over with most AMD GPUs. This is when you pass through a second GPU through to another OS running under KVM, and is what lets you run Linux and (say) Windows simultaneously with full GPU hardware acceleration on the guest. The reset bug means the GPU will hang upon shutdown of the guest and only a reboot will let you recover the card. This is a silicon level bug that has existed for many years across many generations of cards and AMD can't be arsed to fix it. Projects like "vendor-reset" help for some cards, but gnif2 has basically given up (he mentioned he even personally raised the issue with Lisa Su). Even AMDs latest cards like the 7800 XT are affected. NVidia works flawlessly here.
After every kernel upgrade, I just have to reinstall the nvidia drivers and the cuda toolkit.
Everything works as before after I do that. I don't face any problems at all.
What AMD really needs is to have 100% feature parity with CUDA without changing a single line of code. Maybe for this to happen it needs to add hardware features or something (I see people saying that CUDA as an API is very tailored to the capabilities of nvidia GPUs), I don't know.
If AMD relies on people changing their code to make it portable, it already lost.
I think where that idea goes wrong is in order to compile it unmodified for nvptx, you need to use a toolchain which knows hip and nvptx, which the cuda toolchain does not. Clang can mostly compile cuda successfully but it's far less polished than the cuda toolchain. ROCm probably has the nvptx backend disabled, and even if it's built in, best case it'll work as well as clang upstream does.
What I'm told does work is keeping all the source as cuda and using hipify as part of a build process when using amdgpu - something like `cat foo.cu | hipify | clang -x hip -` - though I can't personally vouch for that working.
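For a feel of what that hipify step does, here is a toy sketch of the source-to-source rename idea. It is not the real hipify tool: it only knows the handful of CUDA runtime names listed in the table (each mapped to its HIP counterpart), and the function name `toy_hipify` is made up for illustration. The real tools cover far more of the API and handle many cases a plain text substitution cannot.

```python
import re

# A tiny, hand-picked subset of renames; the real hipify tools cover far more.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice is not cut short by cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source):
    """Rewrite the known CUDA identifiers in `source` to their HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_src = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d_x, n);\n"
    "cudaMemcpy(d_x, x, n, cudaMemcpyHostToDevice);\n"
    "cudaFree(d_x);\n"
)
print(toy_hipify(cuda_src))
```

Most of the runtime API maps one-to-one like this, which is why keeping the source as CUDA and translating at build time is plausible at all; the hard parts are the libraries and anything vendor-specific.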
The original idea was people would write in opencl instead of cuda but that really didn't work out.
I'm wondering how true that is, because that could give NVidia issues in the future if they need to redesign their GPU should they hit some limit with the current designs. Dependence on certain instructions makes sense, but there's nothing technical preventing AMD from implementing those instructions, only legal mumbo jumbo.
I've literally been running nvidia on linux since the TNT2 days and have _never_ had this sort of issue. That's across many drivers and many cards over the many many years.
Your statement makes no sense. It's like a smoker claiming that since he didn't die of lung cancer, smoke is 100% safe.
My guess: something like laptop GPU switching failed badly in the nvidia binary, earning it a reputation.
Having said that - I (or rarely, other people) have almost always managed to work out those issues and get my systems to work. Not in all cases though.
Pop!_OS, Fedora and OpenSUSE work out of the box. Those are all Wayland I believe. Debian/Ubuntu distros are a bad time. I think they’re still X11. It’s ironic because X11 is supposed to be the more stable display server.
I'm definitely not against better hardware support for AI, but I think your problems are more GNOME's fault than Nvidia's. KDE's Wayland session is almost flawless on Nvidia nowadays.
Can not confirm. I used nvidia for years when it was the only option. Then used the nouveau driver on a well supported card because it worked well and eliminated hassle. Now I'm on AMD APU and it just works out of the box. YMMV of course. We do get reports of issues with AMD on specific driver versions, but I can't reproduce.
I have an Nvidia laptop with Pop!_OS. That works well.
Hobbyist and open-source are definitely not synonyms.
The way he stole Fail0verflow's work with the PS3 security leak after failing to find a hypervisor exploit for months absolutely soured any respect I had for him at the time
this is so far from accurate it should be considered libelous; from the link
> PyTorch/XLA is set to migrate to the open source OpenXLA
so PyTorch on the XLA backend is set to migrate to use OpenXLA instead of XLA. but basically everyone moved from XLA to OpenXLA because there is no more OSS XLA. so that's it. in general, PyTorch has several backends, including plenty of homegrown CUDA and CPU kernels. in fact the majority of your PyTorch code runs through PyTorch's own kernels.
If you use torch.compile() (or model.compile()) in PyTorch, you use TorchInductor and OpenAI's Triton by default.
Well, let's say "smoother" rather than "smoothly".
> ROCm code is slower
On physically-comparable hardware? Possible, but that's not an easy claim to make, certainly not as expansively as you have. References?
> has more bugs
Possible, but - NVIDIA keeps their bug database secret. I'm guessing you're concluding this from anecdotal experience? That's fair enough, but then - say so.
> ROCm ... don't have enough features and
Likely. AMD has spent less in that department (and had less to spend, I guess); and, no less importantly, it tried to go along with the OpenCL initiative, as specified by the Khronos consortium, while NVIDIA has sort of "betrayed" the initiative by investing in its vendor-locked, incompatible ecosystem and letting its OpenCL support decay in some respects.
> they have compatibility issues between cards.
such as?
https://github.com/InternLM/lmdeploy
https://github.com/vllm-project/vllm
https://github.com/OpenNMT/CTranslate2
You know what’s missing from all of these and many more like them? Support for ROCm. This is all before you get to the really wildly performant stuff like Triton Inference Server, FasterTransformer, TensorRT-LLM, etc.
ROCm is at the “get it to work stage” (see top comment, blog posts everywhere celebrating minor successes, etc). CUDA is at the “wring every last penny of performance out of this thing” stage.
In terms of hardware support, I think that one is obvious. The U in CUDA originally stood for unified. Look at the list of chips supported by Nvidia drivers and CUDA releases. Literally anything from at least the past 10 years that has Nvidia printed on the box will just run CUDA code.
One of my projects specifically targets Pascal up - when I thought even Pascal was a stretch. Cue my surprise when I got a report of someone casually firing it up on Maxwell when I was pretty certain there was no way it could work.
A Maxwell laptop chip. It also runs just as well on an H100.
THAT is hardware support.
PyTorch is already walking down this path and while CUDA-based performance is significantly better, that is changing and of course an area of continued focus.
It's not that people don't like Nvidia, rather it's just that there is a lot of hardware out there that can technically perform competitively, but the work needs to be done to bring it into the circle.
I sure hope I'm wrong.
https://www.investopedia.com/terms/o/oligopoly.asp
With that few competitors pricing would not change much.
In the gaming market for GPUs, Nvidia has no competition except in some niche areas. Overall, their lead in upscaling software is too commanding so they can price how they want. Customers are paying 15-20% premiums for the same raw hardware performance, all to access Nvidia's DLSS, because there's no good competition.
It would be a painful reverse engineering process - the cuda file format is sort of like elf, but with undocumented bonus constraints, and you'd have to reverse the instruction encoding to get sass, which isn't documented, or try to take it directly to ptx which is somewhat documented, and then convert that onward.
It would be far more difficult than compiling cuda source directly. I'm not sure anyone would pay for a cuda->amdgpu conversion tool, and it's hard to imagine AMD making one as part of ROCm.
I'm not here to disparage anyone experiencing issues, but my experience on the NixOS rolling-release channel has also been pretty boring. There was a time when my old 1050 Ti struggled, but the modern upstream drivers feel just as smooth as my Intel system does.
Good to hear more than a cheap snub. OpenAI Triton as the reason other GPUs work is a real non-shit answer, it seems. And interesting to hear JAX too. Thank you for being robustly useful & informative.
You can also actually buy them as opposed to the nVidia offerings which you are going to have to fight for.
https://www.tomshardware.com/news/whisper-audio-transcriptio... is a good example of Nvidia having no excuses being double the price when it comes to Whisper inference, with 7900XTX being directly comparable with 4080, albeit with higher power draw. To be fair it's not using ROCm but Direct3D 11, but for performance/price arguments sake that detail is not relevant.
EDIT: Also using CTranslate2 as an example is not great as it's actually a good showcase why ROCm is so far behind CUDA: It's all about adapting the tech and getting the popular libraries to support it. Things usually get implemented in CUDA first and then would need additional effort to add ROCm support that projects with low amount of (possibly hobbyist) maintainers might not have available. There's even an issue in CTranslate2 where they clearly state no-one is working to get ROCm supported in the library. ( https://github.com/OpenNMT/CTranslate2/issues/1072#issuecomm... )
It easily is. See the benchmarks[0] from faster-whisper which uses Ctranslate2. That's 5x faster than OpenAI reference code on a Tesla V100. Needless to say something like a 4080 easily multiplies that.
> https://www.tomshardware.com/news/whisper-audio-transcriptio... is a good example of Nvidia having no excuses being double the price when it comes to Whisper inference, with 7900XTX being directly comparable with 4080, albeit with higher power draw. To be fair it's not using ROCm but Direct3D 11, but for performance/price arguments sake that detail is not relevant.
With all due respect to the author of the article this is "my first entry into ML" territory. They talk about a 5-10 second delay, my project can do sub 1 second times[1] even with ancient GPUs thanks to Ctranslate2. I don't have an RTX 4080 but if you look at the performance stats for the closest thing (RTX 4090) the performance numbers are positively bonkers - completely untouchable for anything ROCm based. Same goes for the other projects I linked, lmdeploy does over 100 tokens/s in a single session with LLama2 13b on my RTX 4090 and almost 600 tokens/s across eight simultaneous sessions.
> EDIT: Also using CTranslate2 as an example is not great as it's actually a good showcase why ROCm is so far behind CUDA: It's all about adapting the tech and getting the popular libraries to support it. Things usually get implemented in CUDA first and then would need additional effort to add ROCm support that projects with low amount of (possibly hobbyist) maintainers might not have available. There's even an issue in CTranslate2 where they clearly state no-one is working to get ROCm supported in the library. ( https://github.com/OpenNMT/CTranslate2/issues/1072#issuecomm... )
I don't understand what you're saying here. It (along with the other projects I linked here[2]) is a fantastic example of just how far behind the ROCm ecosystem is. ROCm isn't even on the radar for most of them, as your linked issue highlights.
Things always get implemented in CUDA first (ten years in this space and I've never seen ROCm first) and ROCm users either wait months (minimum) for sub-par performance or never get it at all.
[0] - https://github.com/guillaumekln/faster-whisper#benchmark
[1] - https://heywillow.io/components/willow-inference-server/#ben...
[2] - https://news.ycombinator.com/item?id=37793635#37798902
that's not embedded dev. if you
1. use underpowered devices to perform sophisticated tasks
2. use code/tools that operate at extremely high levels of "abstraction"
don't be surprised when all the inherent complexity is tamed using just more layers of "abstraction". if that becomes a problem for your cost/power/space budget then reconsider choice 1 or choice 2.
If we are talking about 3D-Now, that is long dead and buried. If we are talking about the latest AVX-whatever, not even Intel is consistent, with different processor families supporting different subsets and applying different clock policies.
Yes it was Intel's own spec, so of course they're gonna implement it first, but that's exactly what I mean. This is a recurring dance, and I'll pay a little more for the one that sets the standard. If this weren't a thing, they'd both just be commodity.
I think my issue is more just with the mindset that it's okay to have one narrow slice of supported versions of everything that are "known to work together", those are what's in the container, and with anything outside of those you're immediately pooched.
This is not hypothetical btw, I've run into real problems around it with libraries like gproto, where tensorflow's bazel build pulls in an exact version that's different from the default one in nixpkgs, and now you get symbol conflicts when something tries to link to the tensorflow c++ API while linking to another component already using the default gproto. I know these problems are solvable with symbol visibility control and whatever, but that stuff is far from universal and hard to get right, especially if the person setting up the build rules for the library doesn't themselves use it in that type of heterogeneous environment (like, everyone at Google just links the same global proto version from the monorepo so it doesn't matter).
I hear you. I think docker has been a plague on the quality of software. It's allowed "works for me" to become the norm, except it's now pronounced "works on the official docker image". It seems to be especially true in the ML sphere where compiling things is so temperamental that there's a lot of binaries being distributed.
Docker was meant to be a deployment platform, not a distribution medium.
You need to ensure that there is only one version of any library used globally throughout the code and that the set of versions is compatible with each other, and preferably you also want everything to be built against the same toolchain with the same flags.
That usually means onboarding third-party libraries into your own build system.
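To make the "only one version of any library, globally" rule concrete, here is a minimal sketch of the kind of check a build system could run before linking. The component names and version numbers are made up for illustration (only the gproto name comes from the comment above), and a real build system would read its pins from lockfiles or build rules rather than from a dict.

```python
from collections import defaultdict


def find_version_conflicts(components):
    """Return {library: {version: [components]}} for every library
    that is pinned to more than one version across the components."""
    seen = defaultdict(lambda: defaultdict(list))
    for component, pins in components.items():
        for library, version in pins.items():
            seen[library][version].append(component)
    return {
        library: {version: sorted(users) for version, users in versions.items()}
        for library, versions in seen.items()
        if len(versions) > 1
    }


# Hypothetical pins, for illustration only.
components = {
    "ml_framework": {"gproto": "3.21.0", "zlib": "1.2.13"},
    "rpc_layer": {"gproto": "3.24.0", "zlib": "1.2.13"},
}

conflicts = find_version_conflicts(components)
for library, versions in conflicts.items():
    print(library, versions)
```

Failing the build whenever `conflicts` is non-empty is roughly what onboarding third-party libraries into your own build system buys you: the clash shows up at configuration time instead of as a symbol conflict at link time.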
Wanting to hold the Python+C ecosystem more accountable is fair, I think. At least in my own experience from around half a year ago, Anaconda doesn't work and you need a Dockerfile for any sort of reproducibility, which has its own issues since GPU access with docker isn't that easy. Accountability here means developers from the vendors working with Anaconda, for example, on solving the issue rather than just hoping for contributors to do it. If AMD were to make easy, reproducible builds without root or a VM a reality, that would be reason enough to try their hardware. If not, hopefully Nvidia does, and then there really would be no way across the moat, for me at least.
Also, just to name it, it’s ridiculous that a specific graphics card manages to restrict the version of gproto that you’re using. You don’t have this problem with nvidia drivers, since cuda stuff is much less fiddly. AMD needs to pull a finger out and fix the bugs in their stack that make it so fragile like this.
Or rather, I install no versions of libraries because NixOS will put them all in the store in different folders, and will compile the executable to use the correct path (or patch the elf when needed)
it has an issue with pip because it's allergic to just randomly executing things as part of package management, but pip in general is wtf
Well, except for cuda. Which is a massive pile of proprietary software that people are using in production anyway.
Heh, if only. When working with F100's I've seen many terrible, terrible things.
There is some strange issue with some games where they don't get full performance from the dGPU, but more than the iGPU. I have to use optirun to get full performance.
It also has problems when the computer wakes from sleep. For whatever reason, hardware video decoding doesn't work after entering standby. Makes steam in home streaming crash on the client, but flipping to software decoding usually works fine.
The important part is that battery life is almost as good with bumblebee as it is with the dGPU turned off. No more fucking with Prime or rebooting into BIOS to turn the GPU back on.
Eg: https://www.intel.com/content/www/us/en/developer/videos/opt...
https://www.intel.com/content/www/us/en/developer/tools/onea...
https://developer.apple.com/metal/tensorflow-plugin/
Large scale opensource is, outside of a few exceptions, built by engineers paid to build it.
AMD are being dragged along by the market. Willingly, they aren't fighting it, but their focus has been on other areas.
They've shifted a large pool of experienced engineers from legacy software projects to AI and moved the team under a veteran Xilinx AI director. Fingers crossed we should see significant changes in 2024.
https://www.fool.com/earnings/call-transcripts/2023/08/01/ad...
it's literally ALL AI, server, enterprise talk - AI is mentioned 64 times
AMD literally doesn't care about gaming anymore, server is their primary focus
The API level I could target was at least two or three versions behind the latest they have to offer.
But this is one of the great strengths of CUDA: I can develop a kernel on my workstation, my boss can demo it on his laptop and we can deploy it on Jetsons or the multi-gpu cluster with minimal changes and i can be sure that everything runs everywhere.
I just don't understand the details enough to understand why things are problematic without CUDA :(
Some architectures provide fast F16->F32 and F32->F16 conversion instructions so you can DIY the memory bandwidth saving - that always seemed reasonable to me, but I don't know if the AMD hardware people are/will go down that path.
Keeping all those FPUs busy is another problem and not easy, but in cases where it can be done FP32 is clearly desirable.
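The "DIY the memory bandwidth saving" idea can be sketched without any GPU at all: keep the data in half precision and widen it only when you do arithmetic. This uses Python's standard `struct` module (the `e` format is IEEE 754 binary16). The helper names are mine, and note that Python floats are 64-bit, so this only illustrates the storage side, not an actual FP32 pipeline or any particular hardware instruction.

```python
import struct


def pack_f16(values):
    """Store values as IEEE 754 half precision, 2 bytes each."""
    return struct.pack(f"<{len(values)}e", *values)


def pack_f32(values):
    """Store values as IEEE 754 single precision, 4 bytes each."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f16(buf):
    """Widen stored half-precision values back to floats for arithmetic."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


values = [0.5, 1.5, -2.25, 1024.0]  # all exactly representable in FP16
half = pack_f16(values)
single = pack_f32(values)
print(len(half), len(single))  # half the bytes to move through memory

# Arithmetic happens on the widened values, not on the 16-bit storage.
total = sum(unpack_f16(half))

# Values that are not exactly representable lose precision in storage.
approx = unpack_f16(pack_f16([0.1]))[0]
```

The trade-off is the last line: the bandwidth saving is paid for with rounding at store time, so it only makes sense where the stored values tolerate roughly three decimal digits of precision.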
I have come to accept that graphics card drivers and hardware stability ultimately comes down to whether or not ghosts have decided to haunt you.
Even my 4GB RX 570 from years ago gives a better experience doing this. You just install OBS from flathub, Wayland works, everything works without any setup or tinkering. You click record and you can record your gameplay footage.
It works flawlessly.
Never used Wine + OBS, though.
just ALL videos on my system are broken, I can't play back video past like half speed so the sound gets really choppy
Thanks, Nvidia
Anyway, no judgement, just my POV.
Nvidia has 80% market share of the discrete GPU desktop market and at least 90% market share of cloud/datacenter.
Nvidia GPUs are used almost exclusively for every cloud powered AI service and to train virtually every ML model in existence. Almost always on Linux.
Do you really think any of this would be possible if what you are describing was anything approaching the typical experience starting at the /driver/ level?
Nvidia would have never achieved their market dominance nor held on to it this long if the issues you’ve experienced impacted anything approaching a statistically significant number of users or applications.
Nvidia gets a lot of hate on HN and elsewhere (much of it fair) but I will never understand the people who claim it doesn’t work and get the job done (often very well).
Gaming users may tolerate some flakiness for their hobby but these AI companies dealing in the nine-figure range (minimum) absolutely do not.
If you're AMD or NVIDIA and lowering the price would double the number of customers, you might very well want to do that, unless you're supply constrained -- which has been the issue because they're both bidding against everyone else for limited fab capacity. But that should be temporary.
This is also a market with a network effect. If all your GPUs are $1000 and nobody can afford them then nobody is going to write code for them, and then who wants them? So the winning strategy is actually to make sure that there are kind of okay GPUs available for less than $300 and make sure lots of people have them, then sell very expensive ones that use the same architecture but are faster.
That has been the traditional model, but the lack of production capacity meant that they've only been making the overpriced ones recently. Which isn't actually in their interests once the supply of fab capacity loosens up.
> This is also a market with a network effect
That is what the demand curve describes. In your hypothetical that would mean the demand curve is more vertical in slope.
Higher prices are more likely due to several factors: participants willing to pay more (crypto and AI), but also fewer companies making the things than 15 years ago, so less supply and oligopoly-style pricing. Plus one company is the linchpin for building the chips, and another company consumes a large portion of its supply. The supply curve shifted left and up, while the demand curve is moving up and to the right. There is no 'one thing' that causes it. But oligopoly pricing is very much in effect, with 3 companies making the things.
> Which isn't actually in their interests once the supply of fab capacity loosens up
Which would change the supply curve and they would re-evaluate which way to move the price. That could mean bad things or nothing happens other than possible lower prices (eventually).
Game "AI" is meant to be fun to play (and win) against, they're not meant to be smart; that's why zombie games are so successful. Most "game AI" are finite state machines, throwing a neural network at the issue would be absurd overkill.
I'm sure there will be some AI applications in games (like procedural world generation or such, perhaps) but it's not the obvious connection that most people think.
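For anyone who hasn't seen one, the finite-state-machine style of game "AI" mentioned above looks roughly like this. It is a toy sketch: the states and event names are invented for illustration and not taken from any particular game or engine.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CHASE = auto()
    ATTACK = auto()


# (current state, event) -> next state. Anything not listed is ignored.
TRANSITIONS = {
    (State.IDLE, "player_seen"): State.CHASE,
    (State.CHASE, "player_in_reach"): State.ATTACK,
    (State.CHASE, "player_lost"): State.IDLE,
    (State.ATTACK, "player_out_of_reach"): State.CHASE,
}


def step(state, event):
    """Advance the machine by one event; unknown events keep the state."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state


print(run(["player_seen", "player_in_reach"]))
```

A zombie is basically this table plus an animation set, which is why it is cheap to run, easy to tune for fun, and why a neural network would be overkill for the job.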
Ryzen 9 7950X: $799 on release. Intel 13900K: $589.
A docker container is not really any different from any other process; the main difference is that it runs in a chroot pretty much.
But that has nothing to do with semver.
Semver gives you information about when you can replace one version with another version. It doesn't promise that you can mix multiple versions together.
And you are mixing multiple versions if you are building against version x.y and linking against version x.(y+z).
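The replacement rule being described can be written down in a few lines. This is a simplified sketch: it handles plain MAJOR.MINOR.PATCH strings only (no pre-release or build tags), the function names are mine, and treating 0.x versions as exact-match-only is a conservative assumption based on semver calling major version zero unstable.

```python
def parse(version):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_replace(built_against, candidate):
    """True if semver says `candidate` may stand in for `built_against`."""
    built, cand = parse(built_against), parse(candidate)
    if built[0] == 0:
        # Major version zero makes no compatibility promise; assume exact match only.
        return built == cand
    # Same major version, and not older than what the code was built against.
    return cand[0] == built[0] and cand[1:] >= built[1:]


print(can_replace("1.4.2", "1.5.0"))
print(can_replace("1.4.2", "2.0.0"))
```

Note what the function does not answer: whether 1.4.2 and 1.5.0 can both be loaded into the same process. It only says one may be swapped for the other, which is the distinction being made above.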
yes and in those instances you do not reach for pytorch/tensorflow on top of ubuntu on top of x86 with a discrete gpu and 32gb of ram. instead you reach for C and micro or some arm soc that supports baremetal or at most rtos. that's embedded dev.
so i'll repeat myself: if you want to run extremely high-level code then don't be "surprised pikachu" when your underpowered platform, that you chose due to concrete, tight budgets doesn't work out.
However, containers or Ubuntu Linux don’t perform great in that environment. Ubuntu is for desktops, containers are for cloud data centers. An offline stand-alone device is different. BTW, end users typically aren’t aware that the thing is a computer at all.
Personally, I usually pick Alpine or Debian Linux for similar use cases, bare metal i.e. without any containers.
Tell that to their (much larger, more profitable, and better-funded) server org. This is far from true.
Or at least you have some case, as heaven never comes. Or it comes and we are just not aware of it now, like the internet. Would you need to use IBM to run SNA to provide a token ring based network? In 1980 …
Imagine, and let us or them compete …
https://twitter.com/realGeorgeHotz/status/166980346408248934...
That sounds interesting. I tried googling about it but can't really find much other than that fail0verflow found a key and didn't release it, and then geohot released his own subsequently. I'd love to hear more about how directly he "stole" the work from the fail0verflow team.
edit: Reading some sibling comments here, it seems you are either mistaken and/or were exaggerating your claim about the "theft" here. As far as I can tell, he simply took their findings and made his own version of an exploit that they had detailed publicly. That may be in poor taste in this particular community but it's certainly not theft. I do agree that his behavior there was lacking in decency, but not to the degree implied here where I was thinking he _literally_ stole their exploit by hacking them, or something similar to that.
Fail0verflow demoed how they were able to derive the private signing keys for the Sony PlayStation 3 console at, I believe, CCC.
Geohot, after watching the livestream, raced into action to demo a "hello world!" jailbreak application and absolutely stole their thunder without giving any credit.
In any case he absolutely did credit them, it's easily verifiable: https://web.archive.org/web/20110104040706/http://geohot.com...
Sony sued them both, after all!
I had GPT-4 do some research for you, hopefully you will incorporate it in future comments you make about me. https://chat.openai.com/share/d0fa24e9-3ed7-4b17-8497-24bfdd...
If linking against a different version of the code breaks like that, that sounds like someone did semver wrong. If that happens a lot to you, then oh, I'm sorry about that happening.
On Desktop you have to worry about things like... UIs, sound, Wine, etc.
But I can promise you after reading things like the LKML for decades and a number of different Microsoft blogs, that everyone on this planet experiences flakiness issues at times and has to figure out how to adjust their workload to avoid it until the issue is discovered and fixed.
Actually, no. Obviously they have Nvidia support but in one especially obscure issue he was describing Meta took it as an internal challenge and put three teams on it in competition. Naturally his team won (of course) ;).
Of course all software has flakiness - I'm not taking the ridiculous position that Nvidia is the first company in history to deliver perfect anything.
What I am saying is these anecdotal reports (primarily from Linux desktop hobbyists/enthusiasts) of "It's broken, it doesn't work. Nvidia sucks because it locked up my patched kernel ABC with Wayland XYZ on my bleeding edge rolling release and blah blah blah" (or whatever) are extreme edge cases and in no way representative of 99% of the Nvidia customer base and use cases.
Show me anything (I don't care what it is) and I'll find someone who has a horror story about it. Nvidia gets a lot of heat from the Linux desktop situation over the years and some people clearly hold an irrational hatred and grudge.
Nvidia isn't perfect but it's very hard to argue they don't deliver generally working solutions - actually best of breed in their space as demonstrated by their overwhelmingly dominant market share I highlighted originally.
1. They supported linux when no one else did, 2. I've never experienced instability from their drivers, and as I mentioned before, I've been running their cards under linux since the TNT2 days.