CUDA Is Still a Giant Moat for Nvidia (weightythoughts.com)
This serious bug has been open since May, and AMD doesn't seem to be responding as seriously as it should.
Isn't geohot infamous for stealing other people's work?
PBCAK?
That said, ROCm only officially supports a fraction of its product line, and an odd smattering throughout at that. It's a joke compared to CUDA which will run on damn near anything. And AMD has a long, long history of dogshit drivers (at least on Windows.)
AMD just doesn't seem to give enough of a shit to invest money into securing top talent for this, and NVIDIA will continue to stomp them.
Are you meaning the Sony Playstation hacking where they took legal action against him, or are you meaning other stuff?
Shareholders of AMD should look into it and do some firings of top Executives/CEO until morale improves.
A long time ago AMD decided to 100% focus on budget consumer graphics (including consoles), and that was the right decision at the time. However, being in a low-margin business, it seems they don't have the people (or the budget to last-minute hire) to pump out the R&D for a generic neural network platform without moving people away from their consumer graphics division.
The article is unsatisfying because it doesn't explain WHY CUDA reigns supreme.
One hypothesis put forward is that the main alternative, ROCm, is just not very complete and not very fast. That's a good argument.
Another hypothesis that is not considered is: CUDA reigns supreme because NVIDIA GPUs reign supreme.
But people don't write CUDA code... they write pytorch code?!
To first order nobody writes any CUDA, and even if you do you are probably bad at it. The language is slightly easier to use than OpenCL, but writing really performant code is still a nightmare (a pipeline of asynchronous memory copies from global to shared memory is not easy to program, but it is a requirement for full performance on tensor cores).
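To give a feel for the shape of that pipeline, here is a rough CPU-side analogy in Python, not GPU code. It overlaps "copying" the next tile with computing on the current one (double buffering). The names `load_tile`, `compute`, and `pipelined_sum_of_squares` are made up for illustration, and a real CUDA kernel would use asynchronous copies into shared memory rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tile(data, start, size):
    """Stand-in for an asynchronous copy of one tile into fast memory."""
    return data[start:start + size]


def compute(tile):
    """Stand-in for the math done on a tile that is already loaded."""
    return sum(x * x for x in tile)


def pipelined_sum_of_squares(data, tile_size=4):
    """Overlap loading tile i+1 with computing on tile i (double buffering)."""
    starts = list(range(0, len(data), tile_size))
    if not starts:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Kick off the first copy before any compute happens.
        pending = copier.submit(load_tile, data, starts[0], tile_size)
        for next_start in starts[1:]:
            tile = pending.result()  # wait for the copy in flight
            # Start the next copy, then compute while it runs.
            pending = copier.submit(load_tile, data, next_start, tile_size)
            total += compute(tile)
        total += compute(pending.result())  # drain the last tile
    return total


result = pipelined_sum_of_squares(list(range(10)))
```

Even in this toy form you have to get the prologue, the steady state, and the drain step right, which is a small taste of why the real thing is error-prone.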
So no, the moat really isn't the language. It's not even the libraries, it's the integration of the libraries into third party software like pytorch, jax, etc. This is the truly massive advantage NVIDIA has, and they got it by being early and by being installed in an awful lot of machines.
I used to work in the GPU industry and this sort of view is both pervasive and misguided.
GPUs are immensely complex machines. It is really hard to get them to work, let alone work with high performance.
Because of this, and in spite of the amount of time and resources spent on validation and verification, the hardware often contains flaws. It is the responsibility of the drivers to work around these flaws in various ways. When a flaw hasn't been discovered and worked around yet, you perceive it as the GPU being unstable or crashing.
There is no fast simple solution to this. You need a finely tuned corporate machine from beginning to end. Better hiring processes, better management, better design processes, better verification processes, better software development practices, better marketing and sales, better customer relations. Everything.
This is like saying combustion engines are immensely complex machines when your car suddenly loses power on the highway for no apparent reason and then when you restart the engine it works for another five minutes again. When you drive on normal roads it works flawlessly. It must be the engine, right? After all, it is the most complicated aspect!
Except in reality it is far more likely for it to be a problem in the electronics driving the fuel pump or spark plug.
AMD most likely has some sort of buffer overflow or deadlock in their GPU drivers that is causing difficult to diagnose problems. It is very unlikely that the silicon itself is broken when it works fine for playing video games and it also works fine when your GPU is one of the few officially supported by ROCm.
Pretty bad idea, especially in midst of the AI hype.
why can't xyz company build apps/websites/products that don't have bugs??
I believe LLMs will be commoditised while the compute power will be the next big thing.
not if this moat could be leveraged into a monopoly on AI chips, to the detriment of society.
I want to see competition in this space.
Unfortunately, the market rally of nvidia stock is suggesting that most investors are expecting this monopoly to eventuate.
Therefore, it is in the interest of society to ensure that such a software moat is not established. Look what happened to the web browser when microsoft held a monopoly on it, and look at what is happening with chrome, apple appstore, etc.
Realistically what happened is that after a few decades of development, competitors arose and took the market. In the meantime, Microsoft became rich. Who cares
Can you talk more about this? Would love to understand.
Intel should be shoveling out 16GB Arc graphics cards for free to every graduate program in the country who can fill out a web form. In a couple years, they'd displace NVIDIA.
AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.
Yes, there has been progress. However, when you look at the amount of money that AMD and Intel throw at software vs how much NVIDIA throws at software, it's an instant facepalm moment.
NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.
Many microcontroller companies have terrible software support: no free C/C++ compilers, clunky IDEs, too much reliance on 3rd party software providers, no decent code libraries...
Even if they have software support, the code is bad and bloated. Look at ST's HAL libraries, for example. Thankfully, an open source or free tool often comes to the rescue, usually through the efforts of dedicated individual programmers. But billion-dollar companies relying on such 3rd party tooling seems insane to me.
AMD recently got rid of one of the CUDA compatibility layers instead of extending it.
And they need to release high-RAM versions of their next gaming GPUs. More than anything else that will incentivize people to switch. If they're selling 36 GB while Nvidia is still selling 24 GB, people will do what it takes to move over.
This takes a ton of employees which is hard for a company with a fraction of the software employees of Nvidia. (On that note there's 1185 engineering job postings on the AMD site right now... https://careers.amd.com/careers-home/jobs?categories=Enginee...)
"They" (being AMD) didn't. The person they contracted put in a clause that allowed him to open source the work (years AFTER) AMD stopped paying him.
- Abandoning ZLUDA was maybe not the best choice
- Not accepting the fact that software is as important as hardware is wrong
- Pushing more vram into their cards would attract more people
- Fixing hardware issues (especially the restarts on every fail) should be high priority
Chip War has a great section on how the Soviet Union tried a “just copy/steal” strategy in semiconductors and fell hopelessly behind because of it. It’s a great theoretical idea to just copy/steal and fast-follow, but semiconductors, AI, and other “harder technologies” require building human and intellectual capital that will get better with time. From there, you need to have the prior generation to keep up with ever-increasing complexity and difficulty as these things get more advanced.
I disagree with your section on Huawei and China. China isn't just trying to copy/steal AI. In terms of models, China is a bit behind in LLMs but arguably more ahead in self-driving cars. China is throwing everything at semiconductor manufacturing instead because that's where their bottleneck truly is - not CUDA. Had Huawei had access to TSMC's 5nm and 3nm, they might already be equal to Nvidia in raw GPU prowess. After all, HiSilicon's Kirin already matched/exceeded Qualcomm before the Trump ban. Their 5G chips/implementation were well ahead of anyone else. In software, it's easier for China to adopt a CUDA alternative because China is usually really good at unifying under one vision - especially when they have to.
The problems you generally experience are:
* Inexplicably poor performance
* Poor (and sometimes incorrect) documentation
* Difficulties debugging
* Crashes and hangs
If I'm AMD, I'd spend at least $1 billion/year figuring out the software side.
I can't think of an easier way for AMD to return value to shareholders than eroding CUDA advantage.
Heck, Meta invested something like $100b on VR so far and VR is not nearly the market that AI is.
I started playing around with porting some CUDA code to ROCm/HIP on a Ryzen laptop APU I had. While an "unsupported" configuration (which was understood), it all worked until AMD suddenly and explicitly blocked the ability to run on APU's. Currently the only way to get back to work on that project on that particular computer would be to run a closed-source patched driver from some rando on the internet. Needless to say, I lost interest.
Last I checked, there were only 7 consumer SKU's that could run AMD's current compute stack, the oldest being 1 generation old. Even among the enterprise hardware they only support ~2 generations back. So you can't even grab some old cheap recycled gear on e-bay to hack on their ecosystem.
Meanwhile, I can pull anything with an NVIDIA logo on it from a junkyard and it'll happily run CUDA code that I wrote for the 8800GTX 15+ years ago.
Then there is the quality of hardware, debugging tools, IDE support, supported languages (again, it isn't only PyTorch), and libraries.
I know it's still in development, but I'm curious to know if someone has played around with it for the kind of needs discussed on this page.
PyTorch already does. But if you're saying "NN" and "pytorch" that already means you're outside of the audience for CUDA I'm talking about in the article. My own stuff was usually Bayesian Hierarchical Models, which at least at the time made pytorch completely useless (that was nearly a decade ago though—maybe that specific use case improved).
If you've tried to write actually new (or different enough) NNs or entirely different models, pytorch is too high-level, and sometimes even TF is too. Even aside from that, if you're a maintainer of BLAS or some specific library for sparse MM with very specific distributions that are optimized for it...
Anyway, those are the key cases, but even aside from that, if you've ever tried even with some higher-level libraries to do non-vanilla stuff, nothing works as well as it should. You get random, inscrutable errors; those certainly exist on NVIDIA GPUs/stuff-based-on-CUDA-under-the-hood too, but there are way, way fewer of them. For newer, custom stuff, it is not really that uncommon to hit numerical overflows or other completely breaking problems on alternative backends that don't happen on the CPU or CUDA backend, where the same code works just fine. Or the CUDA backend is just ridiculously faster. If you're doing something annoying, new, and complicated enough, there's no point in taking the aggravation.
The people who write the stuff that is used in PyTorch or other libraries definitely write CUDA code (in C++ etc). And then the people who use PyTorch just build on top of that.
I deliberately tried to keep it accessible and have non-technical (or just non-software) audiences also be able to get an intuition for why CUDA has such strong lock-in. Otherwise, the pushback I've often gotten is "just re-write it" or "it's just software", and if it were that simple, people wouldn't need to be yelling so much at AMD across so many comments. That pushback basically comes from people who can't fathom why software technical debt can ever be a thing. Or, if it is, China has infinite money and time anyway.
A high-level analysis should say that Huawei, AMD, and Intel all should easily invest enough to make this all work and compete with CUDA to push their hardware platforms. The reality is decentralized decision-making from users also makes it more of an expensive, uncertain bet that people will adopt. A bunch of the lower-level, underlying libraries that things are built on AND the researchers who do bleeding-edge research still have a huge amount of experience in and stuff built on CUDA.
At least say why people wouldn’t be good at it. The documentation is poor, the GPUs are a black box or anything in that vein. Then they can help you learn instead of preemptively dismiss it.
I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS.
NVidia pays for an army of exceptionally skilled folks to write these high performance kernels, working hand in hand with the architects that design the hardware, and with access to various sophisticated tools and performance models beyond what is available to the general public.
It would be like trying to compete against Olympians, to use an analogy that we can all understand.
You probably won't like this, but I'm also going to suggest you take a look at the HN guidelines about assuming good faith, and around responding to the argument instead of calling names. My comment might have irked you but that's not actually a basis for deciding I'm anti intellectual, that I'm protecting my ego, and that I really just need someone to help me learn.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
To give a feel, while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn about how their hardware and code works together for surprisingly basic scenarios.
It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:
* It's not just writing fast low-level code, but knowing which algorithm to implement. So you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level, unknown to most programmers, and also specific to hardware models; think of scenarios like NUMA-aware data parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manipulate sparsity and numerical stability.
* You ideally will write for 1 or more generations of architectures. And each architecture changes all sorts of basic constants around memory/thread/etc counts at multiple layers of the architecture. If you're good, you also add some sort of autotuning & JIT layers around that to adjust for different generations, models, and inputs.
* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: Maintaining determinism, thinking through memory life cycles, enabling async vs sync frameworks to call it, handling multitenancy, ... . In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.
* AI has made it even more interesting nowadays. The types of processing on GPUs are richer now, multi+many GPU is much more of a thing, and disk IO as well. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
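As a toy illustration of the autotuning layer mentioned in the list above, here is a minimal Python sketch that times one "kernel" under several candidate block sizes and keeps the fastest. Everything in it (`blocked_sum`, `autotune`, the candidate sizes) is hypothetical; a real GPU autotuner would also have to cope with compilation, warm-up, measurement noise, and per-architecture caches of results.

```python
import time


def blocked_sum(data, block_size):
    """Sum data in chunks of block_size; block_size is the tunable knob."""
    total = 0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time each candidate setting and return (fastest, all_timings)."""
    timings = {}
    for candidate in candidates:
        best = float("inf")
        for _ in range(repeats):
            begin = time.perf_counter()
            kernel(data, candidate)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - begin)
        timings[candidate] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings


data = list(range(100_000))
candidates = [16, 256, 4096]
best_block, timings = autotune(blocked_sum, data, candidates)
```

The winning setting depends on the machine and the input, which is exactly why this has to be redone per GPU generation and model.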
This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd, it's basically the precursor and GPU / high-performance / energy-efficient / low-latency version of what the duckdb folks recently began on the (easier) CPU side for columnar analytics.
It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to do it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having lived building & using these systems for 15 years, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD nor Intel due to the revolving door of poor software culture.
Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around b/c many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.
Edit: If not obvious, much of our code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?
That's an extreme stretch, and far from the truth.
Many people write CUDA, both in industry and academia.
I think Nvidia sees it too. That's why they're moving upstream by providing the entire stack: CUDA, GPUs, interconnect chips, networking chips, racks, OS, software, models.
I think the "CUDA moat" people like OP are underselling Nvidia. They're positioning themselves as the full-stack AI provider. Forget CUDA.
- Great at legacy C++ code.
- Great at new C++ code.
- Great at embedded/high performance/distributed code.
- Are experts in Linear Algebra and Calculus
- Are competent at Machine Learning and similar problems.
Now imagine, that after you find ~10-50 competent senior engineers who can each segment and train 1-5 engineers, you also need to hire 10-20 managers, PMs and directors who are smart enough to do more than "copy NVidia's offering from last year", and wise enough to still build a 1:1 compatibility layer.
Apple is likely seeing more traction on their metal API by virtue that it is reasonably well guaranteed to be around in ~5 years, and is common on multiple device platforms that students/devs use or customers deploy.
It gets even stranger when considering that as major GPU makers, both AMD and Intel have lots of access to such talent.
My personal experience shows CUDA to in fact be a very deep moat. In ~12 years of CUDA and ~6 of ROCm (since Vega) I’ve never met a professional who says otherwise, including those at top500.org AMD sites.
From what I’ve seen online this take really seems to come from some kind of Linux desktop Nvidia grudge/bad experience or just good ‘ol gaming/desktop team red vs green vs blue nonsense.
Many things can be said about Nvidia and all kinds of things can be debated but suggesting that Nvidia has > 90% market share simply and solely because people drink Nvidia kool-aid is a wild take.
Isn't that what HIPIFY does? https://github.com/ROCm/HIPIFY
https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/...
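For anyone wondering what source-level translation looks like in principle, here is a deliberately tiny Python toy that renames a handful of CUDA runtime calls to their HIP counterparts. It is not how HIPIFY is implemented; the mapping table is a hand-picked subset, and `toy_hipify` is a made-up name. A mechanical rename like this also says nothing about whether the ported code performs well.

```python
import re

# A tiny, hand-picked subset of CUDA runtime names and their HIP counterparts.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

# Match whole identifiers only, trying longer names first.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def toy_hipify(source):
    """Replace known CUDA identifiers in source text with HIP identifiers."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)


cuda_source = (
    "cudaMalloc(&d_x, n); "
    "cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice); "
    "cudaFree(d_x);"
)
hip_source = toy_hipify(cuda_source)
```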
Thank you for sharing your opinion. My experience writing GPU device drivers was different.
Drivers are relatively simple compared to the underlying hardware and the hardware programming interface between the two reflects that. As a result of that, driver developers spend a ton of their time chasing down hardware bugs. Drivers are also intrinsically simpler to debug, not just because they are smaller but also because you often have better tools to inspect what is going on.
Another factor to consider is that software bugs are fixed, while hardware bugs are most often worked around in software. This is done out of necessity, because the process of spinning a new hardware revision is extraordinarily expensive and avoided at all cost.
But again, it's just how things went down in my personal experience and yours may be different.
Metal has 20% of the desktop market, and the whole of the iOS/iPad/watchOS markets combined.
Even with Android market share, many folks keep using OpenGL ES, because Vulkan tooling on Android sucks and isn't available to Java/Kotlin developers like OpenGL ES is, so only game engines like Godot/Unreal/Unity make use of Vulkan in practice.
The answer is that the entrenchment of the tools, software and inertia of a de facto standard is what prevents new entrants. The time to stop it is to nip it in the bud. Prevent monopoly from forming, rather than hope that after the monopoly forms, some competitor will break it.
instead, what you have today is this:
> You may not reverse engineer, decompile or disassemble any portion of the output generated using SDK elements for the purpose of translating such output artifacts to target a non-NVIDIA platform.
from https://docs.nvidia.com/cuda/eula/index.html#limitations section 8
The point I was glibly trying to get across was that even a small effort on the part of AMD to treat the SW side as seriously as NVidia does would have yielded great benefits, and not have left them so far behind.
Also, there is a lot of work going on in the gcc & llvm toolchain to not only use OpenMP to target accelerators in computationally intensive loops but, in the case of llvm, to also target tensor instructions for more efficient code generation (https://lists.llvm.org/pipermail/llvm-dev/2021-November/1537...).
It took the AI folk less than 18 months to almost completely move away from CUDA to Tensorflow and then PyTorch... LLVM, imho, is going to do the same for Sci/Eng and general code bases in the next 2 years.
But with GPU target support in LLVM, in most cases you won't need to resort to CUDA anymore.
Brought down some simulations from about 30min to under 1s.
My point in the article was basically the class was "indoctrinating" (too strong, but you get the point) the future ML researchers in the superiority of using CUDA and spending NVIDIA company resources to continuously do so in these classes, year after year.
Even if you could compile CUDA for Intel and AMD, it wouldn't perform well. When you program a GPU you aren't just writing task specific code, you are also writing hardware specific code. So having developer mindshare matters much more than having a nice programming language.
In ML many people write pytorch and not CUDA. But even in ML the choice of precision is driven by the data types Nvidia can deal with efficiently - this is a moat which has nothing to do with CUDA.
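To make the precision point concrete, here is a small Python sketch using only the standard library's `struct` module, which can round a value to IEEE 754 half precision (format code `e`) or single precision (`f`). The helper names are made up, and this only illustrates what the narrower type costs numerically; it says nothing about which types any particular GPU runs fast.

```python
import struct


def to_half(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_single(value):
    """Round a Python float to IEEE 754 single precision and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def fits_in_half(value):
    """Return False when the value is too large for half precision."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


half_error = abs(to_half(0.1) - 0.1)      # rounding error at 16 bits
single_error = abs(to_single(0.1) - 0.1)  # rounding error at 32 bits
rounded = to_half(2049.0)                 # not every integer survives
overflow_ok = fits_in_half(70000.0)       # beyond the half-precision range
```

A model that is numerically fine in one type can lose small updates or overflow in another, so the data types the dominant hardware handles well end up shaping what people train with.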
The world is deeper than just assembly and BLAS tuning, and you can get extremely far in CUDA just by gluing together the primitives they give. Python is popular in the AI/ML space, but far from the only way to do that.
If there is a niche that is at the intersection of multiple specialties, and it includes GPU acceleration, there is a good chance it is ripe for a startup to get an early mover advantage. Eg, real-time foundation models for audio around non-english/non-chinese that works small & offline in cars.
Unfortunately, Nvidia has a culture of open sourcing all CUDA code, so if any startup shows something works commercially, Nvidia will rewrite, likely ultimately better, and give away for free, so more companies will do it and buy more GPUs.
What do you think about Apple's Metal?
But it's not easy for the hw co's. OpenCL was more of a hw company thing (Intel, AMD, mobile chip co's), and while they spend billions on adventures all the time, their SW leadership culture has been bad. They fail to do sustained & deep ecosystem investment, and instead look like small feudal orgs that get their projects pulled arbitrarily whenever the VPs rearrange themselves. For example, given that Intel brought back its old CEO, that was a scary signal to me for this front. Intel specifically had the internal talent, I'm not sure if they still do, just not at the management level, and definitely not culturally at the highest leadership level.
Jensen at Nvidia has always been a special CEO here, even when they were helping game companies make their engines, and I'm guessing that taught him the value of long-term vertical SW & ecosystem investment. Instead of Intel unifying on x86 and c++ (compilers, vtune, Intel tbb, ...), and letting Microsoft / Linux / DB people go higher, Jensen went all the way up the stack to get at full utilization, and unified teams internally on that over 1-2 decades.
Apple is a funnier case. I can see them doing it and then pulling the plug. Eg, Chris Lattner making Swift and then they failed to retain him, and their revolving door of frameworks overall. Internally, they do have the technical talent and $, but I don't understand the culture and commercial alignment.
Finally, I do think the increasing importance of AI inferencing, yet simultaneous simplicity of it, has opened a disruption opportunity here. We are still at a tiny % of where it is going. The ONNX, pytorch, transformers, etc ecosystems are still in their early days from that perspective. It's fast for a hardware co like Groq to port a new model. So I don't rule out big changes here, and those being used to drive the rest of the ecosystem, like your q on ROCm.
You should not confuse AMD's general & long-standing indifference/incompetence wrt SW with the actual difficulty of providing a portable SW path for acceleration. As Woody Allen once said: "90% of success is showing up"
But what happened in AI, when in a very short period of time almost everyone moved away from writing their models directly in CUDA to writing them in frameworks like Tensorflow & PyTorch, is all the evidence anyone needs to show just how unsound that SW obstacle is.
Ah yes, pytorch:
1) Check issues, PRs, etc on torch Github. Considering market share ROCm has a multiple of the number of open and closed issues. There is still much work to be done for things as basic as overall stability.
2) torch is the bare minimum. Consider flash attention. On CUDA it just runs, of course, with sliding window attention, ALiBi, and PagedAttention. ROCm fork? Nope. Then check out the xFormers situation on ROCm. Prepare to spend your time messing around with ROCm, spelunking GH issues/PRs/blogs, etc and going one by one through frameworks and libraries instead of `pip install` and actually doing your work.
3) Repeat for hundreds of libraries, frameworks, etc depending on your specific use case(s).
Then, once you have a model and need to serve it up for inference so your users can actually make use of it and you can get paid? With CUDA you can choose between torchserve, HF TEI/TGI, Nvidia Triton Inference Server, vLLM, and a number of others. vLLM has what I would call (at best) "early" support that requires patches to ROCm, isn't feature complete, and regularly has commits to fix yet another show-stopping bug/crash/performance regression/whatever.
Torch support is a good start but it's just that - a start.
One of the first teams that ported LAPACK to CUDA, as CULA, is apparently being paid handsomely by Nvidia [1],[2].
Interestingly, DCompute is a little known effort to support compute on CUDA and OpenCL in D language, and it was done by a part-time undergrad student [3].
I strongly believe we need a very capable language to make advancement much easier in HPC/AI/etc, and the D language fits the bill very well and then some. Heck, it even beat the BLAS libraries that other so-called data languages, namely Matlab and Julia, still heavily depend on for their performance to this very day. It did so in style back in 2016, more than seven years ago [4]. The DCompute implementation by the part-timer in 2017 actually depended on this native D implementation of these linear algebra routines in Mir [5].
[1] CULA: hybrid GPU accelerated linear algebra routines:
https://www.spiedigitallibrary.org/conference-proceedings-of...
[2] CUDA Spotlight: John Humphrey:
https://www.nvidia.com/content/cuda/spotlights/john-humphrey...
[3] DCompute: GPGPU with Native D for OpenCL and CUDA:
https://dlang.org/blog/2017/07/17/dcompute-gpgpu-with-native...
[4] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
[5] DCompute: Native execution of D on GPUs and other Accelerators:
But I'm one of those old-school HPC guys who believes that libraries are mostly irrelevant, and absolutely no substitute for compilers and targeted code generation.
Julia is cool, btw. It could very well end up supplanting Fortran, once they fix the poor performance code generation issues.
Also until all of these libraries are made amenable to kernel fusion or just sometimes prologue/epilogue features they can be beaten on memory bandwidth with pretty lowly-optimized kernels with no global memory traffic.
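A toy way to see the memory-traffic argument, in plain Python rather than GPU code: three separate elementwise "kernels" each read and write the whole array, while a fused version reads and writes it once. The function names and the read/write counters are invented for illustration; on a GPU the intermediate arrays would be round trips through global memory.

```python
def scale(xs, traffic):
    """Kernel 1: multiply by 2, reading and writing the full array."""
    traffic["reads"] += len(xs)
    traffic["writes"] += len(xs)
    return [2.0 * x for x in xs]


def shift(xs, traffic):
    """Kernel 2: add 1, again a full read and a full write."""
    traffic["reads"] += len(xs)
    traffic["writes"] += len(xs)
    return [x + 1.0 for x in xs]


def relu(xs, traffic):
    """Kernel 3: clamp negatives to zero, a third full pass."""
    traffic["reads"] += len(xs)
    traffic["writes"] += len(xs)
    return [max(x, 0.0) for x in xs]


def unfused(xs, traffic):
    """Chain the three library-style kernels with intermediate arrays."""
    return relu(shift(scale(xs, traffic), traffic), traffic)


def fused(xs, traffic):
    """One hand-written pass doing the same math with no intermediates."""
    traffic["reads"] += len(xs)
    traffic["writes"] += len(xs)
    return [max(2.0 * x + 1.0, 0.0) for x in xs]


data = [-2.0, -0.5, 0.0, 1.5]
unfused_traffic = {"reads": 0, "writes": 0}
fused_traffic = {"reads": 0, "writes": 0}
unfused_out = unfused(data, unfused_traffic)
fused_out = fused(data, fused_traffic)
```

Both versions produce the same result, but the unfused chain moves three times as much data, which is the whole game when the operation is bandwidth-bound.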
I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing but they're still not covering a big chunk of the original libraries.
Also, a lot of problems that are amenable to GPU compute are not expressed in blas/dnn and can still be very, very simply expressed as CUDA code, and still extract huge performance gains against CPUs, without a chance that the Olympians will ever take an interest in your problem space.
I know you probably don't mean to say that Nvidia can't write good CUDA, but this does sort of illustrate how hard that is. I've seen similar cases (tiny matrix multiplied by enormous matrix) in which it was possible to write something faster than Nvidia's library. I'm not sure if this has been addressed since though.
> they can be beaten on memory bandwidth with pretty lowly-optimized kernels
This is partly why I believe most CUDA code probably isn't "good" - there's this enormous gulf between acceptable and good which often isn't worth crossing.
I also meant to say that the domain is full of low-hanging fruit if your problem falls into whatever NVIDIA didn't optimize deeply. An intern may beat the cuXXX libraries with a little work and you can work up to max perf, yes, with serious effort.
There are probably thousands of man-hours sunk into BLAS on Intel hardware, and anyone who has seriously tried AVX2/AVX512 knows it's hard to reach actual max perf on all problems. Yet I don't read 'only Intel experts can write efficient code'. It's no more true for CUDA than for other parallel or memory-weird architectures I've worked on. Yes it's different, but getting max perf has always been hard on any modern hardware.
As for the gulf between acceptable and good, the problem is similar here too: people stop when they've reached their goal or feel they can scale more efficiently by other means. I really don't see the difference with heavily optimized x86 stuff. We keep seeing new stuff you can do to improve AVX512 code or new places where you can apply it (JSON parsing, utf validation...) and it's been out for a while too. There hasn't been any free lunch there for a long, long time.
Congratulations, it sounds fascinating. Looking forward to seeing your contributions to pyTorch.
Some things are not heavily optimized by NVIDIA, it's fine, and a good thing too that they can focus their effort on what's useful to the overall community.
What I'm saying is that very often a naive hand-written kernel, optimized by a non-expert for some months, can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...
I don't know about pyTorch (I was talking about niche use cases?) but TensorRT allows custom kernels, and it's worth using them and plonking in a house-implemented kernel if you know what your bottleneck is and no one has bothered writing a less-generic version yet... again, intern-level competency (not senior CUDA optimizer).
I really wish some modern language would try to supplant Fortran for HPC, and personally my bet is on D.
[1]DMD Compiler as a Library: A Call to Arms:
If I have lost track of the conversation, please accept my apologies.