CUDA-oxide: Nvidia's official Rust to CUDA compiler (nvlabs.github.io)

424 points by adamnemecek 52 days ago | 118 comments

arpadav 52 days ago |

This is amazing.. ive been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.

im especially curious how build times would compare? Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow. coincidentally, just last week i was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts - but you still end up paying for expensive custom nvcc invocations (e.g. candle by hugging face calls custom nvcc command in their kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...

the__alchemist 52 days ago | |

Cudarc slaps!

> Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow.

I anecdotally haven't hit this; see the `cuda_setup` crate I made to handle the build scripts; it is a simple `build.rs` which only recompiles if the file changes, and it's a tiny compile time (compared to the rust CPU-side code)
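For anyone wondering where the "only recompiles if the file changes" part comes from: that is Cargo's `cargo:rerun-if-changed` directive. Below is a minimal sketch of the idea, not the actual `cuda_setup` code. The helper names, the kernel path and the choice of `--ptx` output are assumptions made for illustration.

```rust
use std::path::Path;
use std::process::Command;

/// Directive that tells Cargo to re-run the build script only when
/// `kernel` changes on disk.
fn rerun_directive(kernel: &str) -> String {
    format!("cargo:rerun-if-changed={}", kernel)
}

/// Build (but do not run) an nvcc invocation that compiles `kernel` to PTX
/// inside `out_dir`. The flag choice here is an assumption for illustration.
fn nvcc_command(kernel: &str, out_dir: &str) -> Command {
    let stem = Path::new(kernel)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("kernel");
    let mut cmd = Command::new("nvcc");
    cmd.arg("--ptx")
        .arg(kernel)
        .arg("-o")
        .arg(Path::new(out_dir).join(format!("{}.ptx", stem)));
    cmd
}
```

In a real `build.rs`, `main` would print the directive with `println!`, read `OUT_DIR` from the environment (Cargo sets it for build scripts) and call `.status()` on the returned command.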

arpadav 52 days ago | | |

i'll have to check this out, thanks!

jauntywundrkind 52 days ago | |

Do other people agree cuda-oxide looks like a near drop-in replacement for cudarc?

That would be amazing, but imo probably not; the two look more complementary.

I am curious what distinguishes cuda-oxide, beyond it being totally under NVIDIA's control.

arpadav 52 days ago | | |

perhaps not drop-in, but all my workflows with cudarc have always been "i make cuda kernel, i use cudarc for ffi to said kernels, i call via rust" - which for this case is pretty analogous

briefly looking at the repo, looks like the main workflow is using rustc-codegen-cuda to convert rust -> MIR -> pliron IR -> LLVM IR -> PTX, which is embedded in the host binary, where then cuda-core loads embedded PTX at runtime onto the GPU

but, if you arent directly making cuda kernels and just want cudarc for either calling existing kernels or other cuda driver api access then cudarc is lighter-weight option? or just use one of the sub-crates in this repo like cuda-core for those apis

the__alchemist 52 days ago | | |

I am observing the same from the article... is it heavily inspired by Cudarc, i.e. is this intentional, or are we reading too much into this, given Cudarc is a light abstraction over the CUDA api?

cyber_kinetist 52 days ago |

I'm quite interested in how they dealt with Rust's memory model, which might not neatly map to CUDA's semantics. Curious what the differences are compared to CUDA C++, and if Rust's type system can actually bring more safety to CUDA (I do think writing GPU kernels is inherently unsafe, it's just too hard to create a safe language because of how the hardware works, and because of the fact that you're hyper-optimizing all the time)

arpadav 52 days ago | |

the main 4 i see are:

1. use-after-free, drop semantics vs manual cudaFree

2. kernel args enforced using `cuda_launch!` whereas CPP void* args is just an array of pointers, validating count only

3. aliased mutable writes. e.g. C++ can have more than one thread writing out[i] with the same i and this will compile. but DisjointSlice<T> is indexed with a ThreadIndex, which doesn't have any public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and can only be obtained through the `index_1d`, `index_2d` and `index_2d_runtime` APIs

4. im pretty sure you can cudaMemcpy a std::string, or literally any other non-trivially-copyable type, and "corrupt" its state, making it unusable. here it ONLY accepts DisjointSlice<T>, scalars, and closures (https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...)

but most of the nitty gritty is in these sections

* https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...

* https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...

edit: that being said, it's not like this catches everything; it just looks to give many more guardrails against UB than raw .cu files
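To make point 3 concrete, here is a toy CPU-side model of the "index token without a public constructor" idea. It is only a sketch: the names mirror the ones above, but every signature here (`indices`, `get`, `write`) is made up for illustration and is not the cuda-oxide API.

```rust
mod toy {
    /// Index token with a private field: code outside this module cannot forge one.
    pub struct ThreadIndex(usize);

    impl ThreadIndex {
        pub fn get(&self) -> usize {
            self.0
        }
    }

    /// Stand-in for "ask the hardware which thread I am": hands out each
    /// index exactly once (hypothetical helper).
    pub fn indices(n: usize) -> impl Iterator<Item = ThreadIndex> {
        (0..n).map(ThreadIndex)
    }

    /// Output buffer that can only be written through a `ThreadIndex`.
    pub struct DisjointSlice<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> DisjointSlice<'a, T> {
        pub fn new(data: &'a mut [T]) -> Self {
            Self { data }
        }

        /// Consumes the token, so each index is good for at most one write.
        /// Returns false instead of panicking when the index is out of range.
        pub fn write(&mut self, idx: ThreadIndex, value: T) -> bool {
            match self.data.get_mut(idx.0) {
                Some(slot) => {
                    *slot = value;
                    true
                }
                None => false,
            }
        }
    }
}

/// "Kernel body" living outside the module: it can use tokens but not create them.
/// `toy::ThreadIndex(7)` would not compile here because the field is private.
fn square_all(input: &[u32], out: &mut [u32]) {
    let mut out = toy::DisjointSlice::new(out);
    for idx in toy::indices(input.len()) {
        let i = idx.get();
        out.write(idx, input[i] * input[i]);
    }
}
```

The point is only that the type system, rather than reviewer discipline, decides who may write where; on a real GPU the tokens would come from the thread/block IDs instead of a loop.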

wrs 52 days ago | |

This is explained in some detail in the docs. There is a safe layer, a mostly safe layer, and an unsafe layer. Some clunkiness is needed for safe-yet-parallel work that they couldn’t easily fit into the Rust Send/Sync model.

simonask 51 days ago | |

FWIW, Rust’s memory model is more or less completely identical to C++’s, by design. Atomics work the same, there’s provenance, and so on.
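As a small illustration of the atomics point, the release/acquire message-passing pattern looks the same in Rust as it does with `memory_order_release` / `memory_order_acquire` in C++11. This is a plain CPU-side sketch using only `std`; the function name is made up.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

/// One thread publishes a value, another waits for the flag and reads it.
/// The Release store / Acquire load pair guarantees the reader sees 42.
fn publish_and_read() -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        // Writer: store the payload, then publish it with Release.
        s.spawn(|| {
            data.store(42, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        });
        // Reader: spin until the Acquire load observes the flag.
        let reader = s.spawn(|| {
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        });
        reader.join().unwrap()
    })
}
```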

Whether it is a convenient language for GPU programming probably remains to be seen, but I definitely wouldn’t be surprised if you could make a decent DSL-like API for writing safe code that leverages the full spectrum of GPU oddities. That’s what CUDA is, right?

pjmlp 51 days ago | | |

Originally CUDA hardware was designed without a specific memory model, after C++11, NVidia went into a multi year effort to redesign the hardware to match C++ memory model semantics.

CppCon 2017: "Designing (New) C++ Hardware”

https://www.youtube.com/watch?v=86seb-iZCnI

the__alchemist 52 days ago | |

I think it depends on the objective. My pattern-matching brain says there will be interest in addressing this.

From my perspective of someone who writes applications in Rust and sometimes wants to use GPU compute in these applications: I don't care. If we can leverage the memory model or ownership model in a low-friction way, that's fine. If it makes it a high friction experience, I would prefer not to do it that way.

The baseline is IMO how Cudarc currently does it. I don't think there is much memory management involved; it's just imperative syntax wrapping FFI, and some lines in the build script to invoke nvcc if the kernels change.

raincole 52 days ago |

I wonder what it means for Slang[0]. Presumably the point is that people want to do GPU programming with a more modern language. But now you can just use Rust...

(Disclaimer: I like Slang a lot.)

[0]: https://shader-slang.org/

pjmlp 52 days ago | |

They serve different audiences; Slang folks are more interested in graphics programming, not AI algorithms.

Also shading languages are more user friendly given their features.

Finally, NVIDIA already has Slang in production and those folks aren't going to rewrite shader pipelines into Rust.

mohamedkoubaa 52 days ago | | |

I am working on a graphics library that integrates slang into rust: https://github.com/koubaa/goldy

There's library code in Rust that manages GPU memory and schedules pipelines, and uses Slang reflection to ensure memory layouts between Rust and shaders match.

Oh and it supports metal/vulkan/dx12

simonask 51 days ago | |

Writing shaders is materially different from writing CUDA kernels, at least for now. Shaders are simultaneously higher and lower level, and have a lot of idiosyncrasies as a result of being designed for a specific and limited set of driver/GPU features.

Stuff like descriptor sets, resource registers, dispatch limitations, …

tiffanyh 52 days ago |

Re: Rust (and "safe" programming languages).

Does anyone have more details on NVIDIAs use of Spark/Ada?

All I can find is what's listed below:

https://www.adacore.com/case-studies/nvidia-adoption-of-spar...

NobodyNada 52 days ago | |

They gave a detailed talk last DEF CON: https://www.youtube.com/watch?v=KhWtkZmOPn4

cpeterso 52 days ago | |

Here's a recording of a 2020 presentation ("Securing the Future of Safety and Security of Embedded Software") from NVIDIA at the AdaCore conference:

https://www.youtube.com/watch?v=2YoPoNx3L5E

alecco 51 days ago |

> directly to PTX

Weird. There's a recent NVIDIA MLIR that is quite good and fast. Or they could target the even easier and more recent/fashionable tile IR [1] used by CuTile [2] (a little bit higher level but significantly easier to target, only loses on epilogue fusion and similar).

[1] https://docs.nvidia.com/cuda/tile-ir/

[2] https://developer.nvidia.com/cuda/tile

debugnik 52 days ago |

> (em dash) no DSLs, no foreign language bindings, just Rust.

Official CUDA port and they couldn't even bother with the introductory paragraph.

Okay, I'll try to ignore it and read the docs. Hey a custom IR, this sounds interesti-

> MLIR’s implementation, however, is C++ with a side of TableGen, a build system that requires you to compile all of LLVM, and debugging sessions that make you question your career choices.

I can't take this industry seriously anymore.

aiscoming 52 days ago | |

if they didnt use AI for their webpage people would say "why doesnt NVIDIA write its website and documentation with AI? don't they believe their own story about AI factories and employees managing thousands of agents doing the work for them?"

this is exactly on brand dog-fooding I would expect from an AI hyper

debugnik 52 days ago | | |

Literally no one would ever say that simply for editing the LLMisms away.

nialv7 52 days ago | |

I think the whole codebase was more or less written by AI...

segmondy 52 days ago | | |

that ship has long sailed; it no longer matters. saying a codebase or an article was written with AI doesn't mean much: it could be good, it could be bad. folks often say it to generate outrage, but that means nothing. is the codebase great, good, bad, terrible? that's the only thing that matters.

argee 51 days ago | |

They also named it CUDA-oxide, flaunting their ignorance of what Rust lang is named after (fungi, not oxidation).

debugnik 51 days ago | | |

That's a lost battle even in the Rust community: Firefox's oxidation, Ferrous Systems, Redox, OxidOS, OxCaml (OCaml extensions partly inspired by Rust)… and every crate referencing oxidation in its name.

LtdJorge 51 days ago | | |

Yes, but have you seen the official logo? :)

mathisfun123 52 days ago | |

What exactly are you upset about? Someone observing that MLIR is extremely complex and dependent on LLVM...?

awestroke 52 days ago | | |

The quoted writing is AI slop, and OP is reacting to the fact that they did not write even the introductory text themselves (or at least bother to edit out clear AI/slop indicators)

rogermeier 52 days ago |

TileLang https://github.com/tile-ai/tilelang and stuff like Tile Kernels https://github.com/deepseek-ai/TileKernels will make CUDA obsolete one day.

jordand 52 days ago | |

CUDA is nearly 20 years old, and is not going anywhere, for many years to come

mathisfun123 52 days ago | |

this dude is a distinguished engineer at siemens commenting the dopiest/reddit level takes. lolol.

rogermeier 52 days ago | | |

Agreed, it's not related to the Rust-to-CUDA compiler, you are right! But I have to say it's worth looking at the upcoming new stuff, as this is kind of a "wow, Rust on good old CUDA".

AnimalMuppet 52 days ago | |

That's quite a claim for very little evidence.

arpadav 52 days ago | |

is this even comparable? lol

the__alchemist 52 days ago |

Does anyone know if this will let you share structs between host and device? That is the big thing missing so far with existing rust/CUDA workflows. (Plus the serialization/bytes barrier between them)

nihalpasham 51 days ago | |

Yes, absolutely. That is one of the advantages of cuda-oxide being single-source Rust: the host and device code can refer to the same Rust types, and the compiler has enough information to make the device-side layout match what rustc chose on the host.

So the intended workflow is not “define a Rust struct on the host, define a matching CUDA C++ struct for the device, then serialize bytes between them.” It is much closer to “define `MyStruct` once in Rust, put a `DeviceBuffer<MyStruct>` on the GPU, and write kernels that take `&[MyStruct]`, `*const MyStruct`, etc.”

There are two important pieces under the hood:

1. At the kernel boundary, cuda-oxide scalarizes aggregate parameters where needed. For example, slices become pointer + length, and simple structs can be flattened into fields for launch ABI purposes.

2. For actual struct layout, we use rustc’s computed layout rather than assuming declaration order or a C ABI. That matters because Rust is allowed to reorder/pad `repr(Rust)` structs. The device lowering carries those offsets/padding through so field access on the GPU matches the host-side layout.
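To illustrate both points on the host side only: the sketch below measures the field offsets the compiler actually chose for a `repr(C)` and a `repr(Rust)` version of the same struct, and splits a slice into the pointer + length pair described in point 1. The struct and function names are made up for the example; this is not cuda-oxide code.

```rust
#[repr(C)]
struct ReprC {
    a: u8,
    b: u32,
    c: u16,
}

// Default `repr(Rust)`: the compiler is free to reorder these fields.
struct ReprRust {
    a: u8,
    b: u32,
    c: u16,
}

/// Offsets of (a, b, c) as laid out by the compiler, measured on a live value.
fn offsets_c() -> [usize; 3] {
    let v = ReprC { a: 0, b: 0, c: 0 };
    let base = &v as *const ReprC as usize;
    [
        &v.a as *const u8 as usize - base,
        &v.b as *const u32 as usize - base,
        &v.c as *const u16 as usize - base,
    ]
}

fn offsets_rust() -> [usize; 3] {
    let v = ReprRust { a: 0, b: 0, c: 0 };
    let base = &v as *const ReprRust as usize;
    [
        &v.a as *const u8 as usize - base,
        &v.b as *const u32 as usize - base,
        &v.c as *const u16 as usize - base,
    ]
}

/// A slice "scalarized" into the two plain values a launch could carry.
fn scalarize<T>(s: &[T]) -> (*const T, usize) {
    (s.as_ptr(), s.len())
}
```

With `repr(C)` the offsets follow declaration order (0, 4, 8 here). With `repr(Rust)` they are whatever rustc picked, which is why a device-side lowering has to take them from the compiler rather than assume them.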

So for plain data structs, nested structs, numeric fields, arrays, etc., yes, this is very much the goal: share the type directly instead of maintaining a separate CUDA representation or crossing a bytes/serialization boundary.

The caveat is the usual one: this does not make arbitrary host-owned Rust heap graphs GPU-addressable. A `Vec`, `String`, `Box`, trait object, or host pointer still contains an address, and that address has to refer to memory the GPU can actually access. For those cases you still need device allocation, unified/HMM memory, or a GPU-friendly representation.
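A quick host-only way to see the difference (a sketch with made-up type names): a struct with inline data carries its payload in its own bytes, while a struct holding a `Vec` is just a pointer/capacity/length header whose elements live somewhere else on the host heap.

```rust
use std::mem::size_of;

/// All data stored inline: copying the bytes copies the payload.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct InlinePoints {
    xs: [f32; 4],
}

/// Only a (pointer, capacity, length) header is stored inline.
struct HeapPoints {
    xs: Vec<f32>,
}

fn inline_bytes() -> usize {
    size_of::<InlinePoints>()
}

fn heap_header_bytes() -> usize {
    size_of::<HeapPoints>()
}

/// True when the element storage lies outside the struct's own bytes,
/// i.e. a byte copy of the struct would carry a host address, not the data.
fn stores_elements_elsewhere(p: &HeapPoints) -> bool {
    let start = p as *const HeapPoints as usize;
    let end = start + size_of::<HeapPoints>();
    let elems = p.xs.as_ptr() as usize;
    elems < start || elems >= end
}
```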

But for the common “I have a Rust data type and want kernels to consume/update arrays of it” case: yes, that is exactly the kind of friction cuda-oxide is meant to remove.

adamnemecek 52 days ago |

Here's the repo link https://github.com/NVlabs/cuda-oxide

foo-bar-baz529 52 days ago |

One thing I’ve been wary about with Rust for CUDA is the bit of overhead that Rust adds that is usually negligible but might matter here, like bounds checks on arrays. Could it cause additional registers to get used, lowering the concurrency of a kernel?
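For context, here is the usual CPU-side Rust idiom for this concern, as a sketch. Whether and how it carries over to register pressure in cuda-oxide kernels is not something I can confirm; the function names are made up.

```rust
/// Indexed form: each `x[i]` / `y[i]` carries a bounds check unless the
/// optimizer can prove it is unnecessary.
fn saxpy_indexed(a: f32, x: &[f32], y: &mut [f32]) {
    let n = y.len().min(x.len());
    for i in 0..n {
        y[i] = a * x[i] + y[i];
    }
}

/// Iterator form: `zip` stops at the shorter slice, so there is no
/// per-element index to check in the first place.
fn saxpy_zip(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = a * *xi + *yi;
    }
}
```

Both compute the same result; the second form simply gives the compiler less to prove.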

TheMagicHorsey 52 days ago |

Oh lord. If this is the trend, I probably can't avoid improving my Rust language knowledge in the long term. I hate reading Rust so much right now. I guess I just have to get over that hump.

dbdr 52 days ago | |

Learning Rust is more akin to learning a new programming paradigm (e.g. functional when you only know imperative) than a new language with different syntax only. If you ignore that and try to jump directly to writing code more or less the same way as you used to, it will be painful. So take it slow and follow along with The Book (https://doc.rust-lang.org/book/). It all makes sense eventually and is very much worth it!

LtdJorge 51 days ago | | |

Fully agree

the__alchemist 52 days ago |

Hell yea! I have been doing it with Cudarc (Kernels) and FFI (cuFFT). Using manual [de]serialization between byte arrays and rust data structs. I hope this makes it lower friction!
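For readers who have not done this by hand, the manual byte round trip looks roughly like the sketch below. `Particle` and the 16-byte little-endian layout are made up for the example; the matching device-side struct would have to agree with this layout separately.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Particle {
    pos: [f32; 3],
    mass: f32,
}

/// Four f32 fields, 4 bytes each.
const PARTICLE_BYTES: usize = 16;

/// Pack the fields as little-endian f32s in declaration order.
fn to_bytes(p: &Particle) -> [u8; PARTICLE_BYTES] {
    let mut out = [0u8; PARTICLE_BYTES];
    for (i, v) in p.pos.iter().chain(std::iter::once(&p.mass)).enumerate() {
        out[i * 4..i * 4 + 4].copy_from_slice(&v.to_le_bytes());
    }
    out
}

/// Inverse of `to_bytes`.
fn from_bytes(b: &[u8; PARTICLE_BYTES]) -> Particle {
    let f = |i: usize| f32::from_le_bytes([b[i * 4], b[i * 4 + 1], b[i * 4 + 2], b[i * 4 + 3]]);
    Particle {
        pos: [f(0), f(1), f(2)],
        mass: f(3),
    }
}
```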

nextaccountic 51 days ago |

https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...

> A GPU kernel runs thousands of threads that all see the same memory at the same time. On a CPU, Rust prevents data races through ownership and borrowing – one mutable reference, no aliases, enforced at compile time. On a GPU, you have 2048 threads per SM, all launched from the same function, all pointing at the same output buffer. The borrow checker was not designed for this.

> cuda-oxide solves the problem in layers. The common case – one thread writes one element – is safe by construction, no unsafe required. The uncommon cases – shared memory, warp shuffles, hardware intrinsics – require unsafe with documented contracts. And the frontier cases – TMA, tensor cores, cluster-level communication – are fully manual, matching the complexity of the hardware they control.
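For comparison, this is what "one writer per element, checked at compile time" looks like on the CPU with plain `std`. It is a sketch for contrast, not cuda-oxide code, and `fill_squares` is a made-up name.

```rust
use std::thread;

/// Each worker receives an exclusive `&mut` to its own chunk, so the borrow
/// checker can prove no two threads write the same element.
fn fill_squares(out: &mut [u64], workers: usize) {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks_mut(0)` panics.
    let chunk = ((out.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        for (c, slice) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, slot) in slice.iter_mut().enumerate() {
                    let i = (c * chunk + j) as u64;
                    *slot = i * i;
                }
            });
        }
    });
}
```

This works because the buffer is split before the threads start. A GPU kernel instead starts every thread with the same arguments, which is the mismatch the quoted docs describe.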

That's.. not really Rusty. In Rust, we create new safe abstractions when the existing ones don't quite map to the problem at hand. See for example what's done in Rust for Linux

If it's not safe.. what's the point of Rust?

(it's okay to offer unsafe APIs for people that need to squeeze the last bit of performance, but this shouldn't be the baseline)

I compare this with userspace libs for APIs like io_uring and vulkan. designing safe APIs for that stuff is kind of hard (there are even some unsound attempts)

rowanG077 52 days ago |

Personally I really don't want new GPU languages that do not have AD as a first class citizen. I mean rust is an improvement over C++ CUDA but still.

erk__ 52 days ago | |

There is actually work on adding autodiff to Rust, maybe not really as a first-class citizen, but at least built in: https://doc.rust-lang.org/std/autodiff/index.html (it is still at a pre-RFC stage, so it is not something that will be added soon)
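For anyone unfamiliar with the idea, here is a hand-rolled forward-mode sketch using dual numbers. It deliberately does not use `std::autodiff`; all names in it are made up for illustration.

```rust
use std::ops::{Add, Mul};

/// A value together with its derivative with respect to the input.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    val: f64,
    der: f64,
}

impl Dual {
    /// The input variable: d(x)/dx = 1.
    fn variable(x: f64) -> Self {
        Dual { val: x, der: 1.0 }
    }

    /// A constant: d(c)/dx = 0.
    fn constant(c: f64) -> Self {
        Dual { val: c, der: 0.0 }
    }
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, r: Dual) -> Dual {
        Dual { val: self.val + r.val, der: self.der + r.der }
    }
}

impl Mul for Dual {
    type Output = Dual;
    // Product rule: (uv)' = u'v + uv'.
    fn mul(self, r: Dual) -> Dual {
        Dual { val: self.val * r.val, der: self.der * r.val + self.val * r.der }
    }
}

/// f(x) = x^2 + 3x, written once and evaluated on dual numbers.
fn f(x: Dual) -> Dual {
    x * x + Dual::constant(3.0) * x
}

/// Derivative of `f` at `x`, obtained as a by-product of evaluating it.
fn derivative_at(x: f64) -> f64 {
    f(Dual::variable(x)).der
}
```

Here `derivative_at(2.0)` evaluates f'(x) = 2x + 3 at x = 2 without anyone writing the derivative by hand, which is the part people want the compiler to do for them.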

magnio 52 days ago | | |

Incredible, I have never heard of std::autodiff before. Isn't it rare for a programming language to provide AD within the standard library? Even Julia doesn't have it built-in, I wouldn't expect Rust, of all languages, to experiment with it in std.

rowanG077 52 days ago | | |

That's awesome, I didn't know that!

TallGuyShort 52 days ago | |

Sorry, what is AD in this context?

edit: oh, automatic differentiation?

huflungdung 52 days ago | | |

Active Directory

the__alchemist 52 days ago | |

This isn't a new GPU language; it's a lib which might replace FFI and third party libs.

rowanG077 52 days ago | | |

This is definitely not just a lib. This compiles rust to CUDA. If you call a full on compiler stack a lib, everything may as well be a lib.

vimarsh6739 52 days ago | |

Really hard to find alternatives to Julia for AD as a first class citizen

hellohello2 52 days ago | | |

I think the parent is mostly referring to solutions like Slang.D

corysama 52 days ago | |

So, https://shader-slang.org/ then :)

mathisfun123 52 days ago | |

every GPU related post has a comment which makes my eyes roll all the way back. this is the one for this post.

economistbob 52 days ago |

So, we have stainless, which means Linux code that never rusted. Now we need someone to make phosphorus so that we can turn rusty code into old iron. Then GPL fans can run Rust boxes, Stainless machines, or future proofed iron work horses.

All software can come in three editions: stainless drivers that were never rusty, oxidized drivers that used Rust on existing code, and Iron editions, where someone converted the Rust back to C using the new phosphoric tool...

Diversity can be our strength.

Making Iron C/C++ code can be called acid washing if it was rusted.

positron26 52 days ago | |

> we need someone

> Then GPL fans can

Checks out

rvz 52 days ago |

This is a bit good for Rust if you want to use the language with CUDA. The problem is, it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries and care about open source.

Continuing from this discussion [0], this only makes it a Rust or a CUDA problem rather than a Python, CUDA and a PyTorch one if there is a bug in one of them.

Yet at the end of the day, it still uses Nvidia's closed source CUDA compiler 'nvcc' which they will never open source. At least Mojo promises to open source their own compiler which compiles to different accelerators with multiple backend support.

Unlike this...but uses Rust.

[0] https://news.ycombinator.com/item?id=48067228

zghst 52 days ago |

AWESOME!

paufernandez 51 days ago |

This is solved by Mojo already, they must be rushing something to compete, since Mojo is in version 1.0beta1

whatever1 52 days ago |

Why do we bother with programming languages today? Why not have the LLMs just write assembly code and skip the human readable part? We are not reviewing it anymore anyway.