ML Workload Runs 30x Faster w AVX-512

ML Workload Runs 30x Faster w AVX-512 (parallelprogrammer.substack.com)

3 points by ryandotsmith 256 days ago | 1 comment

I want to try this for my molecular dynamics simulations in Rust. I have the pipeline set up (a core::simd mimic with vector support that doesn't need nightly), but I'm focused on GPU for now since this won't be as fast as that. It's obviously good for the case where the user doesn't have an Nvidia GPU, so I want to add it eventually.

I mean, why have 1 f32 when you can have 16? (The answer is you have to be careful about how you load and unload them, and it can be tedious). Also, the tablet I'm typing on now can only do 8 f32s, which is half as impressive.
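To make the load/unload point concrete, here is a rough sketch in stable Rust. It is not the code from my pipeline or from the article: `LANES`, `mul_add_lanes`, and the multiply-add kernel are names and choices I made up for illustration, and plain `[f32; 16]` arrays stand in for a real vector type. The compiler may or may not turn the inner loop into AVX-512 instructions, depending on target flags.

```rust
/// Hypothetical lane count: 16 f32s fit in one 512-bit register.
/// A 256-bit machine would use 8 instead.
const LANES: usize = 16;

/// Computes out[i] = a[i] * b[i] + c[i], working on LANES elements at a time.
fn mul_add_lanes(a: &[f32], b: &[f32], c: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && b.len() == c.len() && c.len() == out.len());

    // Largest multiple of LANES that fits in the input.
    let full = a.len() / LANES * LANES;

    for i in (0..full).step_by(LANES) {
        // "Load": copy one block of each input into fixed-size arrays.
        let mut va = [0.0f32; LANES];
        let mut vb = [0.0f32; LANES];
        let mut vc = [0.0f32; LANES];
        va.copy_from_slice(&a[i..i + LANES]);
        vb.copy_from_slice(&b[i..i + LANES]);
        vc.copy_from_slice(&c[i..i + LANES]);

        // Lane-wise arithmetic: every lane is independent of the others.
        let mut vo = [0.0f32; LANES];
        for l in 0..LANES {
            vo[l] = va[l] * vb[l] + vc[l];
        }

        // "Unload": write the block back to the output slice.
        out[i..i + LANES].copy_from_slice(&vo);
    }

    // Tail: leftover elements that do not fill a whole block go one by one.
    for i in full..a.len() {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

The tedious part is mostly the bookkeeping: the block boundaries, the tail, and keeping the data laid out so that a block is contiguous in memory in the first place.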

From the article:

> A 30x Speedup

Yes, that's good, if you don't have a GPU. If you do, you're looking at more than those gains, so the question is: do you have the time to implement this and make sure it's working correctly? (Maybe the language or toolkit you're using has low-friction abstractions. Rust doesn't. I made some helpers, but they're still not great.)
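
On the "make sure it's working correctly" part, one approach is to keep a scalar reference next to the lane-wise version and compare the two. A minimal sketch, with `sum_scalar`, `sum_lanes`, and `approx_eq` as made-up names and a sum reduction as a stand-in kernel (not my actual helpers). The comparison uses a tolerance because a lane-wise reduction adds the floats in a different order, so the result can differ in the last bits.

```rust
/// Hypothetical lane count, matching 16 f32s per 512-bit register.
const LANES: usize = 16;

/// Scalar reference: adds the elements in order.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Lane-wise sum: LANES independent accumulators, reduced at the end.
fn sum_lanes(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut chunks = xs.chunks_exact(LANES);

    // Each full block adds one element into each accumulator lane.
    for chunk in &mut chunks {
        for l in 0..LANES {
            acc[l] += chunk[l];
        }
    }

    // Horizontal reduction of the lanes, then the scalar tail.
    let mut total: f32 = acc.iter().sum();
    for &x in chunks.remainder() {
        total += x;
    }
    total
}

/// Relative comparison with a floor of 1.0 so values near zero still compare sanely.
fn approx_eq(a: f32, b: f32, rel_tol: f32) -> bool {
    (a - b).abs() <= rel_tol * a.abs().max(b.abs()).max(1.0)
}
```

In a test you would run both on the same input and assert `approx_eq(sum_scalar(&xs), sum_lanes(&xs), 1e-3)`, picking the tolerance to suit the kernel. It doesn't remove the friction, but it catches indexing and tail-handling mistakes early.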