ML Workload Runs 30x Faster w AVX-512 (parallelprogrammer.substack.com)
I mean, why have 1 f32 when you can have 16? (The answer is that you have to be careful about how you load and unload them, and it can be tedious.) Also, the tablet I'm typing on now can only do 8 f32s, which is half as impressive.
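To make the "careful about how you load and unload them" part concrete, here is a rough sketch in plain stable Rust, not the article's code. It uses no intrinsics: it just processes a slice in groups of 16 f32s and handles the leftover elements separately, which is the shape a compiler can often auto-vectorize. The names `LANES` and `sum_f32x16` are made up for illustration, and whether this actually compiles down to AVX-512 instructions depends on your target flags and CPU.

```rust
/// Number of f32 values that fit in one 512-bit register (16 x 32 bits).
/// Hypothetical constant for this sketch; a 256-bit target would use 8.
const LANES: usize = 16;

/// Sums a slice by keeping 16 independent partial sums, one per lane.
/// Illustrative helper only, not an API from any library.
pub fn sum_f32x16(data: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];

    // "Load": walk the input in full groups of 16 elements.
    let chunks = data.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of 16.
    let tail = chunks.remainder();

    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }

    // "Unload": collapse the 16 lane sums into one scalar.
    let mut total: f32 = acc.iter().sum();

    // The tedious part: the tail has to be handled one element at a time.
    for &x in tail {
        total += x;
    }
    total
}
```

Note that summing per lane changes the order of the floating-point additions, so the result can differ slightly from a plain left-to-right sum on arbitrary data.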
From the article:
> A 30x Speedup
Yes, that's good! If you don't have a GPU, that is. With one, you are looking at more than those gains, so the question is: do you have the time to implement it and make sure it's working correctly? (Maybe the language or toolkit you're using has low-friction abstractions. Rust doesn't. I made some helpers, but they're still not great.)