Rust and C++ on Floating-Point Intensive Code(reidatcheson.com) |
Rust and C++ on Floating-Point Intensive Code(reidatcheson.com) |
What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like.
* raw pointers: LLVM is pretty good at optimizing C code. Rust specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)
* no slices: index checking is expensive, not to the CPU, the CPU rarely misses the check branches, but to the optimizer. I've found these are mostly left un-elided, even after inlining.
* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).
* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling w/o. Maybe if all the other items here are also done...
* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.
* And to get safety back: check, check, and check before calling the fast kernel! Ie wrap the kernel function with one that does all the checks elided in the kernel.
Source: Wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x!
Edit: Although, on rereading the above post, I see diamondlovesyou did mention indices, so... not really sure what's going on.
Even for slices, using get_unchecked(1..) to get a smaller subslice without bounds checking might be better than pointer arithmetic as long as the slice lengths get optimized away (i.e. they are never used and never passed to non-inlined functions).
``` for (i, val) in my_slice.iter().enumerate() { let x = *val + 9999; } ```
You have to pay the price of compatibility with the existing toolchain even if it's not your explicit goal.
the primary reason THIS code didn't optimize as well is that you can't pass ffast-math to llvm from rust in any useful way (yet)
Seems a little gung ho to disable guaranteed index checking in a video codec no? I know you still do the checks, but it sounds like it's not in a statically guaranteed way.
You can still have them enabled for debugging, at least. (Something not generally possible in C/C++, sadly.)
f64::mul_add(self, a: f64, b: f64) -> f64
Adding it to the code indeed allows the LLVM to generate the "vfma" instruction. But it didn't significantly improve performance, on my machine at least. $ ./iterators 1000
Normalized Average time = 0.0000000011943495282455513
sumb=89259.51980374461
$ ./mul_add 1000
Normalized Average time = 0.0000000011861410852805122
sumb=89259.52037960211
Maybe the performance gap is not due to what the author thought...[1] https://doc.rust-lang.org/std/primitive.f64.html#method.mul_...
Edit: After some search I found these:
https://github.com/rust-lang/rust/issues/21690
https://doc.rust-lang.org/core/intrinsics/fn.fadd_fast.html
Writing a wrapper around f64 that uses these intrinsics shouldn't be too hard. I don't program in Rust though.
C++ compilers have flags to enable it globally. gcc and clang include the optimization in -Ofast.
Rust allows you to choose at a code level (but usually people don't know about it). Perhaps it should also have a global fast-math flag that would automatically optimize it. Pros and cons to that.
It is interesting to note that, whereas most ffast-math optimization will trade precision for reduced computing time, adding an FMA can only improve the precision of the output (and thus it is a safe optimization).
enum List<T> {
Nil,
Cons(T, List<T>)
}
instead, you have to box/reference-ify the recursive occurrence: enum List<T> {
Nil,
Cons(T, Box<List<T>>)
}
so in certain circumstances it doesn't let you "coproduct" two types together, you might need to modify one a bit, which makes it a technically-not-exactly-a-coproduct (i think). a bit of a stretch but it sort of makes sense next to a by-reference-only ML langs where you can (co)product anything as you please(btw, it's the same for recursive products)
---
1 - https://users.rust-lang.org/t/recursive-enum-types/2938/2
enum List<'a, T> {
Nil,
Cons(T, &'a List<'a, T>)
}
fn main() {
let list = List::Cons("hello", &List::Nil);
}
Box is usually chosen because it's a good default choice.We may add a wrapping type, similar to what we do for integer behavior. But in general, adding flags to change major behavior is not something we do.
For this example you'd probably want -fassociative-math and not the other stuff that may result in incorrect code. -ffast-math was not actually used in the clang compilation and it's possible that the -fvectorize that was used picks a sensible mix of options.
Here's a preliminary discussion:
https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way...
Trying to do an RFC process for this could be useful. The rust development process seems to be pretty good at thinking deeply about these things.
As a C++ programmer who routinely uses fast-math "until something breaks" with DSP code, I would find that capability very attractive.
writing a wrapper is not trivial, at all.
what Rust could do though, is add a wrapped float type, that the compiler will forward to llvm saying "you can ffast-math these" That is the approach Rust tends to take for these kinds of things, though no plans are in the works to do this to make floats vectorizeable yet. Maybe we should start such plans?
It's worth explaining why. So far as I understand the situation, dependencies may assume and depend on the absence of ffast-math. Enabling it globally for a compilation is generally considered to be a footgun for this reason.
In my experience -Ofast/-ffast-math yields very impressive results for FP code.
If you can tolerate platform-specific variation in the trailing parts of floating point numbers, it's wonderful.
Also depends how much it affects performance. It it doubles it, ok I'll accept the risk. 10% faster? No thanks I'd rather play it safe.
AFAIK this does not work atm due to a codegen bug in llvm (which can also affect code using restrict in C in some cases). This bug will get fixed one day, but most likely another bug will be revealed at this point… This part of LLVM was never really used in C as much as in Rust, so they keep finding bugs in it. Hopefully it will get fine in the long run, but I'm not holding my breath.
I mean maybe? But I probably wouldn't anyway for this case as a slice reference is actually a "fat" pointer (ie twice the size of a normal pointer) and the length of the slice won't be used (the block size is known per kernel); LLVM might delete the length part anyway.
These are automatically generated kernels, so readability isn't the primary concern here.
GCC emits FMA instructions at -O2 without -ffast-math.
Edit: Ok actually it sounds like it could literally break some algorithms, see https://news.ycombinator.com/item?id=21342974
This means that to fully utilizes FMA you need to unroll a loop more. Sometimes yyou just can't, and the other time you use more instructions, use more cache.
In short it's not always better.
Also as other said, FMA has better accuracy than separate Add + Mul
e.g. https://www.cs.cmu.edu/~quake/robust.html
Some keywords to look for: “compensated arithmetic”, “error-free transformations”.
FMAs generally speed up these tools, but you need to be careful and deliberate about how they are used.
(Disclaimer: I am not an expert on this, just some guy on the internet.)
"The program" contains two hot loops. Judging from the assembly code you linked in a sibling comment, only the second of these loops was vectorized, the first one wasn't. This slower non-vectorized loop will still dominate execution time.
And for whatever it's worth, dropping the original article's
for i in 0..n{
b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i])
}
in place of your loop using iterators and FMA, you still get nicely vectorized code (though without FMA) for this loop: .LBB6_120:
vmovupd zmm2, zmmword ptr [r12 + 8*rcx]
vmovupd zmm3, zmmword ptr [r12 + 8*rcx + 64]
vmovupd zmm4, zmmword ptr [r12 + 8*rcx + 128]
vmovupd zmm5, zmmword ptr [r12 + 8*rcx + 192]
vmovupd zmm6, zmmword ptr [r14 + 8*rcx]
vmovupd zmm7, zmmword ptr [r14 + 8*rcx + 64]
vmovupd zmm8, zmmword ptr [r14 + 8*rcx + 128]
vmovupd zmm9, zmmword ptr [r14 + 8*rcx + 192]
vmulpd zmm10, zmm2, zmmword ptr [rbx + 8*rcx]
vsubpd zmm6, zmm6, zmm10
vmulpd zmm10, zmm3, zmmword ptr [rbx + 8*rcx + 64]
vsubpd zmm7, zmm7, zmm10
...
Neither FMA nor iterators make a difference for whether this loop is vectorized, so any speedups are necessarily limited.If it's OK I'll link to this comment as the inspiration.
On the iterators versus loop: for some reason when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
See the generated assembler here: https://rust.godbolt.org/z/G5A2u0
also, sometimes a SIMD instruction is used but only on 1 lane at a time. this is actually common with floating point code.
https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
I've tried using mul_add, but at the moment performance isn't much better. But I also noticed someone else on my machine running a big parallel build, so I'll wait a little later and run the full sweep over the problem sizes with mul_add.
So really the existence of FMA didn't have a performance implication it seems except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" will also allow vectorization of reductions.
Floating point code is really difficult to do correctly. LLVM doesn't actually model IEEE 754 correctly yet, although hopefully the remaining issues with the constrained intrinsics will be fixed by the end of the year (even then, sNaN support is likely to still be broken for some time).
What makes floating point more difficult than integers is two things. First, there is an implicit dependency on the status register that greatly inhibits optimization, yet very few users actually care about that dependency. The second issue is that there is many more properties that matter for floating point. For an integer, you essentially only care about three modes: wrapping (w/ optional carry flag), saturating, and no-overflow. Exposing these as separate types exhaustively is easy. For floating point, you have orthogonal concerns of rounding mode (including dynamic), no-NaN, no-infinity, denormals (including flush-to-zero support), contraction to form FMA, reciprocal approximation, associative, acceptable precision loss (can I use an approximate inverse sqrt?), may trap, and exception sticky bits. Since they're orthogonal, that means that instead of a dozen types, you need a few thousand types to combine them, although many combinations are probably not going to be too interesting.
fastmath is probably different anyway, as the "breaking" is on a floating point logic level, as in: results become imprecise, but not exactly "wrong" - as in undefined behaviour. but i don't know fastmath, so i might be wrong.
On the /r/rust thread, folks provided examples of why fastmath would produce UB in safe Rust.
The problem is mentionned in the Wikipedia page of the Kahan summation (and I have been able to reproduce it with gcc) : https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Poss...
This is actually my area of research, I could contribute if you point me to an RFC.
The RFC about that is https://github.com/rust-lang/rfcs/pull/2686 , where you see users kind of split into the "I want faster binaries" and "I want more deterministic execution" camps. Neither are wrong TBH.
Some people have tried to show there that enabling FP-contraction by default isn't always better / more precise, but I'm not sure if they succeeded.
I think associativity is necessary to vectorize reduction operations like:
r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
I haven't looked at the code generated by ICC, but I would expect it to vectorize this by computing tuples of "partial sums", roughly as follows: r0 += (c[i+0]-a[i+0]*b[i+0])*a[i+0]*(c[i+0]-a[i+0]*b[i+0]);
r1 += (c[i+1]-a[i+1]*b[i+1])*a[i+1]*(c[i+1]-a[i+1]*b[i+1]);
r2 += (c[i+2]-a[i+2]*b[i+2])*a[i+2]*(c[i+2]-a[i+2]*b[i+2]);
...
and then doing a horizontal sum r = r0 + r1 + r2 + ... in the end. But this requires associativity. (And commutativity, but that's a given.)As illustrated in the Wikipedia page, if move terms according to associativ rules you can show that one of the term is always zero and conclude that it is useless, dropping it from the computation.
However, in practice, floating-points are not associativ and that term contains the numerical error of the previous operation.