Rust and C++ on Floating-Point Intensive Code

Rust and C++ on Floating-Point Intensive Code (reidatcheson.com)

216 points by payasr 6 years ago | 94 comments

I have some experience with this, i.e. ensuring that LLVM optimizes and generates the "best" code! I have been working on target-independent "kernels" for the Rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM to generate machine code similar in quality to hand-written ASM. Granted, this is on integers and not floats, but the same principles should apply.

What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like.

* raw pointers: LLVM is pretty good at optimizing C code. Rust specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)

* no slices: index checking is expensive, not to the CPU, the CPU rarely misses the check branches, but to the optimizer. I've found these are mostly left un-elided, even after inlining.

* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).

* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling w/o. Maybe if all the other items here are also done...

* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.

* And to get safety back: check, check, and check before calling the fast kernel! I.e., wrap the kernel function in one that does all the checks elided in the kernel.

Source: Wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x!
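
A minimal sketch of the "checked wrapper around an unchecked kernel" pattern described above. It is not taken from the rav1e PR; the function names `sad` and `sad_kernel` and the choice of a sum-of-absolute-differences kernel are assumptions made for illustration only.

```rust
/// Unchecked kernel (hypothetical example): sum of absolute differences.
///
/// # Safety
/// `a` and `b` must each be valid for reads of `len` bytes.
unsafe fn sad_kernel(a: *const u8, b: *const u8, len: usize) -> u32 {
    // SAFETY: the caller guarantees both pointers cover `len` bytes.
    unsafe {
        let mut acc = [0u32; 4];
        let mut i = 0;
        // Manually unrolled by four, with independent accumulators.
        while i + 4 <= len {
            acc[0] += (*a.add(i)).abs_diff(*b.add(i)) as u32;
            acc[1] += (*a.add(i + 1)).abs_diff(*b.add(i + 1)) as u32;
            acc[2] += (*a.add(i + 2)).abs_diff(*b.add(i + 2)) as u32;
            acc[3] += (*a.add(i + 3)).abs_diff(*b.add(i + 3)) as u32;
            i += 4;
        }
        let mut total = acc[0] + acc[1] + acc[2] + acc[3];
        // Scalar tail for lengths that are not a multiple of four.
        while i < len {
            total += (*a.add(i)).abs_diff(*b.add(i)) as u32;
            i += 1;
        }
        total
    }
}

/// Safe wrapper: performs every check the kernel skips, once, up front.
fn sad(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    assert!(a.len() <= (u32::MAX / 255) as usize, "sum could overflow u32");
    // SAFETY: both slices are valid for `a.len()` bytes (checked above).
    unsafe { sad_kernel(a.as_ptr(), b.as_ptr(), a.len()) }
}
```

Callers only ever see `sad`, so the unsafety stays confined to one small function whose preconditions are checked at a single call site.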

gameswithgo 6 years ago | |

the first-order reason Rust and C++ differ in the article is that Rust will not pass the -ffast-math flag to LLVM, not any of this stuff.

diamondlovesyou 6 years ago | | |

Be that as it may, I wasn't talking about C++. Also, the integer operation optimizations require no such flags on either side, and yet the resulting machine code was still very poor for Rust.

dochtman 6 years ago | |

Not sure I understand the part about raw pointers. As far as I understand, Rust references will surely turn into pointers at the LLVM IR level?

comex 6 years ago | | |

They do. The issue may be that references don't support pointer arithmetic – that is, given `x: &u32` you cannot get `x[1]`. The normal approach is to use a reference to a slice, `x: &[u32]`, and the default bounds-checked indexing. Rust does let you do unchecked indexing on slices (with `unsafe`), but it may be more efficient to avoid indices entirely in favor of pointer arithmetic. LLVM can often optimize index calculations into pointer arithmetic, but not always.

Edit: Although, on rereading the above post, I see diamondlovesyou did mention indices, so... not really sure what's going on.

diamondlovesyou 6 years ago | | |

True. I should have explained that better. If you're accessing an array, you'll have to do the pointer offset-ing yourself (otherwise you're using a slice and all the checking that entails). Thus, you can't use a reference type, because reference types don't have `<*const T>::add` like pointers do (also, `&T as usize` is invalid; you have to go through a pointer type first). I suppose I assumed you'd be accessing an array/slice; references otherwise are fine.

devit 6 years ago | | |

Rust references should in general optimize better because they give stronger aliasing guarantees.

Even for slices, using get_unchecked(1..) to get a smaller subslice without bounds checking might be better than pointer arithmetic as long as the slice lengths get optimized away (i.e. they are never used and never passed to non-inlined functions).
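
A small sketch of the re-slicing idea, assuming a simple summation loop; the function name `sum_by_reslicing` is made up for the example. Whether the slice length is actually optimized away depends on the surrounding code and the compiler version.

```rust
/// Walks a slice by repeatedly taking an unchecked subslice instead of
/// doing raw pointer arithmetic.
fn sum_by_reslicing(mut rest: &[u32]) -> u32 {
    let mut total = 0u32;
    while !rest.is_empty() {
        // SAFETY: `rest` is non-empty, so index 0 and the range `1..`
        // are both in bounds.
        unsafe {
            total = total.wrapping_add(*rest.get_unchecked(0));
            rest = rest.get_unchecked(1..);
        }
    }
    total
}
```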

davidhyde 6 years ago | | |

You can use iterators to avoid unnecessary bounds checking on every element in a slice and you can still get an index to the value. Something like this:

```
for (i, val) in my_slice.iter().enumerate() {
    let x = *val + 9999;
}
```
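
A slightly fuller sketch of the same idea, with hypothetical function names. `zip` covers the two-slice case, where indexing both slices by `i` would otherwise need a bounds check on each.

```rust
/// Dot product using `zip`, so neither slice is indexed explicitly.
/// Iteration stops at the end of the shorter slice.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// `enumerate` still provides the index when it is needed.
fn first_above(values: &[f64], threshold: f64) -> Option<usize> {
    for (i, v) in values.iter().enumerate() {
        if *v > threshold {
            return Some(i);
        }
    }
    None
}
```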

nine_k 6 years ago | |

To sum up: LLVM is heavily optimized for C code, not so much for Rust. So Rust code has to imitate certain C mannerisms for the optimizer to kick in.

You have to pay the price of compatibility with the existing toolchain even if it's not your explicit goal.

oconnor663 6 years ago | | |

I'm not sure this is the optimizer's fault. If I'm doing a bounds-checked array access that might panic, that panic is an observable side effect that cannot be reordered with respect to other observable side effects. That puts constraints on what the optimizer can do, not because it's not smart enough, but because it has to respect the meaning of the program.
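
To illustrate the point, here is a sketch with made-up function names. In the first version each iteration may panic after earlier writes have already happened, so the checks cannot simply be moved. In the second, the only possible panic is hoisted in front of the loop by re-slicing, which often (though not always) lets the optimizer drop the per-element checks.

```rust
/// Every index may panic, and a panic at iteration `i` must happen
/// after the writes of iterations `0..i` are visible.
fn scale_checked_each(dst: &mut [f32], src: &[f32], n: usize) {
    for i in 0..n {
        dst[i] = src[i] * 2.0;
    }
}

/// Re-slicing to exactly `n` elements panics up front, before any write,
/// if either slice is too short. Inside the loop, `i < n` then implies
/// that both indices are in bounds.
fn scale_checked_once(dst: &mut [f32], src: &[f32], n: usize) {
    let dst = &mut dst[..n];
    let src = &src[..n];
    for i in 0..n {
        dst[i] = src[i] * 2.0;
    }
}
```

For valid inputs the two functions produce the same result; they differ only in what has been written by the time an out-of-bounds panic fires.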

gameswithgo 6 years ago | | |

there are aspects of Rust that are easier for LLVM to optimize than C. there are also things Rust isn't doing yet that will let LLVM optimize it better.

the primary reason THIS code didn't optimize as well is that you can't pass -ffast-math to LLVM from Rust in any useful way (yet)

keldaris 6 years ago | |

Can anyone elaborate a bit on why this PR has such terrible compile times? As someone with only a passing familiarity with Rust (I do most of my work in C/C++ and generally expect sub-10s compile times), I would naively expect that writing more C-like code which uses fewer of Rust's language features would compile faster or at least not considerably slower. What's going on there?

kbenson 6 years ago | | |

The biggest Rust language feature is the borrow checker IMO, and unless you're running in unsafe, that's still doing its job. I always assumed the slowness was more based on the borrow checker analysis than on the error wrapping (option) stuff, slices, or functional bits.

pcwalton 6 years ago | |

Sounds like it's fair to say most of this boils down to bounds checks and SIMD?

IshKebab 6 years ago | |

> index checking is expensive

Seems a little gung ho to disable guaranteed index checking in a video codec, no? I know you still do the checks, but it sounds like it's not in a statically guaranteed way.

MaxBarraclough 6 years ago | | |

Surely tight loops in performance-sensitive code are precisely where you'd consider disabling runtime checks, no?

You can still have them enabled for debugging, at least. (Something not generally possible in C/C++, sadly.)
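
One way to get that behaviour in Rust is `debug_assert!`, which is compiled out when debug assertions are disabled (the default for release builds). A sketch, with a hypothetical function name:

```rust
/// Sums `data[i]` for each `i` in `indices`.
///
/// # Safety
/// Every value in `indices` must be less than `data.len()`.
unsafe fn gather_sum(data: &[f32], indices: &[usize]) -> f32 {
    let mut total = 0.0f32;
    for &i in indices {
        // Checked in debug builds only; no cost in a default release build.
        debug_assert!(i < data.len(), "index {i} out of bounds");
        // SAFETY: the caller guarantees `i < data.len()`.
        total += unsafe { *data.get_unchecked(i) };
    }
    total
}
```

The function is marked `unsafe` because in release builds nothing verifies the indices; the debug assertion is a development aid, not a safety guarantee.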

lovasoa 6 years ago |

It looks like what the author was looking for is [1]

    f64::mul_add(self, a: f64, b: f64) -> f64

Adding it to the code does allow LLVM to generate the "vfma" instruction, but it didn't significantly improve performance, on my machine at least.

    $ ./iterators 1000
    Normalized Average time = 0.0000000011943495282455513
    sumb=89259.51980374461

    $ ./mul_add 1000
    Normalized Average time = 0.0000000011861410852805122
    sumb=89259.52037960211

Maybe the performance gap is not due to what the author thought...

[1] https://doc.rust-lang.org/std/primitive.f64.html#method.mul_...
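
For readers who have not used it, a sketch of what such a change can look like. The dot-product loop and function names are assumptions for illustration, not the article's code. Note that `mul_add` only maps to a hardware instruction when the target has FMA enabled (for example with `-C target-cpu=native` on a CPU that supports it); otherwise it may fall back to a slower software routine.

```rust
/// Separate multiply and add: two roundings per element.
fn dot_plain(a: &[f64], b: &[f64]) -> f64 {
    let mut acc = 0.0;
    for (x, y) in a.iter().zip(b) {
        acc += x * y;
    }
    acc
}

/// Fused multiply-add: computes `x * y + acc` with a single rounding.
fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    let mut acc = 0.0;
    for (x, y) in a.iter().zip(b) {
        acc = x.mul_add(*y, acc);
    }
    acc
}
```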

gpderetta 6 years ago | |

Hum, did the program get vectorized?

lovasoa 6 years ago | | |

As I said, the compiler did generate FMA instructions. These are SIMD instructions, so yes, the program was vectorized.

lovasoa 6 years ago | | |

You can see the full compiler output here:

https://rust.godbolt.org/z/FbDqye

leni536 6 years ago |

So the difference basically boils down to -ffast-math, right? Is there an equivalent in Rust?

Edit: After some search I found these:

https://github.com/rust-lang/rust/issues/21690

https://doc.rust-lang.org/core/intrinsics/fn.fadd_fast.html

Writing a wrapper around f64 that uses these intrinsics shouldn't be too hard. I don't program in Rust though.

dhruvdh 6 years ago |

It's very easy to do FMAs using .mul_add() on floats in Rust, which the author didn't seem to know about.

magicalhippo 6 years ago | |

Ideally the compiler should be able to do this by itself though, at least with the appropriate flag to enable it.

paulddraper 6 years ago | | |

FMA isn't a safe optimization as it can give different results.

C++ compilers have flags to enable it globally. gcc and clang include the optimization in -Ofast.

Rust allows you to choose at a code level (but usually people don't know about it). Perhaps it should also have a global fast-math flag that would automatically optimize it. Pros and cons to that.

jacobolus 6 years ago | | |

If you’re doing fiddly numerical work, this must definitely be optional, as swapping separate multiplication and addition for FMA (or vice versa) can compromise correctness. In some cases you need two different algorithms if FMA is present or absent.
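
A small worked example of the results diverging, with inputs chosen so that the exact product needs more bits than an `f64` holds. The function name is made up for the illustration.

```rust
/// Returns `(separate, fused)` for `x * x - c`, where
/// x = 1 + 2^-30 and c = 1 + 2^-29 (both exactly representable).
fn fma_demo() -> (f64, f64) {
    let eps = 1.0 / ((1u64 << 30) as f64); // exactly 2^-30
    let x = 1.0 + eps;
    let c = 1.0 + 2.0 * eps;

    // Exact x * x is 1 + 2^-29 + 2^-60. The 2^-60 term is lost when the
    // product is rounded to f64, so the subtraction yields 0.
    let separate = x * x - c;

    // The fused operation rounds only once, after the subtraction,
    // so the 2^-60 term survives.
    let fused = x.mul_add(x, -c);

    (separate, fused)
}
```

Here `separate` is `0.0` and `fused` is `2^-60`: a tiny difference in absolute terms, but one that changes the outcome of any algorithm testing the result against zero.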

ibotty 6 years ago | |

Care to rewrite the program with `.mul_add()`?

lovasoa 6 years ago | | |

I did rewrite the code with mul_add, and didn't see any significant performance improvement. See my comment above.

brandmeyer 6 years ago |

This is interesting to see. But if I'm going to compare numerical C++ against numerical Rust, then I would be using a higher-level library for the comparison. What is Rust's Other Leading Brand (TM) for the Eigen C++ library?

steveklabnik 6 years ago | |

That comparison (I’m not familiar enough with Eigen to truly say) is going to change over time too; once const generics lands (which is proceeding, finally) the APIs for numerics libraries are going to be significantly different in Rust.

nravic 6 years ago | |

There's a bunch of Eigen analogues in Rust, which is slightly frustrating. ndarray is pretty great though

toolslive 6 years ago |

before you look at speed, did you verify you get the exact same math results in Rust and C++ (and for each compiler and platform)? For C++ code, I have seen the results of calculations vary across compilers (and flags).

nestorD 6 years ago |

The author ends by noting that FMA would probably have improved the performance of the Rust code.

It is interesting to note that, whereas most -ffast-math optimizations trade precision for reduced computing time, adding an FMA can only improve the precision of the output (and thus it is a safe optimization).

gameswithgo 6 years ago | |

pedantry: ffast-math does not always trade precision. It simply trades the results being the same as if they were not vectorized. A vectorized sum of floats for instance is more accurate, not less.
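
A sketch of why the compiler needs permission to reorder: floating-point addition is not associative, so a lane-wise (vector-style) sum is a different computation from a sequential one. The function names and the four-lane layout are assumptions for illustration.

```rust
/// Strict left-to-right sum, the order the source code specifies.
fn sum_sequential(v: &[f32]) -> f32 {
    let mut total = 0.0f32;
    for x in v {
        total += *x;
    }
    total
}

/// Four independent partial sums, mimicking what a 4-wide vectorized
/// reduction computes. The additions happen in a different order.
fn sum_lanes(v: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 4];
    let chunks = v.chunks_exact(4);
    let tail = chunks.remainder();
    for c in chunks {
        for k in 0..4 {
            lanes[k] += c[k];
        }
    }
    let mut total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for x in tail {
        total += *x;
    }
    total
}

/// Returns `((a + b) + c, a + (b + c))` for a = 1e16, b = -1e16, c = 1.
fn reassociation_demo() -> (f64, f64) {
    let (a, b, c) = (1e16f64, -1e16f64, 1.0f64);
    ((a + b) + c, a + (b + c))
}
```

On small exactly-representable inputs the two sums agree; `reassociation_demo` returns `(1.0, 0.0)`, showing that regrouping alone can change a result.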

leshow 6 years ago |

What's "almost" algebraic about enum? It can definitely be used to construct sum types, and you can make product types with struct or inline in an enum

uryga 6 years ago | |

my best guess is that you can't do recursive enums without explicit boxing [edit: or other forms of indirection, like &T]¹. so you can't do this:

  enum List<T> {
    Nil,
    Cons(T, List<T>)
  }

instead, you have to box/reference-ify the recursive occurrence:

  enum List<T> {
    Nil,
    Cons(T, Box<List<T>>)
  }

so in certain circumstances it doesn't let you "coproduct" two types together; you might need to modify one a bit, which makes it technically not exactly a coproduct (i think). a bit of a stretch, but it sort of makes sense next to by-reference-only ML languages where you can (co)product anything as you please

(btw, it's the same for recursive products)

---

1 - https://users.rust-lang.org/t/recursive-enum-types/2938/2

steveklabnik 6 years ago | | |

You don't have to box, but you do need some sort of type to make things sized. This is usually a pointer of some kind, but any kind of pointer works. Take references, for example:

  enum List<'a, T> {
    Nil,
    Cons(T, &'a List<'a, T>)
  }
  
  fn main() {
      let list = List::Cons("hello", &List::Nil);
  }

Box is usually chosen because it's a good default choice.

Ygg2 6 years ago | | |

I think that's because Rust types are Sized, but I could be wrong. The first example has size = Infinity, while the second has a constant size.
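
A runnable sketch of the boxed version, to make the "constant size" point concrete. The helper names are made up for the example, and the exact size printed depends on the platform.

```rust
use std::mem::size_of;

/// The boxed list: the recursive occurrence sits behind a pointer,
/// so the enum itself has a fixed, finite size.
enum List<T> {
    Nil,
    Cons(T, Box<List<T>>),
}

/// Counts the elements by following the `Box` pointers.
fn len<T>(list: &List<T>) -> usize {
    let mut n = 0;
    let mut cur = list;
    while let List::Cons(_, next) = cur {
        n += 1;
        cur = &**next;
    }
    n
}

/// Size in bytes of one `List<u64>` node, regardless of list length.
fn list_size_bytes() -> usize {
    size_of::<List<u64>>()
}
```

However long the list grows, `list_size_bytes()` stays the same, because each node stores only its element and a pointer to the next node.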

leshow 6 years ago | | |

Thanks for commenting, that's probably it. I was aware of the requirement for explicit box-ing but it didn't immediately come to mind.

vkaku 6 years ago |

Okay, can someone give an explanation for why Rust does not mimic the -Ofast behavior? Is this something they plan to add?

steveklabnik 6 years ago | |

It leads to undefined behavior in safe code in the general case.

We may add a wrapping type, similar to what we do for integer behavior. But in general, adding flags to change major behavior is not something we do.
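
For reference, the existing integer precedent is `std::num::Wrapping`, which opts into wrap-around arithmetic per value through the type rather than through a compiler flag. This sketch shows only that integer case; it says nothing about what a floating-point equivalent would look like, and the `checksum` function is a made-up example.

```rust
use std::num::Wrapping;

/// Byte-wise checksum that wraps on overflow by construction,
/// in both debug and release builds.
fn checksum(bytes: &[u8]) -> u8 {
    let mut acc = Wrapping(0u8);
    for &b in bytes {
        acc += Wrapping(b);
    }
    acc.0
}
```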

Myrmornis 6 years ago |

This is an extremely clear article and apparently well-executed experiment.

rathinmadhu 6 years ago |

Excellent

rathinmadhu 6 years ago |

Super