Beware of fast-math (simonbyrne.github.io)

189 points by simonbyrne 4 years ago | 107 comments

Fun fact: when working on Herbie (http://herbie.uwplse.org), our automated tool for reducing floating-point error by rearranging your mathematical expressions, we found that fast-math often undid Herbie's improvements. In a sense, Herbie and fast-math are opposites: one makes code more accurate (sometimes slower, sometimes faster), while the other makes code faster (sometimes less accurate, sometimes more).
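A classic illustration of the kind of rearrangement being discussed here (a hand-written sketch, not actual Herbie output): sqrt(x+1) - sqrt(x) loses most of its digits for large x, while the algebraically equivalent 1/(sqrt(x+1) + sqrt(x)) does not. The function names and the choice of x are assumptions made for this example.

```python
import math
from decimal import Decimal, getcontext

def naive(x):
    # Subtracts two nearly equal square roots, so most digits cancel.
    return math.sqrt(x + 1) - math.sqrt(x)

def rewritten(x):
    # Algebraically identical, but avoids the subtraction entirely.
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

def reference(x):
    # 50-digit reference value computed with the decimal module.
    getcontext().prec = 50
    return Decimal(x + 1).sqrt() - Decimal(x).sqrt()

def rel_err(approx, exact):
    # Relative error of a float result against the decimal reference.
    return abs((Decimal(approx) - exact) / exact)

x = 10**15
exact = reference(x)
naive_err = rel_err(naive(x), exact)
rewritten_err = rel_err(rewritten(x), exact)
print(f"naive relative error:     {float(naive_err):.2e}")
print(f"rewritten relative error: {float(rewritten_err):.2e}")
```

A compiler that is allowed to treat floating-point arithmetic as real arithmetic is free to transform one form into the other, which is how a careful rewrite can be undone.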

kazinator 4 years ago | |

If you have a program which finds some order of items in order to optimize something, and then you introduce some confounding technology which scrambles the order of the items afterward, that's indistinguishable from introducing bugs into that optimizing program; the output no longer implements the optimization requirements that the program tries to ensure.

I don't see how Herbie's accuracy improvements could not be undone, if Herbie's output is fed to a back-end which doesn't preserve Herbie's order of operations as Herbie requires and depends on.

Quekid5 4 years ago | |

That's a real "fun fact" kind of thing. Love it.

Of course, the 'real' solution is actual Numerical Analysis (as I'm sure you know) to keep the error properly bounded, but it's really interesting to have a sort of middle ground which just stabilizes the math locally... which might be good enough.

Other fun fact: Numerical Analysis is a thing that's really hard to imagine unless you happen to be introduced to it during an education. It's so obviously a thing once you've heard of it, but REALLY hard to come up with ex nihilo.

simonbyrne 4 years ago | |

Herbie is a great tool, especially for teaching.

shoo 4 years ago | |

Thank you for sharing the link to Herbie, that looks like a useful tool.

If I follow at high level, it looks like Herbie is trying to rewrite expressions to minimise error without runtime performance constraints.

Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?

I guess compilers are generally focused on the latter problem, perhaps without giving the user much control over the degree of error they are willing to tolerate.

mattpharr 4 years ago | | |

> Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?

There are! See followup work by @pavpanchekha and others on "Pherbie", which finds a set of Pareto-optimal rewritings of a program so that it's possible to trade-off error and performance: https://ztatlock.net/pubs/2021-arith-pherbie/paper.pdf.

headPoet 4 years ago |

-funsafe-math-optimizations always makes me laugh. Of course I want fun and safe math optimisations

nusaru 4 years ago | |

Personally I’m a fan of Kotlin’s “fun factory()”

seba_dos1 4 years ago | |

If you expect them to be safe, you're in for some fun!

krylon 4 years ago | |

They're not fun and safe, though, they're "fun-safe", so you don't enjoy yourself (too much) while doing math.

SeanLuke 4 years ago |

The other examples he gave trade off significant math deficiencies for small speed gains. But flushing subnormals to zero can produce a MASSIVE speed gain. Like 1000x. And including subnormals isn't necessarily good floating point practice -- they were rather controversial during the development of IEEE 754 as I understand it. The tradeoff here is markedly different than in the other cases.
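For context on what is being traded away, here is a small sketch of one property subnormals provide: if x != y, then x - y is never zero. Python does not flush subnormals itself, so `flush_to_zero` below is a hypothetical helper that only mimics the effect of a flush-to-zero mode.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min  # about 2.2e-308 for IEEE 754 doubles

def flush_to_zero(v):
    # Hypothetical helper mimicking FTZ: subnormal values become 0.0.
    return 0.0 if abs(v) < SMALLEST_NORMAL else v

x = 1.5 * SMALLEST_NORMAL
y = SMALLEST_NORMAL

gradual = x - y                  # a subnormal number, small but nonzero
flushed = flush_to_zero(x - y)   # what a flush-to-zero mode would produce
print(x != y, gradual != 0.0, flushed != 0.0)
```

With flushing, code that guards a division with `if x != y` can still end up dividing by zero.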

dmitrygr 4 years ago |

Contrarian viewpoint: beware of not-fast-math. Making things like atan2f and sqrtf set errno takes you down a very slow path, costing you significant perf in cases where you likely do not want it. And most math will work fine with fast-math, if you are careful how you write it. (Free online numerical methods classes are available, e.g. [1].) Without fast-math most compilers cannot even use FMA instructions (costing you up to 2x in cases where they could be used otherwise) since they cannot prove it will produce the same result. FMA will actually likely produce a more accurate result, but your compiler is handicapped by lack of fast-math to offer it to you.

[1] https://ocw.mit.edu/courses/mathematics/18-335j-introduction...
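To make the FMA point concrete, here is a sketch of why a fused multiply-add can give a different (here, more accurate) answer than a separate multiply and add. The fused operation is emulated with exact rational arithmetic; `fma_emulated` and the input values are assumptions chosen for illustration.

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # Emulates a fused multiply-add: a*b + c is computed exactly,
    # then rounded to a double once at the end.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

unfused = a * a + c            # a*a is rounded to a double before the add
fused = fma_emulated(a, a, c)  # single rounding of the exact result
print(unfused, fused)
```

The exact value of a*a + c is 2^-54. The unfused version loses it when a*a is rounded, which is also why a compiler cannot substitute one form for the other without changing results.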

vchuravy 4 years ago |

Especially the fact that loading a library compiled with GCC and fast math on, can modify the global state of the program... It's one of the most baffling decisions made in the name of performance.

I would really like for someone to take fast math seriously, and to provide well scoped and granular options to programmers. The Julia `@fastmath` macro gets close, but it is too broad. I want to control the flags individually.

There is also the question of how that interacts with IPO/inlining...

mhh__ 4 years ago | |

D (LDC) lets you control the flags on a per function basis.

So one can (and we do at work) have @optmath which is a specific set of flags (just a value we defined at compile time, the compiler recognizes it as a UDA) we want as opposed to letting the compiler bulldoze everything.

physicsguy 4 years ago | | |

You can do that on a per-compiler basis, e.g. with

#pragma GCC optimize("fast-math")

titzer 4 years ago | |

It should always have been a bit in the instruction encoding, never global state.

smitop 4 years ago |

The LLVM IR is more expressive than clang is for expressing fast-math: it supports making an operation use fast-math optimization on a per operation basis (https://llvm.org/docs/LangRef.html#fastmath).

simonbyrne 4 years ago | |

Do you know what happens when you have ops with different flags? e.g. if you have (a + b) + c, where one + allows reassoc but one doesn't?

diamondlovesyou 4 years ago | | |

(a+b)+c has two ops in LLVM: addition is a binop, meaning it has two "arguments", thus (a+b) and adding "c" are separate instructions. You can't directly add three or more values.

dlsa 4 years ago |

Never considered fast-math. I get the sense it's useful but can create awkward and/or unexpected surprises. If I were to use it, I'd have to have a verification test harness as part of some pipeline to confirm no weirdness. Literally a bunch of example canary calculations to determine if fast-math will kill or harm some real use case.

Is this a sensible approach? What are others' experiences around this? I've never bothered with this kind of optimisation and I now vaguely feel like I'm missing out.

I tend to use calculations for deterministic purposes rather than pure accuracy. 1+1=2.1 where the answer is stable and approximate is still better and more useful than 1+1=2.0 but where the answer is unstable. Eg because one of those is 0.9999999 and the precision triggers some edge case.

simonbyrne 4 years ago | |

I tried to lay out a reasonable path: incrementally test accuracy and performance, and only enable the necessary optimizations to get the desired performance. Good tests will catch the obvious catastrophic cases, but some will inevitably be weird edge cases.

As always, the devil is in the details: you typically can't check exact equality, as e.g. reassociating arithmetic can give slightly different (but not necessarily worse) results. So the challenge is coming up with an appropriate measure for determining whether something is wrong.
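A minimal sketch of that testing problem: two mathematically equal sums that differ in the last bit, so an exact-equality check fails while a tolerance-based one passes. The helper `assert_close` and the tolerance of 1e-12 are assumptions; a suitable tolerance depends on the computation.

```python
import math

left_to_right = (0.1 + 0.2) + 0.3
reassociated = 0.1 + (0.2 + 0.3)

def assert_close(actual, expected, rel_tol=1e-12):
    # Hypothetical test helper: accept results within a relative tolerance
    # instead of demanding bit-for-bit equality.
    if not math.isclose(actual, expected, rel_tol=rel_tol):
        raise AssertionError(f"{actual!r} differs from {expected!r}")

print(left_to_right == reassociated)       # exact comparison
assert_close(left_to_right, reassociated)  # tolerance-based comparison
```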

willis936 4 years ago |

I use single precision floating point to save memory and computation in applications where it makes sense. I had a case where I didn't care about the vertical precision of a signal very much. It had a sample rate in the tens of thousands of samples per second. I was generating a sinusoid and transmitting it. On the receiver the signal would become garbled after about a minute. I slapped my head and immediately realized I ran out of precision by using a single precision time value feeding the sin function when t became too large with the small increment.

  sin(single(t)) == bad

  single(sin(t)) == good
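The difference between the two orderings can be reproduced in a rough sketch. Single precision is emulated by round-tripping through `struct`; the time range (just past 100000 seconds, in 1 ms steps) is an assumption for illustration, not the original setup.

```python
import math
import struct

def to_single(x):
    # Round a Python float (a double) to the nearest IEEE 754 single.
    return struct.unpack("f", struct.pack("f", x))[0]

# Time stamps a little past 100000 s in 1 ms steps (illustrative values).
times = [100000.0 + k * 0.001 for k in range(1000)]

# sin(single(t)): the time value is rounded before the sine is taken.
err_single_time = max(abs(math.sin(to_single(t)) - math.sin(t)) for t in times)
# single(sin(t)): the sine is computed in double, then rounded.
err_single_result = max(abs(to_single(math.sin(t)) - math.sin(t)) for t in times)

print(f"sin(single(t)) max error: {err_single_time:.1e}")
print(f"single(sin(t)) max error: {err_single_result:.1e}")
```

At this magnitude a single can only resolve time to about 8 ms, so rounding t first shifts the phase, whereas rounding the result only costs the last bits of the output.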

adgjlsfhk1 4 years ago | |

IMO, sinpi(x)=sin(pi*x) is a better function because it does much better here. The regular trig functions are approximately 20% slower for most implementations in order to accurately reduce mod 2pi, while reducing mod 2 is pretty much trivial.
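A minimal sketch of the idea, assuming a simplified `sinpi` that only does the mod-2 reduction (real implementations handle more cases and are more careful about accuracy near multiples of 1):

```python
import math

def sinpi(x):
    # Reduce x modulo 2 first; math.fmod is exact for floats,
    # so no precision is lost in the reduction.
    r = math.fmod(x, 2.0)
    return math.sin(math.pi * r)

x = 2.0**40  # an even integer, so sin(pi * x) is mathematically 0
print(sinpi(x))
print(math.sin(math.pi * x))
```

In the second call the rounding error in the stored value of pi gets multiplied by x before the sine is taken, so the result is visibly nonzero.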

willis936 4 years ago | | |

I think the real solution I should have adopted is incrementing t like this:

  t = mod(t + ts, 1 / f)

Since I'm just sending a static frequency the range of time never needs to be beyond one period. However, using a double here is far from the critical path in increasing performance and it all runs fast enough anyway.

willis936 4 years ago | | |

Also, thanks. I had not heard of this function before, but apparently it was added to MATLAB in 2018.

zoomablemind 4 years ago |

On the subject of the floating-point math in general, I wonder what's the practical way to treat the extreme order values (close to zero ~ 1E-200, or infinity ~ 1E200, but not zero or inf)? This can take place in some iterative methods, expansion series, or around some singularities.

How reliable is it to keep the extreme orders in the expectation that the respective quantities would cancel the orders properly, yielding a meaningful value (rounding-wise)?

For example, calculating some resulting value function, expressed as

v(x)=f(x)/g(x),

where both f(x) and g(x) are oscillating with a number of roots in a given interval of x.

simonbyrne 4 years ago | |

The key thing about floating point is that it maintains relative accuracy. In your case, say f(x) and g(x) are both O(1e200) and are correct to some small relative tolerance, say 1e-10 (that is, the absolute error is 1e190). Then the relative error for f(x)/g(x) stays nicely bounded at about 2e-10.

However if you do f(x) - g(x), the absolute error is on the order of 2e190: if f(x) - g(x) is small, then now the relative error can be huge (this is known as catastrophic cancellation).
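A numerical sketch of both cases, with made-up values: inputs near 1e200 that each carry a relative error of 1e-10, and a g chosen close to f so that the subtraction cancels heavily.

```python
REL = 1e-10  # assumed relative error in each input

f_true = 1e200
g_true = 1e200 * (1 - 1e-8)  # close to f_true, so f - g cancels heavily

f = f_true * (1 + REL)  # perturbed inputs, as if computed inexactly
g = g_true * (1 - REL)

def rel_err(approx, exact):
    return abs(approx - exact) / abs(exact)

quotient_err = rel_err(f / g, f_true / g_true)
difference_err = rel_err(f - g, f_true - g_true)
print(f"relative error of f/g:   {quotient_err:.1e}")
print(f"relative error of f - g: {difference_err:.1e}")
```

The quotient keeps roughly the relative error of its inputs, while the difference is off by a few percent even though each input was good to ten digits.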

zoomablemind 4 years ago | | |

Both f(x) and g(x) could be calc'ed to proper machine precision (say, GSL double prec ~ 1E-15). Would this imply, that beyond the machine precision the values are effectively fixed to resp. zero or infinity, instead of carrying around the extreme orders of magnitude?

gsteinb88 4 years ago | |

If you can, working with the logarithms of the intermediate large or small values is one way around the issue.

One example talking about this here: http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
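A minimal sketch of the logarithm trick, with illustrative values: multiplying a thousand small probabilities underflows to zero, while summing their logarithms stays comfortably in range.

```python
import math

probs = [1e-5] * 1000  # illustrative values

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 long before the loop ends

# The same quantity in log space is an ordinary-sized number.
log_product = sum(math.log(p) for p in probs)
print(product, log_product)
```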

junon 4 years ago | |

Not very reliable. For precision one usually seeks out the MPFR library or something similar.

zoomablemind 4 years ago | | |

If anyone else wonders: MPFR is a library for multiple-precision floating-point computation with correct rounding. One application domain is interval math [1].

[1]:http://cs.utep.edu/interval-comp/applications.html

adgjlsfhk1 4 years ago | | |

ARB https://arblib.org/ is often better than mpfr.

kzrdude 4 years ago |

It looks like -fassociative-math is "safe" in the sense that it can not be used to get UB in working code? That's a good property to make it easier to use in the right context.

mbauman 4 years ago | |

See the one footnote: you can re-associate a list of 2046 numbers such that they sum to _any_ floating point number between 0 and 2^970.

https://discourse.julialang.org/t/array-ordering-and-naive-s...

StefanKarpinski 4 years ago | | |

To be fair though, as noted further down in that thread, naive left-to-right summation is the worst case here since the tree of operations is as skewed as possible. I think that any other tree shape is better and compilers will tend to make the tree more balanced, not less, when they use associativity to enable SIMD optimizations. So while reassociation is theoretically arbitrarily bad, in practice it's probably mostly ok. Probably.
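The tree-shape point can be sketched by comparing a left-to-right sum with a balanced (pairwise) one. The data is contrived so that the skewed tree loses every small term; `pairwise_sum` is a simple recursive sketch, not what any particular compiler emits.

```python
import math

def naive_sum(xs):
    # Maximally skewed tree: ((((x0 + x1) + x2) + x3) + ...)
    total = 0.0
    for x in xs:
        total += x
    return total

def pairwise_sum(xs):
    # Balanced tree: sum each half recursively, then combine.
    if not xs:
        return 0.0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

data = [1e16] + [1.0] * 1024
exact = math.fsum(data)  # correctly rounded reference

print(naive_sum(data) - exact)     # error of the skewed tree
print(pairwise_sum(data) - exact)  # error of the balanced tree
```

In the naive sum every 1.0 is added to 1e16 on its own and rounded away. In the balanced tree the small terms are mostly combined with each other first.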

evilotto 4 years ago | | |

Floating point math is fun:

  def re_add(a,b,c,d,e):
      return (a+b+c+d+e) == (2 * (c+a+b+d+e))
  
  print(re_add(1e17, -1e17, 3, 2, 1))

simonbyrne 4 years ago | |

Not necessarily: if your cospi(x) function is always returning 1.0 (https://github.com/JuliaLang/julia/issues/30073#issuecomment...), but you wrote your code assuming the result was in a different interval, then you could quite easily invoke undefined behavior.

adgjlsfhk1 4 years ago | |

Yeah. It's safe in that you won't get UB, but it's bad in that you can get arbitrarily wrong answers.

wiml 4 years ago | | |

To be fair, if you're using floating point at all you can get arbitrarily wrong answers. The nice thing about ieee754 conformance is that you can, with a lot of expertise, somewhat reason about the kinds of error you're getting. But for code that wasn't written by someone skilled in numerical techniques, and that's the vast majority of fp code, is fast-math worse than the status quo?

gnufx 4 years ago |

You will generally want at least -funsafe-math-optimizations for performance-critical loops. Otherwise you won't get vectorization at all with ARM Neon, for instance. You also won't get some simple loops vectorized (like products) or generally(?) loop nest optimizations. You just may not be able to afford the maybe order of magnitude cost if your code is bottlenecked on such things (although HPC code actually may well not be).

In my experience much scientific Fortran code, at least, is OK with something like -ffast-math, at least because it's likely to have been used with ifort at some stage, and even with non-754-compliant hardware if it's old enough. Obviously you should check, though, and perhaps confine such optimizations to where they're needed.

BLIS turned on -funsafe-math-optimizations (if I recall correctly) to provide extra vectorization, and still passed its extensive test suite. (The GEMM implementation is possibly the ultimate loop nest restructuring.)

pfdietz 4 years ago |

The link to Kahan Summation was interesting.

https://en.wikipedia.org/wiki/Kahan_summation_algorithm
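For reference, a short sketch of the algorithm. The test data is an assumption chosen so that a plain running sum drops every small term. This is also the kind of code that reassociation can break: treated as real arithmetic, the correction term is always zero and could be optimized away.

```python
import math

def naive_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in xs:
        y = x - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total

data = [1.0] + [1e-16] * 100_000
reference = math.fsum(data)  # correctly rounded sum

print(naive_sum(data) - reference)
print(kahan_sum(data) - reference)
```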

optimalsolver 4 years ago |

"-fno-math-errno" and "-fno-signed-zeros" can be turned on without any problems.

I got a four times speedup on <cmath> functions with no loss in accuracy.

owlbite 4 years ago | |

Unless, of course, you have some algorithm that depends on signed zeros. Which is basically the same with all the optimizations the article complains about.

I'd suggest -ffp-contract=fast is a good idea for 99% of code. It's only going to break things where very specific effort has gone into the numerical analysis, and likely the authors of such things are sufficiently fp-savvy to tell you not to do the thing.

adgjlsfhk1 4 years ago | | |

Is there any algorithm that depends on signed zeros? I'm not aware of any.
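One commonly cited place where the sign of zero carries information is a branch cut, offered here as an illustration rather than a full answer. In the sketch below the sign of a zero imaginary part records which side of the negative real axis a point was approached from.

```python
import math

above = math.atan2(0.0, -1.0)   # just above the negative real axis
below = math.atan2(-0.0, -1.0)  # just below the negative real axis
print(above, below)

# 0.0 == -0.0 compares equal, so the sign has to be read with copysign.
print(0.0 == -0.0, math.copysign(1.0, -0.0))
```

Code that relies on this distinction would see different angles if the compiler were allowed to treat -0.0 and 0.0 as interchangeable.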

jjgreen 4 years ago |

One trick that I happened upon was speeding up complex multiplication (like a factor of 5) under gcc with the --enable-cx-fortran switch.

bruce343434 4 years ago | |

-fcx-fortran-rules

    Complex multiplication and division follow Fortran rules. Range reduction is done as part of complex division, but there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.

    The default is -fno-cx-fortran-rules.

[1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-Option...
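A sketch of the situation the quoted text refers to. This is plain Python working on real and imaginary parts, not GCC's implementation, and `naive_complex_mul` is a hypothetical name. The textbook formula, with no check afterwards, turns an infinite operand into NaN + I*NaN.

```python
import math

def naive_complex_mul(a, b, c, d):
    # (a + b*i) * (c + d*i) via the textbook formula, with no NaN check.
    return (a * c - b * d, a * d + b * c)

inf = math.inf
# (inf + inf*i) * (0 + 1*i): mathematically still an infinite value.
re, im = naive_complex_mul(inf, inf, 0.0, 1.0)
print(re, im)
```

Per the documentation quoted above, the default rules check for this outcome and attempt a rescue, and skipping that check is where the speedup comes from.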

markhahn 4 years ago |

NaN's should trap, but compilers should not worry about accurate debugging.