Beware of fast-math(simonbyrne.github.io) |
Beware of fast-math(simonbyrne.github.io) |
I don't see how Herbie's accuracy improvements could not be undone, if Herbie's output is fed to a back-end which doesn't preserve Herbie's order of operations as Herbie requires and depends on.
Of course, the 'real' solution is actual Numerical Analysis (as I'm sure you know) to keep the error properly bounded, but it's really interesting to have a sort of middle ground which just stabilizes the math locally... which might be good enough.
Other fun fact: Numerical Analysis is a thing that's really hard to imagine unless you happen to be introduced to it during an education. It's so obviously a thing once you've heard of it, but REALLY hard to come up with ex nihilo.
If I follow at high level, it looks like Herbie is trying to rewrite expressions to minimise error without runtime performance constraints.
Are there alternative tools that focus on rewriting code to maximise performance while keeping error below some configurable bound?
i guess compilers are generally focused on the latter problem, perhaps without giving the user much control over the degree of error they are willing to tolerate.
There are! See followup work by @pavpanchekha and others on "Pherbie", which finds a set of Pareto-optimal rewritings of a program so that it's possible to trade-off error and performance: https://ztatlock.net/pubs/2021-arith-pherbie/paper.pdf.
[1] https://ocw.mit.edu/courses/mathematics/18-335j-introduction...
I would really like for someone to take fast math seriously, and to provide well scoped and granular options to programmers. The Julia `@fastmath` macro gets close, but it is two broad. I want to control the flags individually.
Also the question how that interacts with IPO/inlining...
So one can (and we do at work) have @optmath which is a specific set of flags (just a value we defined at compile time, the compiler recognizes it as a UDA) we want as opposed to letting the compiler bulldoze everything.
#pragma GCC optimize(“fast-math")
Is this a sensible approach? What are others experiences around this? I've never bothered with this kind of optimisation and I now vaguely feel like I'm missing out.
I tend to use calculations for deterministic purposes rather than pure accuracy. 1+1=2.1 where the answer is stable and approximate is still better and more useful than 1+1=2.0 but where the answer is unstable. Eg because one of those is 0.9999999 and the precision triggers some edge case.
As always, the devil is in the details: you typically can't check exact equality, as e.g. reassociating arithmetic can give slightly different (but not necessarily worse) results. So the challenge is coming up with appropriate measure of determining whether something is wrong.
sin(single(t)) == bad
single(sin(t)) == good t = mod(t + ts, 1 / f)
Since I'm just sending a static frequency the range of time never needs to be beyond one period. However, using a double here is far from the critical path in increasing performance and it all runs fast enough anyway.How reliable is it to keep the exreme orders in expectation that the resp. quatities would cancel the orders properly yielding a meaningful value (rounding wise)?
For example, calculating some resulting value function, expressed as
v(x)=f(x)/g(x),
where both f(x) and g(x) are oscillating with a number of roots in a given interval of x.
However if you do f(x) - g(x), the absolute error is on the order of 2e190: if f(x) - g(x) is small, then now the relative error can be huge (this is known as catastrophic cancellation).
One example talking about this here: http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
https://discourse.julialang.org/t/array-ordering-and-naive-s...
def re_add(a,b,c,d,e):
return (a+b+c+d+e) == (2 * (c+a+b+d+e))
print(re_add(1e17, -1e17, 3, 2, 1))In my experience much scientific Fortran code, at least, is OK with something like -ffast-math, at least because it's likely to have been used with ifort at some stage, and even with non-754-compliant hardware if it's old enough. Obviously you should check, though, and perhaps confine such optimizations to where they're needed.
BLIS turned on -funsafe-math-optimizations (if I recall correctly) to provide extra vectorization, and still passed its extensive test suite. (The GEMM implementation is possibly the ultimate loop nest restructuring.)
I got a four times speedup on <cmath> functions with no loss in accuracy.
I'd suggest -ffp-contract=fast is a good idea for 99% of code. It's only going to break things where very specific effort has gone in to the numerical analysis, and likely the authors of such things are sufficiently fp-savy to tell you not to do the thing.
Complex multiplication and division follow Fortran rules. Range reduction is done as part of complex division, but there is no checking whether the result of a complex multiplication or division is NaN + I*NaN, with an attempt to rescue the situation in that case.
The default is -fno-cx-fortran-rules.
[1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-Option...For example Zen CPUs have negligible penalties for handling denormals, but many Intel models have a penalty between 100 and 200 clock cycles for an operation with denormals.
Even on the CPU models with slow denormal processing, a speedup between 100 and 1000 exists only for the operation with denormals itself and only when the operation belonged to a stream of operations working at the maximum CPU SIMD speed, when during the one hundred and something lost clock cycles the CPU could have done 4 or 8 operations during every clock cycle.
Any complete computations cannot have a significant percentage of operations with denormals, unless they are written in an extremely bad way.
So for a complete computation, even on the models with bad denormal handling, a speedup of more than a few times would be abnormal.
The only controversy that has ever existed about denormals is that handling them at full speed increases the cost of the FPU, so lazy or greedy companies, i.e. mainly Intel, have preferred to add the flush-to-zero option for gamers, instead of designing the FPU in the right way.
When the correctness of the results is not important, like in many graphic or machine-learning applications, using flush-to-zero is OK, otherwise it is not.
Outside of a vanishingly few edge cases, I think the subnormal debate is basically over, except, apparently, inside of Intel. Every single other architecture and microarchitecture manages to handle subnormals with relative ease, with only a handful of clock cycle penalty. I think Intel hardware should be called out, not programmers who just want the 35 year old floating point standard to be fast like it is on other chips.
Similar stories happened in the GPU world, and my understanding is that essentially all GPUs are converging on IEEE compliance by default now.
You could also say some companies have been kind enough to make hardware for gamers that doesn’t have costly features they do not need.
That's why finer-grained flags are needed — yes, FMAs and SIMD are essential for _both_ performance and improved accuracy, but `-ffast-math` bundles so many disparate things together it's impossible to understand what your code does.
> And most math will work fine with fast-math, if you are careful how you write it.
The most hair-pulling part about `-ffast-math` is that it will actively _disable_ your "careful code." You can't check for nans. You can't check for residuals. It'll rearrange those things on your behalf because it's faster that way.
I'm not an expert on this, but for my own code I've been meaning to better understand the discussion here [1], which suggests that there ARE ways of getting FMAs, without the sloppiness of fast-math.
[1] https://stackoverflow.com/questions/15933100/how-to-use-fuse...
a * d - b * c
If a == b and c == d (and all are finite), then this should give 0 (which is true for strict IEEE 754 math), but if you replace it with an fma then you can get either a positive or negative value, depending on the order in which it was contracted. Issues like this pop up in complex multiplication, or applying the quadratic formula.
- `fma(a, b, c)` is exact but may be slow: it uses intrinsics if available and falls back to a slow software emulation when they're not
- `muladd(a, b, c)` uses the fastest possibly inexact implementation of `a*b + c` available, which is FMA intrinsics if available or just doing separate `*` and `+` operations if they're not
That gives the user control over what they need—precision or speed. If you're writing code that needs the extra precision, use the `fma` function but if you just want to compute `a*b + c` as fast as possible, with or without extra precision, then use `muladd`.
There are ways, indeed, but they are pretty slow, it’s prioritizing accuracy over performance. And they’re still pretty tricky too. The most practical alternative for float FMA might be to use doubles, and for double precision FMA might be to bump to a 128 bit representation.
Here’s a paper on what it takes to do FMA emulation: https://www.lri.fr/~melquion/doc/08-tc.pdf
How can you use any numerical methods (like error analysis) if you don't have a solid foundation with strict rules to analyze on top on?
> Enables multi-objective improvement. Herbie will attempt to simultaneously optimize for both accuracy and expression cost. Rather than generating a single "ideal" output expression, Herbie will generate many output expressions. This mode is still considered experimental. This will take a long time to run. We recommend timeouts measured in hours.
The problem isn't necessarily the code they wrote themselves: it is often that they've compiled someone else's code or an open source library with fast-math, which broke some internal piece.
My guess would be that explicit state state changes show up in the generated machine code.
I recently built a modular additive music synthesizer called Flow (https://github.com/eclab/flow). When certain modules in the synthesizer [gradually] push certain state variables into the denormal range, my synthesizer will experience a roughly 100x slowdown. Mind you, this isn't due to DSP or even sound processing, and Flow isn't written in C, but in 100% pure *Java*. Since Java can't turn off denormals, I have to manually check for and zero them at strategic locations to avoid getting mired in the denormal quicksand.
I do not know how Java handles this, but maybe it actually enables exceptions for underflow which invoke some handler.
Otherwise I cannot see how you can obtain such a huge slowdown, unless your code consists entirely of back-to-back operations with denormals and of nothing else.
I am not sure what you mean by "state variables", but if they are pushed into the denormal range, they should be changed to double, not float.
If you push double variables into the denormals range, then it is likely that the algorithm must be modified, because this should not happen.
Underflows, i.e. denormals, are difficult to avoid when using float variables, which can be mandatory in DSP algorithms for audio or video, but outside the arrays processed with SIMD instructions at maximum speed, the scalar variables can be double, which should never underflow in most correct algorithms.
For computations run on CPUs, not GPUs, only very seldom there can be reasons to use a scalar float variable. Normally float should be used only for arrays.
In the context of sound, I could see this happening with an exponentially decaying envelope generator (or an IIR filter).
Ending with the data being entirely in the denormal range is a common occurrence in some audio algorithms (and in there, intel CPUs dominate by such a large margin it's not even funny) ; if that happens at the beginning of your signal processing pipeline you're in for a rough time
GPU hardware is a different, but similar story, from what I can see. It saves transistors to do FTZ, and the originally niche usage of FP to put pixels on the screen didn't really care so much about niggling details. But GPUs became general purpose and important, and they've been dragged into full compliance by application demands. It's the only sane outcome in the end. Instead, all this FTZ stuff has just made a mess at layers above. It would all be unnecessary if subnormals were as fast as AMD, ARM, IBM, and other chip manufacturers have managed to make them.
The only way to get around this in languages like Java, which cannot flush to zero, is to vigilantly check the values and flush them to zero manually.
The flip side question to you is, why should students get away with more than they need? Memory and cycles are wasting energy. We need engineers to understand how to be deeply efficient, not careless with resources. Memory is generally much more expensive than compute cycles in terms of energy use. Yes, please, teach the students how to program with less memory.
Low memory programming is a fantastic exercise for learning modern GPU programming, since you still need to conserve individual bytes when you’re trying to run ten thousand threads at the same time. Or if you’re just into Arduinos.
Other lessons that are great to learn, but take time to appreciate are how to avoid using any dynamic memory, how to avoid recursion, how to avoid function pointers or any of today’s tricky constructs (closures/futures/monads/y-combinators/etc.) I’m of course referring to how some people (like NASA) think of safety critical code https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev... But I will add that many of these rules have applied to console video game programming for a long time. They’re easing up lately, but the concepts still apply since coding for a console is effectively embedded programming.
Even when 32-bit floating point is used, for a greater dynamic range and for a 24-bit precision, instead of using 16-bit fixed-point numbers, proper DSP algorithm implementations still need to use some of the techniques that are necessary with fixed-point number algorithms, i.e. suitable scale factors must be inserted in various places.
A correct implementation must avoid almost all underflows and overflows, by appropriate scalings.
It’s not the professor’s job to minimize the student’s effort, it’s the student’s job. The arrangement is the opposite of your implication. The professor’s job is to get lazy students to confront and learn these concepts and have them practice enough to understand the concepts. Having to analyze carefully is the whole point.
I also agree about the first place you should go is to use all the tricks you have to make writing code faster… at least in business. I’m not sure that applies in school. But either way, this is precisely why school should have students practice things like floating point analysis and low-memory programming until they are part of the students’ bag of tricks, until they can do high quality engineering fluently.
BTW, just using doubles in the name of not having to analyze is not particularly great outside of school either. That does not fly where I work now (on GPU ray tracing), and would not have been acceptable when I worked in CG films or video games either. You might be underestimating how expensive doubles are. If you don’t know whether you need doubles, you probably don’t. If you have a problem that needs more than floats, and accuracy is that important, then you’ll need to justify why doubles are enough, so in practice you’ll have to analyze carefully anyway.
Maybe you’re just teasing me with the BigFloat suggestion, I can’t tell. Since they might be orders of magnitude slower than floats, they’re rarely justifiable as a robustness replacement, especially by someone who hasn’t analyzed carefully. That might be a firing offense at some jobs if done more than once. :P
And to this day, typing out the 'f' suffix on single precision literals is muscle memory for me after having had Steven Parker for my Ph.D. advisor.
Pixar has told me recently they still use doubles for some things in the CPU side of RenderMan, but I don’t know what for. There are some legitimate cases for it, and occasionally I dip my toes in hot water attempting to give advice to avoid doubles to people who know more than I do about how floats work and why they need doubles.
Steve Parker was the first person to explain to me (while he was still a student, and I was a much younger one) the sometimes surprising cost of having image sizes be powers of two (because of cache conflicts). Small world.