Floating Point Visually Explained (2017)(fabiensanglard.net) |
Floating Point Visually Explained (2017)(fabiensanglard.net) |
There is no easy way to make it work in decimal. So you need to learn how fractional parts work in binary, then move up to scientific notation in binary. Then you probably noticed that all numbers start with 1, so you can knock it off to save space. Oh, and add a special case for zero, because it is the only number that doesn't start with 1, and if you are feeling adventurous, there are subnormal numbers.
That's why I find the explanation in the article absolutely brilliant. By treating the exponent as a window, not only you don't need to bother with scientific notation in binary, but zero and subnormal numbers fit very nicely.
In fact, I've been working for years with floating point numbers and feel stupid for not realizing it works just as described in the article. And feeling stupid after reading something is usually a very good sign. I clicked, expected an article I could dismiss in a typical Hacker News fashion, but no, not this time.
The window and offset explanation naturally accounts for the leading bit, whereas the scientific notation explanation needs further explanation to explain how the leading bit works.
In short, 10^2 * .001 is not allowed in floating point. You can't have a mantissa that starts with 0. The leading bit means all mantissas must be greater than 1.
Without this understanding you won't have an intuitive understanding of the range and precision of floating point, which is why I think the window+offset explanation is much more natural.
In binary, we always know the first non-zero digit of any number - so there's no need to write it down. We know it's 1 because, well, binary.
So we don't waste space, and only write down all the other digits, and then use the exponent to put the 'point' into the right place.
We save a bit of space doing this.
The mantissa can only be a value from [1,2).
i.e. in scientific notation: exponent * mantissa, the mantissa cannot be less than 1, or >= 2 in floating point.
So given that the mantissa can go from 000000 to 111111, what should the values represnet? Obviously it should represent the values from [1,2). Calling it a leading bit is more confusing than it needs to be. Its better to just call it an assumed minimum value.
When I was taught scientific notation in middle school, high school, and at college, it was always explicitly stated that:
1. You can have only one digit before the decimal point.
2. That digit cannot be 0 (unless your number is 0, of course). So 0.3 * 10^5 is not scientific notation.
This is no different. The "twist" is that there is only one possible non-zero bit, whereas in decimal it could be [1-9].
I think this explanation is neat. However, the "usual" formulation is just scientific notation, with the optimization that one bit is redundant.
Personally, I prefer the alternative notation: M * 2^(exponent-precision+1), with M being a p-bit integer. It's easier to work with when you know that M is always an integer, and you don't need to deal with fractions in base 2. In fact, FP made a lot more sense when I took this formulation in decimal, and worked with that.
Demo: #include <stdio.h>
int
main ()
{
volatile float v;
float acc = 0;
float den = 1.40129846432e-45;
for (size_t i; i < (1ul<<33); i++) {
acc += den;
}
v = acc;
return 0;
}
With -01:
$ gcc float.c -o float -O1 && time ./float
./float 8.93s user 0.00s system 99% cpu 8.933 totalWith -O0: $ gcc float.c -o float -O1 && time ./float ./float 20.60s user 0.00s system 99% cpu 20.610 total
/tmp>cat t.c
#include <stdio.h>
int main() {
float acc = 0;
float den = 1.40129846432e-45;
for (size_t i = 0; i < (1ul<<33); i++) acc += den;
printf("%g\n", acc);
return 0;
}
/tmp>gcc -O3 t.c && time ./a.out
2.35099e-38
./a.out 5.94s user 0.00s system 99% cpu 5.944 total
/tmp>gcc -O3 -ffast-math t.c && time ./a.out
0
./a.out 1.50s user 0.00s system 99% cpu 1.502 total
So subnormal numbers are supported in -O3 unless you specify -ffast-math. And it definitely makes a difference (gcc 9.3, Debian 11, Ryzen 7 3700X).EDIT That one is interesting too
/tmp>clang -O3 t.c && time ./a.out
2.35099e-38
./a.out 6.04s user 0.00s system 99% cpu 6.044 total
/tmp>clang -O3 -ffast-math t.c && time ./a.out
0
./a.out 0.10s user 0.00s system 99% cpu 0.101 total
clang (9.0.1) performs about the same without -ffast-math; but with it, it managed to optimize the loop away.I've been fighting the compiler to generate a minimal working example of the subnormals, but didn't have any success.
Some things take need to be taken in account (from the top of my head):
- Rounding. You don't want to get stuck in the same number. - The FPU have some accumulator register that are larger than the floating point register. - Using more register than the architecture has it not trivial because the register renaming and code reordering. The CPU might optimize in a way that the data never leaves those register.
Trying to make a mwe, I found this code:
#include <stdio.h>
int
main ()
{
double x = 5e-324;
double acc = x;
for (size_t i; i < (1ul<<46); i++) {
acc += x;
}
printf ("%e\n", acc);
return 0;
}
Runs is fraction of seconds with -O0: gcc double.c -o double -O0
But takes forever (killed after 5 minutes) with -O1: gcc double.c -o double -O1
I'm using gcc (Arch Linux 9.3.0-1) 9.3.0 on i7-8700I also manage to create a code that sometimes run in 1s, but in others would take 30s. Didn't matter if I recompiled.
Floating point is hard.
$ gcc-9 -Wuninitialized -O0 float.c # NO WARNINGS!!!
$ gcc-9 -O1 float.c # NO WARNINGS!!!
$ gcc-9 -Wuninitialized -O1 float.c
float.c: In function 'main':
float.c:9:5: warning: 'i' is used uninitialized in this function [-Wuninitialized]
9 | for (size_t i; i < (1ul << 33); i++) {
| ^~~ $ gcc hn.c -O0 -o hn
for (size_t i; i < (1ul<<46); i++) {
printf ("%zd\n",i);
acc += x;
break;
}
Without break: $ gcc hn.c -O0 -o hn
for (size_t i; i < (1ul<<46); i++) {
printf ("%zd\n",i);
acc += x;
}Another explanation using the window + offset terminology used in the post is that the offset is a percentage of the way through the window. So, for a window of 2^x, the difference between an offset of y and y + 1 is 2^(x-23) or 1/2^(-23) of 2^x. Put another way, floating point numbers do not have absolute error like integers (each number is within 1 of a representable value), but % error (each number is within 1/2^(-23) of a representable value). Essentially, floating point numbers use % error bars instead of absolute error bars.
Using this model you can even see how to create your own floating point numbers. Just pick a % precision you want, for single FP that is 1/2^(-23) and double FP 1/2^(-52), that defines the range of your mantissa (offset). Then pick a range of x values you want to represent, that is the range of your exponent (window).
As an aside, subnormal numbers do not respect this principle. They extend the expressible range for very small numbers by sacrificing % error for those numbers. In the very worst case of the smallest subnormal number you can get 25% error (it might actually be 50%). As might be imagined, this plays havoc on error propagation since if you ever multiply by a number that just so happens to be the smallest subnormal, all your multiplies might suddenly be off by a factor of 25% instead of the normal 100 * 2^(-23)% which is 2,000,000 times the % error which is quite a bit harder to compensate for. This is why many people consider subnormals to be a blemish.
[1] The approximation is actually offset in the x direction for the bias. If you want to be more accurate, you are actually graphing 2^(x - 127).
I am also interested in your statement that certain algorithms depend on subnormals as defined in the floating point spec. Can you provide an example of such an algorithm? I can intuit how it might be desirable to have a single "too small/epsilon" value, but I do not see offhand how you can leverage the full range of subnormals in any reasonably generic way that is not extremely dependent on the specific problem and scale (i.e. multiply all numbers by 2^30 and increase the maximum exponent by 30, do you get the same output multiplied by 2^30), so I would like to see how it is done.
In terms of my two possible interpretations, the first is that subtracting two unequal floating point numbers yield 0. I am pretty sure this is not the case. It may yield no change, but I am pretty sure it can not yield 0. The other is that subtracting two unequal arbitrary precision numbers represented as floating point numbers yield 0. This is true, but is a known limitation of emulating arbitrary precision arithmetic using limited precision and must always be accounted for. If this is what you meant, then we can just choose numbers too small to be expressed by subnormals to cause the problem to occur again, so all it does is allow handling of a few more cases at the cost of complexity and non-uniformity. If you did not mean either of these two interpretations, can you explain what you meant, preferably with a concrete example?
Think 32-bit floats as a 256-bit buffer interpreted as a fixed precision number, with the decimal point straight in the middle, but with the limitation that you can set only a continuous 24-bit window on that buffer to non-zeros.
Then, the 8 bits of the exponent determine where in the bit buffer that window points to, and the 23 (+1) bits are the contents of the window.
0: https://stackoverflow.com/questions/43965347/ulp-unit-of-lea...
There is two upsides to this: 1.) All quantities are order one, safely away from underflow and infinity. 2.) The equations are generally simpler, which was important when the many limitation in simulation codes was flops, not memory bandwidth.
There are also important downsides: 1.) You have to manipulate the equations before you start coding. Porting features from a code that used another normalization is quite error-prone. 2.) All quantities are order one, which makes it hard to detect if you accidentally use and electric field instead of a magnetic field. All you see is a number of roughly the correct magnitude. 3.) Comparison between codes is more difficult, in particular if the use different normalization, because you have to convert from "code units" to actual SI units.
https://books.google.de/books?id=uzJbyD6_3DAC&pg=PA4
And yes you have to be consistent and stay in that unit system. The other remaining two we used plain Kelvin for temperature and for the charge Coulomb was measures in electron charge. That's it.
https://en.wikipedia.org/wiki/Half-precision_floating-point_...
Particularly because it is easier to think of 32 bit IEEE-754 to BFloat16, but with fewer bits of think about (& possibly you can enumerate the entire range in a laptop to test something like "will this function work for all values?").
[1] - https://en.wikipedia.org/wiki/Bfloat16_floating-point_format...
Now I know I'm weird, as that formula makes sense to me.
Floating points, like two's complement, are hardware derived/limited creations. You have to understand the underlying hardware and binary limitations to understand why they exist and why they were created in that manner. The same thing applies to ascii. Zero is 48, A is 65 and a is 97. The mapping seems arbitrary unless you see the bit pattern and you realize how clever the design is and the reason why some of the symbols got their mappings in ascii.
> While I was writing a book about Wolfenstein 3D[1], I wanted to vividly demonstrate how much of a handicap it was to work without floating point
I would've expected fixed point to work fine for games, because that's a domain where you know the data, and in particular the dynamic range, in advance, so the 'automatically adjust to whatever dynamic range happens to be in the data' feature of floating point isn't needed. What am I missing? (If the answer is 'it would take too long to explain here, but he does actually explain it in the book', I'm prepared to accept that.)
Floating point numbers are just fractions but with one extra condition: the denominator is a power of two.
The normal rules of fractions apply. If you want to add them, you have to make sure the denominators match, which would involve scaling the numerators too.
Just like fractions, there are multiple ways of writing the same value. 3/2 is the same as 6/4.
You can't write 1/3 exactly because, hey look at your denominator, it's 3. Which isn't a power of 2, is it? So that can't be a floating point value.
This is not correct. In binary you only have 2 choices {0,1} for the numerator, so 3/2 is represented 1.1 and 6/4 can only be represented as 1.1 (feel free to add the completing 0s)
3 in binary is 11. 6 in binary is 110.
3/2 (decimal) is 11/10 (binary) and 6/4 (decimal) is 110/100 (binary).
Of course the denominators always have trailing zeros, so can be (and are) represented more compactly.
The floating number line is actually an (unevenly spaced in reals, uniformly spaced in binary) discrete number line that was used to approximate continuous infinite real numbers, in finite precision e.g.
Reals
0 __________________________________________________ 1
Floats
0 .. .... .. ... .. ... ... ... .. ... ... ... .. .... ... ... ... . ... .... ... .. . ... 1
0.... . . . . . . . . . . . 1
(Hacker News eats consecutive spaces, it seems like :/)"Trivia: People who really wanted a hardware floating point unit in 1991 could buy one. The only people who could possibly want one back then would have been scientists (as per Intel understanding of the market). They were marketed as "Math CoProcessor". "
The 486DX (1989) was already common in 91/92 and came with a floating point unit. I had a 50mhz 486DX, and I was not by any means wealthy. FP unit was certainly used by lots of software, especially things like Excel, but C compilers certainly produced code for it if you had one.
Likewise, the 68040 (1990) had onboard FP. Macintosh Quadra, Amiga 4000, and various NeXT models had it.
Yes if you bought a 386 you often had to get a floating point co-processor as upgrade, but it wasn't _that_ uncommon. Same on the 68k series; I knew people with Atari MegaSTes (68000) that bought an FP co-processor. They weren't astronomers :-)
This feels like recent history, am I really that old?
* [2^-127, 2^-126] * ... * [1, 2] * [2, 2^2] * [2^2, 2^3] * ... * [2^128, 2^129]
and the window tells you where you are in this "window". You have 8 bits to figure out where you are in there, so it's obvious that you have much more precision if you are in the window [1, 2] than in the window [2^128, 2^129]
Discussed at the time: https://news.ycombinator.com/item?id=15359574
(Reposts are ok after a year or so.)
The IEEE-754 has a lot of redundant representation. Not where you would expect though.
Caveat: Those features are invaluable for some niche applications, but not for the average joe.
To start. Every IEEE-754 float has two zero representation: one for positive zero and another negative negative zero (sic).
The special numbers are another source of redundancy. The the double format, have about 9,007,199,254,740,992 different combination to encode three different states that a production ready software shouldn't reach: NaN, +inf and -inf.
Other than the redundancy, the double have many rarely used combination. For instance, the subnormals representation. Unless you are compiling your program with -O0 or with some exotic compiler, they are disabled by default. One subnormal operation can take over a hundread of cycles to complete. Therefore, more 9,007,199,254,740,992 wasted combination.
If that wasn't bad enough, since the magnitude of the numbers follows a normal distribution (someone whose name I forgot's law), the most significant bits of the exponent field are very rarely used. The IEEE-754 encoding is suboptimal.
The posit floating point address all those issues. It uses an tapered encoding for the exponent field.
1. -0 now encodes NAN.
2. +inf/-inf are all Fs with sign: 0x7FFFFFFF, 0xFFFFFFFF.
3. 0 is the only denorm.
Which does four good things:
1. Gets rid of the utter insanity which is -0.
2. Gets rid of all the redundant NANs.
3. Makes INF "look like" INF.
4. Gets rid of "hard" mixed denorm/norm math.
And one seriously bad thing:
1. Lose a bunch of underflow values in the denorm range.
However, as to the latter: who the fuck cares! Getting down to that range using anything other than divide-by-two completely trashes the error rate anyways, so why bother?
The rest of Gustafson's stuff always sounds like crazy-people talk, to me.
But isn't that accounted for by the fact the floating point number distribution is non-uniform? Half of all floating point numbers are between -1 and 1.
Could you elaborate on this? Maybe an example? This sounds interesting but I'm failing to grasp it.
“IEEE 754 […] Has the interesting and useful property that two's complement comparisons of the underlying bit pattern of any two IEEE 754 numbers will have the same result as comparing the numbers that are represented”
That means that, if you interpret the bits of a float/double as an int32/int64, increase that integer by one, and then interpret the bits of the result as a float/double, you get the smallest float/double that’s larger than what you started with (with exceptions for NaN, infinities, +/- zero, and, possibly, some categories I forget)
That can be useful if you want to iterate over all floats between 0.0 and 1.0, for example, but may not be that efficient on some modern hardware, where moving data from the “float side” of the CPU to the “integer side” is expensive)
That sucks. It means I cannot represent any number between 0 and 1.
Ideally, we want the exponent to be somewhat symmetric, which in this case means going from -127 to 127. To translate 0 to 255 to that interval, you subtract 127. You are simply shifting your exponent so that the allowed values for your exponent is symmetric about 0.
Now your window starts with [2^-127, 2^-126] and so on.
(I may have an off by one error here).
So now, take the 16-bit pattern (0000 0011 1110 0000); you get a 0 sign bit, an exponent of 0, and a mantissa of 1.96875. So, with a bias of -15, that's 1.96875 * 2 ^ 0-15 = 0.00006008148193359375.
As you creep up to the "next" exponent, you see that the boundaries are respected. The last 2^-15 number is 0000 0011 1111 1111 (0x3ff) and the first 2^-14 number is 0000 0100 0000 0000 (0x400); now the exponent has changed from 0 to 1, but the floating point converts to 1.999023 * 2 ^ -15 = 0.0000610053539276123046875 and 1.000000 * 2 ^ -14 = 0.00006103515625: a bit-by-bit comparison has 0x3ff < 0x400. (This would be true regardless of bias).
Now imagine that IEEE 754 stored exponents in two's-complement format instead; the exponent 01111 would be interpreted as +31, but the "next" exponent, bit-wise, would be 10000 = -32. This means that you'd end up with 0011111111111111 = 1.999023 * 2 ^ 31 = 4292869204.475904, but the next binary number, 0100000000000000 would be 1.0 * 2 -32 = 0.00000000023283.
If you have a process that converts floats into other formats with more restricted exponents, like human readable strings, it might matter.
(Also, the real "waste" is only on the multiple NaN values, since the zeros always "waste" only a single value for the "negative zero", and the infinities always "waste" only two values; AFAIK, both negative zero and the infinities are necessary for stability of some calculations.)
My reasoning is about how much information can be encoded in the format.
The IEEE-754 double format have 11 bits to encode the exponent and 52 bits to encode the fraction.
Therefore, the multiplying factor from double is in the range: 2^1023 to 2^-1022. To give an idea how large this is, the scientist estimate there are about 10^80 atoms in universe, in base 2 this is "little" less than 2^266.
Most application only don't work with numbers on this magnitude. And the ones that does, don't care so much about precision.
Let me know if there is something wrong with my logic.
The set of floating point values is intentionally biased towards smaller numbers. So yes, while few people have a need to deal with such large numbers, there are also far fewer such large numbers.
I think your flaw is that you are looking at the total set of all possible floating point numbers, and seeing a chunk that few people will use. Don't do that. Look at the 63 bits, and point out which ones you would like to remove/compress/etc. Yes, the combination of having an exponent of all 1's is rare, but none of those bits individually is "rare". The MSB in the exponent is used to represent all the numbers from 1 to 2, for example.
I don't doubt one can come up with a clever scheme that provides a different, encoding which is even more heavily biased towards more "reasonable" numbers, but it's not clear what the gain would be. You'd have to come up with novel algorithms for all the floating point operations (addition, division, etc) - and would they be as fast as the current ones?
I've yet to find real world problems for which the current encoding is pretty poor. Contrived ones, sure - but real problems? Rare.
According to Wikipedia (https://en.wikipedia.org/wiki/Half-precision_floating-point_...), this is the IEEE 754 standard binary16 format.
Changing a 0 bit to a 1 in an integer increases its value, and by the rule I gave, should also increase the value of the floating point number with the same bit pattern. It doesn’t matter whether that flipped bit is in the mantissa or the exponent.
That requires the use of the biased number in the exponent. In particular, the “all zeroes” bit pattern for the exponent must be the representation for the lowest possible exponent, and the “all ones” one that for the highest. It cannot be the ‘normal’ two’s complement representation of -1.
When working with numbers that exceed the posit representation you use the quire to accumulate. At the end of the computation you convert again to posit to store in memory, or store the quire in memory.
In C, it would look like something like:
posit32_r a, b;
quire_t q;
q = a; // load posit into quire
q = q + b; // accumulate in quire
a = q; // load quire into posit
> The rest of Gustafson's stuff always sounds like crazy-people talk, to me.I've read all his papers on posit and agree. But I do believe the idea of encoding exponent with golomb-rice is actually very good and suit most users. The normalization hardware (used in the subtraction operation) can be easily repurposed to decode the exponent and shift the exponent.
But the quire logic (fixed point arithmetic) might use more area than a larger float-point. But maybe in power usage it pays of.
This would imply that 1 / (1 / -inf) would now be +inf instead of -inf.