Honey, I shrunk {fmt}: bringing binary size to 14k and ditching the C++ runtime

(vitaut.net)

244 points by karagenit 1 year ago | 128 comments

magnio 1 year ago |

> All the formatting in {fmt} is locale-independent by default (which breaks with the C++’s tradition of having wrong defaults)

Chuckles

tialaramex 1 year ago | |

It's really more of a committee thing - so we wouldn't necessarily expect fmt, a third party library, to have wrong defaults.

Astoundingly, when this was standardised (as std::format in C++20), the committee didn't add this mistake back (it is present in numerous other parts of the standard). That gives some small hope to the proposers who plead with the committee not to make things unnecessarily worse in order to make C++ "consistent".

ape4 1 year ago | | |

You can pass in a locale as a parameter. (Of course this doesn't fix the default)

formerly_proven 1 year ago | | |

I'm filing a Defect Report about std::format disrespecting locale as we speak.

h4ck_th3_pl4n3t 1 year ago |

It's kind of mindblowing to see how much code floating point formatting needs.

The linked dragonbox [1] project is also worth a read. Pretty optimized for the least used branches.

[1] https://github.com/jk-jeon/dragonbox

ziml77 1 year ago | |

I learned how much floating point formatting needs when I was doing work with Zig recently.

Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime). But this time the binary seemed to be much larger than I've seen Zig generate before and it didn't make sense based on how little the tool was actually doing. Dropping it into Binary Ninja revealed that the majority of the code was there to support floating point formatting. So I changed the code to cast the floating point number to an integer before printing it out. That change resulted in a binary that was down at the size I had been expecting.

delta_p_delta_x 1 year ago | | |

> Usually the Zig compiler can generate binaries smaller than MSVC because it doesn't link in a bunch of useless junk from the C Runtime (on Windows, Zig has no dependency on the C runtime)

MSVC defaults to linking against the UCRT, just like how Clang and GCC on Linux default to linking against the system libc. This is to provide a reasonably useful C environment as a sane default.

If you don't want UCRT under MSVC, supply `/MT /NODEFAULTLIB /ENTRY:<function-name>` in the command-line invocation (or in the Visual Studio MSBuild options).

It is perfectly possible to build a Win32-only binary that is fully self-contained and only around 1 KiB.

jk-jeon 1 year ago | |

https://github.com/jk-jeon/dragonbox/discussions/57#discussi...

We have been running some experiments on optimizing for size, and currently it can be reduced to ~3k on 8-bit AVR. That build only contains the implementation and table for single-precision binary32; double precision requires quite a bit more. At the same time, much of the bloat is due to how limited AVR is. On platforms like x64 it should be much smaller.

You can certainly say 3k is still huge though.

mananaysiempre 1 year ago | |

> It's kind of mindblowing to see how much code floating point formatting needs.

If you want it to be fast. The baseline implementation isn’t terrible[1,2] even if it is still ultimately an implementation of arbitrary-precision arithmetic.

[1] https://research.swtch.com/ftoa

[2] https://go.dev/src/strconv/ftoa.go

vitaut 1 year ago | | |

If I interpret the numbers correctly it is of the order of ~1000 times slower than modern algorithms such as Dragonbox.

vitaut 1 year ago | |

{fmt} has an optional implementation of the old Dragon4 algorithm that is smaller in terms of code size but not as fast.

franga2000 1 year ago | |

I'm guessing the majority of use cases limit the number of decimal places that are printed. I wonder if it would be more efficient to multiply by 10 to the power of the number of decimals, convert to int, itoa() and insert the decimal point where it belongs...
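For illustration, a minimal sketch of the scale-and-convert idea described above. The name `format_fixed` is hypothetical and not part of {fmt} or any other library, and `std::to_string` stands in for itoa(). It assumes the scaled value fits in a 64-bit integer, which is exactly the restriction that makes the approach small: it cannot handle the full range of a `double`.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: fixed-point formatting by scaling to an integer.
// Returns an empty string when the input is outside the range this
// simple approach can handle (NaN, infinity, or too large after scaling).
std::string format_fixed(double value, int decimals) {
    if (decimals < 0 || decimals > 18) return "";

    std::uint64_t scale = 1;
    for (int i = 0; i < decimals; ++i) scale *= 10;

    double scaled_d = value * static_cast<double>(scale);
    if (!(std::fabs(scaled_d) < 9.0e18)) return "";  // would not fit in 64 bits

    long long scaled = std::llround(scaled_d);
    bool negative = scaled < 0;
    // Negate in unsigned arithmetic to avoid signed overflow.
    std::uint64_t magnitude = negative
        ? 0u - static_cast<std::uint64_t>(scaled)
        : static_cast<std::uint64_t>(scaled);

    std::string digits = std::to_string(magnitude);

    // Pad with zeros so at least one digit precedes the decimal point.
    std::size_t frac = static_cast<std::size_t>(decimals);
    if (digits.size() <= frac) digits.insert(0, frac + 1 - digits.size(), '0');
    if (frac > 0) digits.insert(digits.size() - frac, 1, '.');
    if (negative) digits.insert(0, 1, '-');
    return digits;
}
```

Calling `format_fixed(3.14159, 3)` yields `3.142`. Note that the multiplication itself rounds in binary floating point, so the last digit is not guaranteed to be correctly rounded for every input.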

jk-jeon 1 year ago | | |

Not sure what you mean by decimal points. Did you mean the number of decimal digits to be printed in total, or the number of digits after the decimal dot, or something else?

In any case, what Dragonbox and other modern floating-point formatting algorithms do is already roughly what you describe: they compute the integer consisting of digits to be printed, and then print those digits, except:

- Dragonbox and some of the other algorithms have totally different requirements than `printf`. The user does not request the precision; rather, the algorithm determines the number of digits to print. So `1.2` is printed as `1.2` and `1.199999999999` is printed as `1.199999999999`. You can read about the exact requirements in the Readme page of Dragonbox.

- The core of modern floating-point formatting algorithms is on how to compute the needed multiplication by a power of 10 without needing to do it by the plain bignum arithmetic (which is incredibly slow). Note that a `float` (assuming it's IEEE-754 binary32) instance can be as large as 2^100 or as small as 2^-100. It's nontrivial to deal with these numbers without incorporating bignum arithmetic, and even if you just give up avoiding it, bignum arithmetic itself is quite nontrivial in terms of the code size it requires.
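The "algorithm picks the digit count" behaviour from the first bullet can be observed with the standard library alone. This is a sketch that uses C++17 `std::to_chars`, which is specified to produce the shortest representation that round-trips; it is not Dragonbox itself, and the wrapper name `shortest` is hypothetical.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical wrapper: shortest round-trip text for a double.
// Without a format or precision argument, std::to_chars emits the fewest
// digits needed to recover exactly the same value when parsed back.
std::string shortest(double value) {
    char buf[32];
    auto result = std::to_chars(buf, buf + sizeof buf, value);
    if (result.ec != std::errc{}) return {};  // buffer too small
    return std::string(buf, result.ptr);
}
```

With this, `shortest(1.2)` gives `1.2` and `shortest(0.1 + 0.2)` gives `0.30000000000000004`, because no shorter string identifies that particular double.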

pzmarzly 1 year ago |

> However, since it may be used elsewhere, a better solution is to replace the default allocator with one that uses malloc and free instead of new and delete.

C++ noob here, but is libc++'s default allocator (I mean, the default implementation of new and delete) actually doing something different than calling libc's malloc and free under the hood? If so, why?
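For context on the quoted passage, here is a minimal sketch of what a malloc/free-based allocator can look like. The name `malloc_allocator` is hypothetical and this is not necessarily the article's code. The default `operator new` usually does end up calling malloc, but its definition (along with the `std::bad_alloc` it throws on failure) lives in the C++ runtime library, so using it creates a link-time dependency on that library. The sketch assumes aborting on allocation failure is acceptable.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator that bypasses operator new/delete entirely.
template <typename T>
struct malloc_allocator {
    using value_type = T;

    malloc_allocator() = default;
    template <typename U>
    malloc_allocator(const malloc_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) std::abort();
        void* p = std::malloc(n * sizeof(T));
        // Abort instead of throwing std::bad_alloc, which would bring
        // exception-handling support back into the binary.
        if (p == nullptr && n != 0) std::abort();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const malloc_allocator<T>&, const malloc_allocator<U>&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(const malloc_allocator<T>&, const malloc_allocator<U>&) noexcept {
    return false;
}
```

It plugs into standard containers in the usual way, for example `std::vector<int, malloc_allocator<int>>`.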

londons_explore 1 year ago |

I kinda hoped a formatting library designed to be small and able to print strings, and ints ought to be ~50 bytes...

strings are ~4 instructions (test for null terminator, output character, branch back two).

Ints are ~20 instructions. Check if negative and if so output '-' and invert. Put 1000000000 into R1. divide input by R1, saving remainder. add ASCII '0' to result. Output character. Divide R1 by 10. put remainder into input. Loop unless R1=0.

Floats aren't used by many programs so shouldn't be compiled unless needed. Same with hex and pointers and leading zeros etc.

I can assure you that when writing code for microcontrollers with 2 kilobytes of code space, we don't include a 14 kilobyte string formatting library...

ptspts 1 year ago |

Shameless plug: printf("Hello, World!\n"); is possible with an executable size of 1008 bytes, including libc with output buffering: https://github.com/pts/minilibc686

Please note that a direct comparison would be apples-to-oranges though.

jart 1 year ago | |

That's because the compiler turns it into puts

a1o 1 year ago |

> Considering that a C program with an empty main function is 6kB on this system, {fmt} now adds less than 10kB to the binary.

Interesting, I've never done this test!

JonChesterfield 1 year ago | |

It varies widely with whether the C library is dynamically or statically linked and with how the application (and C library) were built. And on which C library it is. Also a little on whether you're using elf or some other container.

neonsunset 1 year ago |

It's always fmt. Incredibly funny that this exact problem now happens in .NET. If you touch enough numeric (esp. fp and decimal) formatting/parsing bits, linker ends up rooting a lot of floating point and BigInt related code, bloating binary size.

pjmlp 1 year ago | |

Still looking forward to the Delphi-like experience with Native AOT, thankfully getting better.

msephton 1 year ago |

Very enjoyable. I love this sort of thinking-outside-the-box optimisation.

rty32 1 year ago |

Maybe I am slow, but it took me a while to realize that the "14k" in the title refers to "14kB".

hrydgard 1 year ago | |

What else would it possibly mean?

k is very common shorthand for kB, at least historically.

Rygian 1 year ago | | |

14000 lines of assembler?

  fmt::print(fmt::emphasis::bold | fg(fmt::color::red),
             "Elapsed time: {0:.2f} seconds", 1.23);
  fmt::print("Elapsed time: {0:.2f} seconds",
             fmt::styled(1.23, fmt::fg(fmt::color::green) | fmt::bg(fmt::color::blue)));
  fmt::print("{}", fmt::join(std::vector<int>{1, 2, 3}, ", "));
  fmt::print("strftime-like format: {:%H:%M:%S}\n", 3h + 15min + 30s);