How to speed up the Rust compiler one last time (blog.mozilla.org)
> It’s rare that a single micro-optimization is a big deal, but dozens and dozens of them are. Persistence is key
Persistence is work. Mozilla is cutting the people who put in the work of staving off bitrot.
To clarify: I am still at Mozilla! But I will be working fully on Firefox for the foreseeable future. I have edited the opening paragraph of the post to make this clearer.
> I also did two larger “architectural” or “top-down” changes
My summer intern started doing profiling work on compile times with clang: https://lists.llvm.org/pipermail/llvm-dev/2020-July/143012.h...
Some things we found:
* for a large C codebase like the Linux kernel, we're spending way more time in the front-end (clang) than the backend (llvm). This was surprising based on rustc's experience with llvm. Experimental patches simplifying header inclusion dependencies in the kernel's sources can potentially cut down on build times by ~30% with EITHER gcc or clang.
* There's a fair amount of low-hanging fruit that stands out from bottom-up profiling. We've just started fixing these, but the most immediate was 13% of a Linux kernel build being spent recomputing target information for every inline assembly statement, in a way that was accidentally quadratic and not memoized when it could have been (in fact, my intern wrote patches to compute these at compile time, even). Fixed in clang-11. That was just the first found and fixed, but we have a good list of what to look at next. The only real samples showing up in the llvm namespace (vs clang) are llvm's StringMap bucket lookups, but those come from clang's preprocessor.
* GCC beats the crap out of Clang in compile times of the Linux kernel; we need to start looking for top down optimizations to do less work overall. I suspect we may be able to get some wins out of lazy parsing at the cost of missing diagnostics (warnings and errors) in dead code.
* Don't speculate on what could be slow; profiles will surprise you.
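The inline-assembly finding above is an instance of a general pattern: an expensive, pure computation is repeated for every occurrence of something even though its result depends only on a small key. A minimal sketch of the memoization fix follows, in Python rather than clang's C++. `compute_target_info`, `check_inline_asm`, the target strings, and the call counter are all hypothetical stand-ins, not clang's actual code.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive computation really runs


def compute_target_info(target: str) -> frozenset:
    """Hypothetical stand-in for an expensive, pure computation keyed by target."""
    global calls
    calls += 1
    # Pretend this builds a large table of valid register names for the target.
    return frozenset(f"{target}-r{i}" for i in range(1000))


@lru_cache(maxsize=None)
def cached_target_info(target: str) -> frozenset:
    """Same result, but computed only once per distinct target."""
    return compute_target_info(target)


def check_inline_asm(statements, lookup):
    """Validate each (target, register) statement using the given lookup."""
    return [reg in lookup(target) for target, reg in statements]


if __name__ == "__main__":
    stmts = [("x86_64", "x86_64-r1")] * 500
    check_inline_asm(stmts, compute_target_info)
    print("uncached calls:", calls)
    calls = 0
    check_inline_asm(stmts, cached_target_info)
    print("cached calls:", calls)
```

The cached variant does the expensive work once per distinct target instead of once per statement, which is the shape of fix described above. Computing the table ahead of time, as the intern's patches reportedly did, removes even that first call.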
> Using instruction counts to compare the performance of two entirely different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable to use them to compare the performance of two almost-identical programs
Agree. We prefer cycle counts via LBR, but only for comparing diffs of the same program, as you describe.
rustc sends large, generally unoptimized chunks to llvm, compared to clang. In Rust, the translation unit is at the crate level, causing llvm to do more analysis. MIR is also still relatively new and I think there is still work to be done doing optimizations in it to get less data sent to llvm.
There's something satisfying about seeing code get cleaned up and optimized. I also enjoyed following the LibreOffice commits back when they were in their "heavy cleanup" phase after it became clear OpenOffice was dead (which meant they didn't have to worry about diverging from the upstream anymore).
This is a supremely surprising conclusion, especially in 2020. Is instruction count really still tied to wall-clock time? I would have thought that some instructions could be slower than others (especially on x86), so that using several faster instructions could beat one slower instruction. Similarly, cache effects and data dependencies can result in more instructions being faster than fewer instructions.
I think what the author is trying to say is that when evaluating micro-optimizations, cycle counts are pretty valuable still because you're making a small intentional change & evaluating its impact & usually the correlation holds. The dashboard clearly still measures wall-clock since just comparing instruction count over time would be misleading.
I'm curious if the Rust team has evaluated stabilizer to be more robust about the optimizations they choose: https://emeryberger.com/research/stabilizer/
IMO compiler speed still remains the main ergonomics hurdle in developing Rust software.
If there are any smart Rust-using companies, they should definitely hire nnethercote to continue their excellent work!
I would have loved these blog posts regardless of what code was actually being optimised.
They offer a fascinating glimpse into a workflow that requires expertise, experimentation and creativity.
Sadly something that most developers can't engage in very often, due to the nature of their work or time constraints.
> Due to recent changes at Mozilla my time working on the Rust compiler is drawing to a close.
This sort of statement makes me a bit worried though. I don't mean to echo what a lot of the community has said over the past month, but I really hope that development on Rust doesn't stagnate because of the layoffs.
That's why I started the paragraph with "Contrary to what you might expect".
As for Stabilizer: "Stabilizer eliminates measurement bias by comprehensively and repeatedly randomizing the placement of functions, stack frames, and heap objects in memory." Those placements can affect cycle counts and wall times a lot, but don't affect instruction counts.
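A toy illustration of that distinction, under stated assumptions: the explicit work counter below stands in for an instruction count (which in practice would come from a tool such as Cachegrind or `perf`), and the two `sum_squares` variants and the `measure` helper are hypothetical. The counter is identical on every run, while the wall times for the same work vary from run to run.

```python
import time


def sum_squares(n: int, counter: list) -> int:
    """Baseline: counter[0] tallies loop iterations as a proxy for instructions."""
    total = 0
    for i in range(n):
        counter[0] += 1
        total += i * i
    return total


def sum_squares_fast(n: int, counter: list) -> int:
    """Micro-optimized variant: closed form, so a single unit of work."""
    counter[0] += 1
    return (n - 1) * n * (2 * n - 1) // 6


def measure(fn, n: int, runs: int = 5):
    """Return (set of distinct work counts, list of wall times) over several runs."""
    counts, times = set(), []
    for _ in range(runs):
        counter = [0]
        start = time.perf_counter()
        fn(n, counter)
        times.append(time.perf_counter() - start)
        counts.add(counter[0])
    return counts, times


if __name__ == "__main__":
    for fn in (sum_squares, sum_squares_fast):
        counts, times = measure(fn, 100_000)
        print(fn.__name__, "work counts:", counts, "wall times:", times)
```

Because the count has no run-to-run noise, even a small change between two almost-identical versions shows up clearly. The wall times are still what ultimately matters, which is why both are worth recording.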
Also, is there any work to multi-thread the Rust compiler at a more fine-grained level, like the recent GCC work? I know you allude to the fact that this could make the instruction counts less reliable, so I'm wondering if that's something being explored.
Finally, while I have you, I'm wondering if there's been any exploration of the idea of keeping track of information across builds so that incremental compilation is faster (i.e. only bother recompiling/relinking the parts of the code impacted by a code change). I've always thought that should almost completely eliminate compilation/linking times (at least for debug builds where full utmost optimization is less important).
> We see that something external and orthogonal to the program, i.e., changing the size (in bytes) of an unused environment variable, can dramatically (frequently by about 33% and once by almost 300%) change the performance of our program. This phenomenon occurs because the UNIX environment is loaded into memory before the call stack. Thus, changing the UNIX environment size changes the location of the call stack which in turn affects the alignment of local variables in various hardware structures.
From https://www.inf.usi.ch/faculty/hauswirth/publications/asplos....
And yes. I'm aware of that result because of Professor Berger's talks on Coz & the other work he's done in this space.
There are some fun cases where that is definitely true, to wit pdep / pext on Zen-based architectures. https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-rep...
https://twitter.com/uops_info/status/1202950247900684290
> I just ran some tests: the performance seems to depend heavily on the value in the last operand; this is also the case for the register variants. If the last operand is set to -1 (i.e., all bits are 1), the instr. has 518 uops and needs more than 289 cycles!
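For readers unfamiliar with these instructions, here is a bit-by-bit reference model of what `pdep` and `pext` compute. It models the architectural semantics only. An implementation that loops once per set mask bit would naturally get slower as more mask bits are set, which is consistent with the quoted measurement, but how Zen's microcode actually works is an assumption here, not something the thread establishes.

```python
def pdep(src: int, mask: int, width: int = 64) -> int:
    """Deposit the low bits of src into the positions of the set bits in mask."""
    result, k = 0, 0
    for i in range(width):
        if (mask >> i) & 1:
            # The k-th low bit of src goes to the k-th set position of mask.
            if (src >> k) & 1:
                result |= 1 << i
            k += 1
    return result


def pext(src: int, mask: int, width: int = 64) -> int:
    """Extract the bits of src selected by mask and pack them into the low bits."""
    result, k = 0, 0
    for i in range(width):
        if (mask >> i) & 1:
            # The bit of src at the k-th set position of mask becomes bit k.
            if (src >> i) & 1:
                result |= 1 << k
            k += 1
    return result


if __name__ == "__main__":
    print(bin(pdep(0b101, 0b11010)))
    print(bin(pext(0b10010, 0b11010)))
```

With a mask of all ones, every one of the 64 positions takes the inner branch, which is the worst case for a per-bit loop like this one.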
Ahh, very well said!
Rust is an up-and-coming language.
I'd also like to see the Rust ecosystem prosper, but I guess others can take up the slack. It is gaining considerable momentum, and quite a few places are looking into it or using it already. If that isn't possible, is there much hope for it anyway?
Rust translation units do not have the header file problem (so the frontend does less work), and they are also much larger in terms of definitions, often spanning multiple files, so there is more for the backend to do per translation unit.
The consequence is that Rust spends more time on LLVM relatively speaking than C and C++.
The solution to this problem in Rust is naively simple: write smaller translation units.
Rust programmers just want to structure their code however they want, and still have good compile times. That is kind of the opposite of how C and C++ programmers typically structure their code in the largest projects, because they value faster compile times over that kind of "ergonomics"/code organization.
Orthogonal to your point:
Also, we found that aggressively marking functions __attribute__((always_inline)) was blowing up compile times (reoptimizing the same code again and again).
Finally, expansions of function-like macros containing GNU C statement expressions could cause the preprocessed source to bloat very quickly (megabytes of input, IIRC).
The "always inline" problem is smaller in Rust, because it does Thin LTO by default, compiles partially to bitcode that can be partially inlined if necessary, etc. So essentially the Rust toolchain is a bit better at inlining when it's profitable than C and C++ toolchains are "by default" (you can tune C and C++ toolchains like clang to be as good as Rust).
In C and C++, macros are generally frowned upon, not because of these compile-time issues, but rather because they are dangerous, tricky, and non-hygienic, because powerful usage requires complex patterns, because of their interaction with header files and PCHs, etc.
Rust macros are awesome, super useful, widely used, etc. So people end up using them a lot, and this is by design, not a flaw of the language. This ends up resulting in a lot of duplicated code being expanded, and that leads to the problem that you mention, but much worse, because there are just many more macros in Rust.
This kind of applies to templates / traits as well. There are many C++ programmers who don't write templates, but all Rust programmers use generics because they are great. Rust can pre-compile generics, so the cost of this is not as bad as in C++, where the same generic might be instantiated by multiple TUs. But still, just by the fact that they are more widely used, the size of the problem grows.
There is an experimental parallel rustc front-end, e.g. see https://internals.rust-lang.org/t/help-test-parallel-rustc/1...
> any exploration of the idea of keeping track of information across builds so that incremental compilation is faster
That's exactly what incremental compilation does.
http://smallcultfollowing.com/babysteps/blog/2019/01/29/sals...
http://smallcultfollowing.com/babysteps/blog/2020/04/09/libr...
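The core idea in the question, remembering results across builds and redoing work only for inputs that changed, can be sketched in a few lines. This is a toy illustration only: rustc's incremental compilation tracks a fine-grained dependency graph of queries rather than whole files, and `fingerprint`, `compile_unit`, `build`, and the cache layout here are hypothetical.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Stable content hash used to detect whether an input changed."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def compile_unit(name: str, source: str) -> str:
    """Hypothetical stand-in for the expensive per-unit compile step."""
    return f"object({name}, {len(source)} bytes)"


def build(sources: dict, cache: dict):
    """Recompile only units whose fingerprint differs from the cached one.

    `cache` maps unit name -> (fingerprint, artifact) and persists across builds.
    Returns (artifacts, names of units that were actually recompiled).
    """
    artifacts, recompiled = {}, []
    for name, source in sources.items():
        fp = fingerprint(source)
        cached = cache.get(name)
        if cached is None or cached[0] != fp:
            cache[name] = (fp, compile_unit(name, source))
            recompiled.append(name)
        artifacts[name] = cache[name][1]
    # Drop cache entries for units that no longer exist.
    for stale in set(cache) - set(sources):
        del cache[stale]
    return artifacts, recompiled


if __name__ == "__main__":
    cache = {}
    srcs = {"a.rs": "fn a() {}", "b.rs": "fn b() {}"}
    print(build(srcs, cache)[1])  # first build compiles everything
    srcs["b.rs"] = "fn b() { a() }"
    print(build(srcs, cache)[1])  # second build recompiles only the changed unit
```

This sketch deliberately ignores dependencies between units. A real system must also invalidate everything that depends on a changed unit, which is where most of the complexity lies.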
> The only chance I see for Firefox is doubling down on that and replacing more and more components with those Rust rewrites
If Firefox has no more Rust developers now, it can't "double down" on replacing components, and all software is prone to bitrot, so after a few years, it will be an unmaintainable, buggy mess.
Mozilla did lay off most of its employees that were working directly on the Rust language and its implementation. This was a handful of people.
Mozilla also laid off some employees that were using Rust, such as the Servo team.
But Mozilla still has plenty of employees that know and use Rust, both in Firefox (e.g. the WebRender team), and in code relating to Firefox such as services. This is a much larger number of people.
My understanding is that there are still plans for more Rust in Firefox.