Back-end parallelism in the Rust compiler(nnethercote.github.io) |
Back-end parallelism in the Rust compiler(nnethercote.github.io) |
I wonder how many of the authors who don't use it do so because they don't know about it.
Personally, i think release builds should use codegen-units=1 by default. The meaning of a release build is "take as long as needed to build the fastest possible code". Users shouldn't have to keep track of some set of additional settings needed to achieve that. If authors want to sacrifice performance to get faster builds, or have experimentally confirmed that there is no performance impact, then they can still set codegen-units to something else.
I don't think this is true. You can always go slower and produce better code.
However in practice running the compiler at the max optimization setting is very common, so maybe there is appetite for more aggressive settings at the lower cost. I do think it makes sense to add this in some preset "profile" as it doesn't have any downsides besides time. I wonder if it makes sense to do something like gzip does, where compression level 9 is made available but is very rarely worth the time trade off. Why don't compiler settings frequently go past the point where it is reasonable for most people?
For compression settings, a higher number is usually just setting limits on the search space, and perhaps setting upper limits on the amount of memory required. Setting higher compression levels rarely executes lots of extra code.
For compilers, higher levels typically enable entirely new optimizations. That means lots of extra writing of optimizations that will almost never get used, are often tough to debug (particularly their interaction with other optimizations), and increase the maintenance cost of the compiler.
Back to our example: if the compiler doesn't have any information about how likely different cases are then it can't actually know what is going to be fastest. So your compiler just uses some heuristics to guess and pick something that is reasonable (but is often not optimal). This problem arises all over the place in optimization passes: the compiler could generate code X or code Y, but if it's just looking at your source code it doesn't know for sure which is going to be faster, so it just chooses something based on an arbitrary heuristic. If strategy X performs better than strategy Y 90% of the time, then the compiler is still making the wrong decision 10% of the time.
This is the fundamental reason why having an extra slow "max optimization" mode doesn't work. The reason the compiler doesn't generate better code isn't necessarily because the compiler doesn't have enough time to optimize things, it's because it's working with limited information. In my experience (from C++), the compilation times with clang vary only a small amount when compiling with -O1 vs -O2 vs -O3 (maybe 5%). And in fact in clang there are many optimization passes that are not enabled by default at -O3. The reason they aren't enabled is because they usually have dubious benefit (i.e. equally as likely to improve performance as to worsen performance).
I'm not a Rust programmer, but at least in C++ land the first step to generating optimized binaries should be using -O3 and microarch flags (e.g. -march), and the next most important thing is using FDO to optimize builds. Using FDO to optimize builds will have a much bigger impact on performance than fiddling with other optimization flags. FDO works by collecting branch profiles from compiled binaries, and then feeding that information back into the compiler so it can make better code generation and layout decisions. You can gather FDO profiles on any binary (on Linux you just need to collect branch events using perf), so if you really want better optimization that's my recommendation.
Regarding the low core-count, you might actually lose performance when having less CGUs, since your absolute estimation error is larger.
For example if a CGU can take 3 times as long to compile as another with the same estimate, we might end up with a 0.75:0.25 split if we have 2 CGUs, ending up with a 50% increase in compilation time, but with 16 CGUs the worst case (one CGU taking a 1/6 of the time and 15 CGUs taking 1/18 of the time) will only result in ~11% increase in compilation time, due to the automatic balancing that job scheduling gives us.
I’m also curious if it might be possible to pipeline the CGU stage so that you feed the data from rustc to LLVM incrementally as you lower MIR into LLVM IR without having the entire CGU upfront. That may help reduce total processing time by hiding brief IO bubbles that take up a lot of time in aggregate (yes it’s CPU bound, but there’s still going to be implicit IO like memory) and also reduce total peak memory usage for hopefully obvious reasons. To gain full benefit though you might need the entire thing to be pipelined all the way through (from generating the MIR to lowering to LLVM IR) and that may be at odds with things like optimized builds which need to do analysis from a more global perspective (which would similarly inhibit the peak memory usage gains).
Nick: you know me, I love to overcomplicate things.
I don't know if you would consider this to be machine learning, but the size estimation seems like a good fit[1] for multiple linear regression. You have some independent variables (number of MIR nodes, number of basic blocks) and an error in a dependent measure you care about (time to compile), so it seems worth running through a linear regression model to get rough weights for your independent variables. (eg 7 × number of BBs + number of MIR nodes or whatever.) Sure, there's the risk of overfitting to your machine and number of cores / amount of RAM / whatever. But if the size estimation is really what's throwing off your other experiments, it seems worth spending some time to come up with other possible inputs (independent variables) and cutting that error down.
You might consider using something other than squared error for the regression, since at this point you know a fair bit about the effects of different errors.
If you happen to know that the compiler does stuff that is quadratic in number of basic blocks, you might even want to consider a nonlinear term like the square of the number of basic blocks. (So instead of `time ~ nodes + BBs`, you'd have `time ~ nodes + BBs**2` or `time ~ nodes + BBs + BBs**2` but still just do linear regression.) The square of an independent variable could also be seen as a way of modeling "this factor matters, but only when it gets big". You can play with other powers like square roots, which are more like "the value of this factor matters, but large values aren't that different from each other".
But the more of those games you play, the more you're going to overfit, so you probably ought to extract weights from a subset of your data and withhold the rest for testing. It'll still overfit (to your computer, for example), but at least it'll be better.
Last thought: you've broken down compilation time into its different phases before, right? I'm thinking of things like register allocation. Is there an easy-to-compute independent variable related to that, like maybe "number of MIR instructions in regions of the control flow graph that have more than 4 live variables" or something? Probably not, that's probably only visible later. Number of loops / back edges in the AST? I don't know enough about how this stuff works.
[1] Pun not intended. Until I made it.
Is there a particular reason why this can't be parallelized?