Back-end parallelism in the Rust compiler

Back-end parallelism in the Rust compiler(nnethercote.github.io)

174 points by edmorley 2 years ago | 16 comments

twic 2 years ago |

> Setting codegen-units to 1 gives even better code quality than thin local LTO, but takes longer. Some authors of binary rust crates always use that setting for release builds because they are willing to accept the extra compile times for the highest code quality.

I wonder how many of the authors who don't use it do so because they don't know about it.

Personally, i think release builds should use codegen-units=1 by default. The meaning of a release build is "take as long as needed to build the fastest possible code". Users shouldn't have to keep track of some set of additional settings needed to achieve that. If authors want to sacrifice performance to get faster builds, or have experimentally confirmed that there is no performance impact, then they can still set codegen-units to something else.

kevincox 2 years ago | |

> take as long as needed to build the fastest possible code

I don't think this is true. You can always go slower and produce better code.

However in practice running the compiler at the max optimization setting is very common, so maybe there is appetite for more aggressive settings at the lower cost. I do think it makes sense to add this in some preset "profile" as it doesn't have any downsides besides time. I wonder if it makes sense to do something like gzip does, where compression level 9 is made available but is very rarely worth the time trade off. Why don't compiler settings frequently go past the point where it is reasonable for most people?

KMag 2 years ago | | |

Allowing higher compression levels generally has a minimal impart on the maintenance cost of the compression code, so there's essentially only a cost to the user for allowing arbitrarily high compression level settings. Higher optimization levels in a compiler generally greatly add to the code complexity, and thus impose a tax on the developers, not just a cost to the users.

For compression settings, a higher number is usually just setting limits on the search space, and perhaps setting upper limits on the amount of memory required. Setting higher compression levels rarely executes lots of extra code.

For compilers, higher levels typically enable entirely new optimizations. That means lots of extra writing of optimizations that will almost never get used, are often tough to debug (particularly their interaction with other optimizations), and increase the maintenance cost of the compiler.

eklitzke 2 years ago | | |

Realistically there is only so much optimization a compiler can do without having execution profiles of code. There is a HUGE amount of optimization a compiler can do if it knows how likely various branches are to be taken. For example, consider a simple switch statement. There are basically two ways to do code generation: you can use a jump table, or the case can be rewritten as if it's a series of if/else statements (i.e. by generating a cascade of compare/jump instructions). And in the latter case, what order the case statements are placed by the compiler will make a huge difference in performance, as you want the most likely case statements to come first. Note that I'm just mentioning switch statements here as they're easy to understand, this type of problem is ubiquitous when it comes to code generation decisions made by the compiler.

Back to our example: if the compiler doesn't have any information about how likely different cases are then it can't actually know what is going to be fastest. So your compiler just uses some heuristics to guess and pick something that is reasonable (but is often not optimal). This problem arises all over the place in optimization passes: the compiler could generate code X or code Y, but if it's just looking at your source code it doesn't know for sure which is going to be faster, so it just chooses something based on an arbitrary heuristic. If strategy X performs better than strategy Y 90% of the time, then the compiler is still making the wrong decision 10% of the time.

This is the fundamental reason why having an extra slow "max optimization" mode doesn't work. The reason the compiler doesn't generate better code isn't necessarily because the compiler doesn't have enough time to optimize things, it's because it's working with limited information. In my experience (from C++), the compilation times with clang vary only a small amount when compiling with -O1 vs -O2 vs -O3 (maybe 5%). And in fact in clang there are many optimization passes that are not enabled by default at -O3. The reason they aren't enabled is because they usually have dubious benefit (i.e. equally as likely to improve performance as to worsen performance).

I'm not a Rust programmer, but at least in C++ land the first step to generating optimized binaries should be using -O3 and microarch flags (e.g. -march), and the next most important thing is using FDO to optimize builds. Using FDO to optimize builds will have a much bigger impact on performance than fiddling with other optimization flags. FDO works by collecting branch profiles from compiled binaries, and then feeding that information back into the compiler so it can make better code generation and layout decisions. You can gather FDO profiles on any binary (on Linux you just need to collect branch events using perf), so if you really want better optimization that's my recommendation.

ashdnazg 2 years ago |

Interesting article!

Regarding the low core-count, you might actually lose performance when having less CGUs, since your absolute estimation error is larger.

For example if a CGU can take 3 times as long to compile as another with the same estimate, we might end up with a 0.75:0.25 split if we have 2 CGUs, ending up with a 50% increase in compilation time, but with 16 CGUs the worst case (one CGU taking a 1/6 of the time and 15 CGUs taking 1/18 of the time) will only result in ~11% increase in compilation time, due to the automatic balancing that job scheduling gives us.

kevincox 2 years ago | |

I had the same thought. It may be interesting to try producing smaller CGUs then processing them on a fixed-size thread pool. This should help even out the misses. Especially if you keep a few smaller CGUs for the end, they should be "free" to give to threads that would otherwise be finished compared to merging them into a different CGU which may happen to be under-estimated and take longer.

vlovich123 2 years ago |

Interesting articles. For the size estimation, it might be interesting to measure what a perfect estimate would give you. Because in theory that estimate could be done based on builds in a trusted environment and saved into Cargo, especially for large popular crates.

I’m also curious if it might be possible to pipeline the CGU stage so that you feed the data from rustc to LLVM incrementally as you lower MIR into LLVM IR without having the entire CGU upfront. That may help reduce total processing time by hiding brief IO bubbles that take up a lot of time in aggregate (yes it’s CPU bound, but there’s still going to be implicit IO like memory) and also reduce total peak memory usage for hopefully obvious reasons. To gain full benefit though you might need the entire thing to be pipelined all the way through (from generating the MIR to lowering to LLVM IR) and that may be at odds with things like optimized builds which need to do analysis from a more global perspective (which would similarly inhibit the peak memory usage gains).

clarkmoody 2 years ago |

An important part of the scientific process is reporting on failures. It would be great to see more of this sort of informed exploration, especially when good-sounding ideas produce surprisingly bad results.

llogiq 2 years ago |

Nick's articles are always a delight to read, and this one is no exception. Openly discussing failure is one part, but coming up with so many ideas and having the gumption to actually try them out on a compiler is really fascinating.

blovescoffee 2 years ago |

On your/the author's point about estimation of execution time, how valuable is that really? I ask because it seems like something a neural network approach could approximate. It should be relatively easy to generate a large dataset - just compile lots of libraries. Learning a function to approximate that in a small NN with a short inference time seems possible. If it's really one of the keys to getting compile time down, it's worth a shot.

mhh__ 2 years ago | |

Neural generation of profile guidance Data is already a thing.

sfink 2 years ago |

> Throwing machine learning at the estimation problem is a possibility, but I fear that it would still be difficult to come up with an estimation function that is (a) comprehensible and (b) accurate across many different machines.

Nick: you know me, I love to overcomplicate things.

I don't know if you would consider this to be machine learning, but the size estimation seems like a good fit[1] for multiple linear regression. You have some independent variables (number of MIR nodes, number of basic blocks) and an error in a dependent measure you care about (time to compile), so it seems worth running through a linear regression model to get rough weights for your independent variables. (eg 7 × number of BBs + number of MIR nodes or whatever.) Sure, there's the risk of overfitting to your machine and number of cores / amount of RAM / whatever. But if the size estimation is really what's throwing off your other experiments, it seems worth spending some time to come up with other possible inputs (independent variables) and cutting that error down.

You might consider using something other than squared error for the regression, since at this point you know a fair bit about the effects of different errors.

If you happen to know that the compiler does stuff that is quadratic in number of basic blocks, you might even want to consider a nonlinear term like the square of the number of basic blocks. (So instead of `time ~ nodes + BBs`, you'd have `time ~ nodes + BBs**2` or `time ~ nodes + BBs + BBs**2` but still just do linear regression.) The square of an independent variable could also be seen as a way of modeling "this factor matters, but only when it gets big". You can play with other powers like square roots, which are more like "the value of this factor matters, but large values aren't that different from each other".

But the more of those games you play, the more you're going to overfit, so you probably ought to extract weights from a subset of your data and withhold the rest for testing. It'll still overfit (to your computer, for example), but at least it'll be better.

Last thought: you've broken down compilation time into its different phases before, right? I'm thinking of things like register allocation. Is there an easy-to-compute independent variable related to that, like maybe "number of MIR instructions in regions of the control flow graph that have more than 4 live variables" or something? Probably not, that's probably only visible later. Number of loops / back edges in the AST? I don't know enough about how this stuff works.

[1] Pun not intended. Until I made it.

CrendKing 2 years ago |

> The staircase shape formed by the left-hand side of the LLVM threads is because the rustc thread does the MIR-to-LLVM-IR conversion one CGU at a time, and the LLVM thread for a CGU cannot be spawned until that conversion is complete.

Is there a particular reason why this can't be parallelized?

DamonHD 2 years ago |

Very good! This is exactly the kind or mire that makes parallelising / distributing workloads interesting to me!