Faster CRC32-C on x86

107 points by rostayob 3 years ago | 21 comments

In short: there are three 32-bit crc32 pipelines on modern Intel CPUs, but also a clmul (carry-less multiplication) instruction in AVX on a separate pipeline.

clmul was designed as a more general crc32 accelerator for the SIMD instruction set. (It is sufficiently general to also do Reed-Solomon and elliptic curves. ARM has an equivalent pmul instruction, if you are curious.)

The traditional methods are to use either the crc32 instruction or the CLMUL instruction.

This blog post uses both instructions for maximum speed. Modern processors can execute different pipelines in parallel, so by placing CLMUL and crc32 instructions next to each other, you get parallel execution with high efficiency.

It is tricky to calculate crc32 in parallel using two different instructions / methodologies interleaved. But this blog post accomplishes that.
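
A minimal sketch of why the work can be split at all, assuming a slow bit-at-a-time software CRC32-C in place of the crc32 and CLMUL instructions (this is not the blog post's code, and the function names are made up for illustration). The CRC state update is linear over GF(2), so the second chunk can be processed independently from a zero state and XORed into the first chunk's state once that state has been advanced past the second chunk's length:

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u /* reflected Castagnoli polynomial */

/* Advance the raw CRC state by one input byte (no initial/final inversion). */
static uint32_t crc32c_step(uint32_t state, uint8_t byte)
{
    state ^= byte;
    for (int bit = 0; bit < 8; bit++)
        state = (state >> 1) ^ (CRC32C_POLY & (0u - (state & 1u)));
    return state;
}

/* Reference: one dependent chain over the whole buffer. */
uint32_t crc32c_serial(const uint8_t *data, size_t len)
{
    uint32_t state = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        state = crc32c_step(state, data[i]);
    return state ^ 0xFFFFFFFFu;
}

/* Hypothetical split version: two chains that do not depend on each other. */
uint32_t crc32c_split(const uint8_t *data, size_t len)
{
    size_t half = len / 2;
    size_t rest = len - half;

    /* Chain A: first chunk, started from the usual all-ones state. */
    uint32_t a = 0xFFFFFFFFu;
    for (size_t i = 0; i < half; i++)
        a = crc32c_step(a, data[i]);

    /* Chain B: second chunk, started from zero, independent of chain A. */
    uint32_t b = 0;
    for (size_t i = 0; i < rest; i++)
        b = crc32c_step(b, data[half + i]);

    /* Advance chain A past the second chunk by feeding `rest` zero bytes. */
    for (size_t i = 0; i < rest; i++)
        a = crc32c_step(a, 0);

    /* By linearity, the two partial states combine with XOR. */
    return (a ^ b) ^ 0xFFFFFFFFu;
}
```

Here the advance step feeds zero bytes one at a time, which is simple but slow; fast implementations typically replace it with a carry-less multiplication by a precomputed constant.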

corsix 3 years ago | |

The math is fiddly to get right, but (as the author) I'd suggest that the disadvantage is very tight coupling to the CPU implementation: the interleaving is based on the relative speeds of the two methodologies, so if the relative speeds of the two methodologies change drastically on a future CPU implementation, this _particular_ interleaving could end up _slower_ than either methodology on its own.

dragontamer 3 years ago | | |

Indeed.

My personal thought is that we should design a CPU where these kinds of pipelines / executions are more explicit, and then write magic compilers that can pull parallelism out of our programs into the more explicit parallel form that this new CPU would prefer. You'd still be tied to an architecture, but moving to a new architecture (i.e. 2x SIMD pipelines in the future) would be as easy as recompiling, in theory.

Then I realized that I've reinvented VLIW / Intel Itanium. And that's a silly, silly place and we probably shouldn't go there again :-p

--------

The MIMD (multiple-instruction, multiple-data) abilities of modern CPUs are quite amazing in any case, and it's always fun to take advantage of them. Even with a single instruction stream like in this example, it is obvious that modern CPUs have gross parallelism at the instruction level.

It's a bit of a shame that these high-performance toys we write are kind of unsustainable... requiring in-depth assembly knowledge and microarchitecture-specific concepts to optimize (that often become obsolete as these designs inevitably change every 5 years or so). Then again, it's probably a good idea to practice writing code at this level to remind us that the modern CPU is in fact a machine with defined performance characteristics that we can take advantage of...

hansvm 3 years ago | |

This sort of thinking can unlock up to roughly a factor of two in a lot of architectures and a lot of operations (depending on cache friendliness, instruction decoding, and other such things that might interfere with apparent wins).

Classic example: Intel Haswell chips can only do floating-point addition on one of their execution units, so you double your throughput by doing a fused multiply-add on the other with a multiplicand of 1.

Classic example: For large enough matrix multiplications you can load your data in such a way that the problem really is CPU-bound, even if you have to load from disk. Double up your data size with an integer representation of the matrix and do some integer-backed emulated floating point instructions for a chunk of the matrix and floating point for the other. You reduce the factor in the O(n^(2.7..3)) and add some extra O(n^2) work and some extra space (assuming you don't waste too many instructions on full IEEE compliance and don't need it for that application).

And so on. It's a fun trick, and often the compiler isn't able to execute it effectively. The same idea applies with all but the most trivial loops; until recently (last 10 years or so?), gcc wouldn't interleave cumulative sums to reduce the data dependencies (just keep 4 running totals instead of 1, skip by 4 each loop, and merge the sums and any stragglers at the end -- distinct from loop unrolling in that the major gain is from better pipelining rather than just reduced loop overhead -- it still works with vectorized accumulators). Not that it matters if you're IO-bound (which you often are on that sort of problem), but for mediumish datasets the idea still applies.
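
The running-totals trick described above can be sketched as follows. This is an illustrative version only: the function name is made up, and because it changes the order of the floating-point additions, the result can differ from a plain left-to-right sum in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Sum `count` doubles using four independent running totals, so that
   consecutive additions do not have to wait on one another. */
double sum_interleaved(const double *values, size_t count)
{
    assert(values != NULL || count == 0);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: skip by 4, one element per accumulator. */
    for (; i + 4 <= count; i += 4) {
        s0 += values[i];
        s1 += values[i + 1];
        s2 += values[i + 2];
        s3 += values[i + 3];
    }

    /* Merge the four totals, then add any stragglers. */
    double total = (s0 + s1) + (s2 + s3);
    for (; i < count; i++)
        total += values[i];

    return total;
}
```

For floating-point data, a compiler generally cannot make this transformation on its own unless it is allowed to reassociate additions (for example via gcc's -ffast-math), since the reordering is not guaranteed to give bit-identical results.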

323 3 years ago | |

This raises the question: how many 32-bit crc32 pipelines do modern AMD CPUs have?

corsix 3 years ago | | |

https://uops.info/html-instr/CRC32_R64_R64.html answers that for you. Zen2 and Zen3 same as Intel: latency 3, throughput 1. Older AMD chips less good.

nynx 3 years ago | |

I think it’s that there’s one crc32 pipeline, but it can have up to three ops running through it at once, at different stages. I could be misreading the article though!

MrFuchs 3 years ago | | |

Indeed, only one port supports the corresponding instruction on Intel CPUs.

jscipione 3 years ago |

Is there a license associated with this code that would let me include it in an MIT-licensed open source project? There are a couple of file system drivers that could use a faster crc32 check.