Faster CRC32-C on x86(corsix.org) |
Faster CRC32-C on x86(corsix.org) |
clmul was designed as a more general crc32 accelerator for the SIMD instructions set. (It is sufficiently general to also do Reed Solomon and Elliptical Curves. ARM has an equivalent pmul instruction if you are curious).
The traditional methods are to either use crc32 instruction, or the CLMUL instruction.
This blog post uses both instructions for maximum speed. Modern processors can execute different pipelines in parallel. So by placing CLMUL and crc32 instructions next to each other, you get the parallel execution with high efficiency.
It is tricky to calculate crc32 in parallel using two different instructions / methodologies interleaved. But this blog post accomplishes that.
My personal thoughts is that we should design a CPU where these kinds of pipelines / executions are more explicit, and then write magic compilers that can pull parallelism out of our programs to be in the more explicit parallelism form that this new CPU would prefer. You'd still be tied to an architecture, but moving to a new architecture (ie: 2x SIMD pipelines in the future) would be as easy as recompiling, in theory.
Then I realized that I've reinvented VLIW / Intel Itanium. And that's a silly, silly place and we probably shouldn't go there again :-p
--------
The MIMD (multiple-instruction multiple data) abilities of modern CPUs are quite amazing in any case, and its always fun to take advantage of it. Even with a singular instruction stream like in this example, it is obvious that modern CPUs have gross parallelism at the instruction level.
Its a bit of a shame that these high-performance toys we write are kind of unsustainable... requiring in depth assembly knowledge and microarchitecture-specific concepts to optimize (that often become obsolete as these designs inevitably change every 5 years or so). Then again, its probably a good idea to practice writing code at this level to remind us that the modern CPU is in fact a machine with defined performance characteristics that we can take advantage of...
Classic example: Intel Haswell chips can only do addition on one of their execution units, so you double your throughput by doing a fused multiply-add on the other with a multiplicand of 1.
Classic example: For large enough matrix multiplications you can load your data in such a way that the problem really is CPU-bound, even if you have to load from disk. Double up your data size with an integer representation of the matrix and do some integer-backed emulated floating point instructions for a chunk of the matrix and floating point for the other. You reduce the factor in the O(n^(2.7..3)) and add some extra O(n^2) work and some extra space (assuming you don't waste too many instructions on full IEEE compliance and don't need it for that application).
And so on. It's a fun trick, and often the compiler isn't able to execute it effectively. The same idea applies with all but the most trivial loops; until recently (last 10 years or so?), gcc wouldn't interleave cumulative sums to reduce the data dependencies (just keep 4 running totals instead of 1, skip by 4 each loop, and merge the sums and any stragglers at the end -- distinct from loop unrolling in that the major gain is from better pipelining rather than just reduced loop overhead -- it still works with vectorized accumulators). Not that it matters if you're IO-bound (which you often are on that sort of problem), but for mediumish datasets the idea still applies.
In principle you could probably use some kind of symbolic math solving library to try to detect optimization opportunities like this in general code. In practice it just wouldn't be worth it, even at -O3, because it would add a ton of CPU and memory overhead to compilation and the optimization would be very rarely applicable in the first place.
Instead of the programmer manually thinking of the "two independent pipelines", its quite possible that we can imagine a language where the two pipelines were programmed separately, and then a compiler merges them together knowing about the pipeline details (ex: Skylake is 1-per-clock throughput / 3-clock latency, AMD might be different like 1-per-clock throughput / 4-clock latency or something).
The programmer's job would be to "separate out the two independent threads of execution", while the compiler's job is to "merge them back together into one instruction stream".
Much like how SIMD-code is written today, the programmer is responsible for finding the parallelism. The compiler / machine is responsible for efficient execution.
--------
As it is, today's high speed programmers have to do both tasks simultaneously. We have a mental model of the internal registers, pipelines, throughput, latencies of different execution units, and manually schedule them to match our mental model. (But that mental model changes as new CPUs come out every few years).
The hard part is figuring out the parallelism. I don't think we have a language that describes this fine-grained parallelism though, in any case. Just a "what if we lived in a magically perfect world" kinda hypothetical here...
EDIT: Alternatively, you can "cut dependencies" and hope that the compiler discovers your (intended) low dependency chain. IE: Manually unroll loops and whatever, which works better in practice than you might think (and such manual unrolling often seems to trigger the AVX autovectorizer in my experience, if you're lucky).
A VLIW-style architecture is found on signal processors and other dedicated number-crunchers, where people are actually going to write custom assembly routines fully exploiting the hardware. For example, the original NEC uPD7720 from 1980 has separate fields in its instruction word. You get two ALU operations (where the operands don't depend on each other) and an address operation per instruction, all performed in parallel in one clock cycle.
Intel would later take that idea for the i860 (long before Itanium) and the i860 can run in a mode where it always fetches two instructions per cycle, one integer and one floating point. Code generation with a compiler for this turned out to be very difficult to exploit.
Another ten years later, compiler technology has improved a lot, right? Maybe it'll work now. Cue Itanium. Ah. Maybe not. The idea is probably going to be revisited every decade or so until compiler technology has actually advanced enough to pull it off :P
You downloaded the system source code and recompiled everything targeted to your exact CPU specs.
When a CPU is designed around that, everyone needs to recompile each generation for maximum performance. The theory is that bytecode like Java recompiles efficiently though (or perhaps NVidia PTX bytecode, a SIMD bytecode that recompiles each NVidia GPU generation for a higher performance example with more support from the hardware).
------
Intel Itanium was a 200x era design that was supposed to be like this, but x86-64 from AMD ended up being faster in practice.
NVidia's PTX really changed things, as well as the habit of GPU programmers for writing very, very small "programs" (called kernels) that's managed by a separate chip (IE: cpu manages the kernels, compiles / calls them as appropriate, etc. etc.). It works out in GPU land, but maybe will never work in CPU land (unless CPU-land picks up upon this kernel-invoke abstraction? Intel ispc for example has the model, as does OpenMP target offload... and thread-pools and Go arguably have it as well)
Interpreted languages work by consolidating all of the optimization effort in the interpreter. This is similar to how CPUs work now, instead of extremely specific optimizations that are hard to create distributed among all code we use very general optimizations that push the limits of mathematics that is centralized in a CPU.
-----
Itanium had a lot of contemporary issues that made it not work. I would certainly blame Intel's business practices and reputation for a large part of it. There are likely niches for such processors. The VLIW is useful for DSP or graphics. Indeed, the only extant VLIW (that I know of) processor is the Russian Elbrus. I think the VLIW is only included to let them reuse a lot of the core logic of the CPU to drive a DSP engine, useful for radar and scientific simulation, though the sci sim would probably use commercial hardware which would be faster.
It works on GPUs because they're doing DSP, basically. We could have weirder topologies for GPUs however, like a massive string of ALUs driven off an embedded core, so you try to kachunk all your data in a single clock domain after configuring the ALU string.
https://arxiv.org/pdf/1804.06826.pdf
PDF Page 14 (physical page 12) shows that every instruction on NVidia Volta SASS has 4-bits of "Reuse flags", 6-bits of "wait barrier mask", 3-bits of "Read Barrier Index", 3-bits of "Write Barrier Index", 1-bit "Yield Flag" and 4-bits "Stall Cycles".
> Wait barrier mask; Read/Write barrier index. While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. Volta, Pascal and Maxwell use dependency barriers to track the completion of variable-latency instructions and resolve data hazards. When a variablelatency instruction writes to a register, the assembler associates it to one of the 6 available barriers by setting the corresponding write barrier number field. When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that barrier in the wait barrier mask. The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the wait barrier mask is a bitmask, not an index.
--------
It seems like NVidia has invented something very interesting here. I don't quite understand it myself, but its quite possibly what you're talking about.
__EVERY__ instruction, on NVidia machines Volta and newer (and one control-every 3 instructions on Pascal, and 7 on older Kepler systems). So these "bundles" could be seen as "better-Itanium" back in Kepler/Pascal, but perhaps the modern NVidia GPU core is fast enough to have such pre-compiled dynamic behavior / barrier interpretation every instruction these days.
It seems like a lot of instruction-bandwidth to eat up though, since each instruction on Volta is 128-bits / 16-bytes long since there's so much control information