In the 2 additions version, computation of the next iteration depends on the results of the preceding iteration. In the multiplication version, the computations are independent for each iteration, enabling parallel execution (by SIMD and/or pipelined/superscalar execution).
but im to lazy to test that
The post compares (1) loop code that can be vectorized, as loop rounds are independent and do not depend on the result from the previous round, and (2) an "optimization" that makes calculations shorter, but also makes each loop round depend on the result of the previous round, so this cannot be vectorized.
The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.
(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)
Complexity isn’t always obvious. It’s sometimes common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles) whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.
We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)
MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.
I had hoped that functional programming languages would lead us to automatic high-level parallelization. That hasn't happened much in the real world. But what we got is "write code like a functional programmer would even in a low-level language and the compiler and the ISA will parallelize it for you." That surprises me, but I'll take it.
How is MLIR different from LLVM IR?
And tile/block processing is not that novel for generic compute, graphics use it for other reasons - mostly how the pixel data is in memory and how the pipeline scales with tile size.
Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years go. The compiler spit out a DIV [EDI] instruction which was inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.
So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.
And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.
x86 is fundamentally a CISC; if you treat it like a RISC, it will definitely disappoint.
(Submitted title was "Multiplications and 2 additions are faster than 2 additions")
For the first version w/ AVX I get:
$ perf stat ./a.out [-] Took: 225634 ns.
Performance counter stats for './a.out':
247.34 msec task-clock:u # 0.998 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,009 page-faults:u # 8.122 K/sec
960,495,151 cycles:u # 3.883 GHz
2,125,347,630 instructions:u # 2.21 insn per cycle
62,572,806 branches:u # 252.982 M/sec
3,072 branch-misses:u # 0.00% of all branches
4,764,794,900 slots:u # 19.264 G/sec
2,298,312,834 topdown-retiring:u # 48.2% retiring
37,370,940 topdown-bad-spec:u # 0.8% bad speculation
186,854,701 topdown-fe-bound:u # 3.9% frontend bound
2,242,256,423 topdown-be-bound:u # 47.1% backend bound
0.247734256 seconds time elapsed
0.241338000 seconds user
0.004943000 seconds sys
For the second version with SSE and the data dependency I get:$ perf stat ./a.out [-] Took: 955104 ns.
Performance counter stats for './a.out':
975.30 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,010 page-faults:u # 2.061 K/sec
4,031,519,362 cycles:u # 4.134 GHz
3,400,341,362 instructions:u # 0.84 insn per cycle
200,073,542 branches:u # 205.140 M/sec
3,192 branch-misses:u # 0.00% of all branches
20,091,613,665 slots:u # 20.600 G/sec
3,110,995,283 topdown-retiring:u # 15.5% retiring
236,371,925 topdown-bad-spec:u # 1.2% bad speculation
236,371,925 topdown-fe-bound:u # 1.2% frontend bound
16,546,034,782 topdown-be-bound:u # 82.2% backend bound
0.975762759 seconds time elapsed
0.967603000 seconds user
0.004937000 seconds sys
As you can see the first version gets nearly 3x better IPC (2.21 vs 0.84) and spends half as much time being backend bound.No, just because it's using SSE doesn't mean it's vectorized. SSE has both scalar and vector instructions.
SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.
-------
The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorizer it in that form. I kinda want to give it a go myself but don't have time today...
Autovectorizers are mysterious, often harder to see and use than explicitly SIMD code (like OpenCL or CUDA).
I feel like last 110-15 years (majority of) people have stopped thinking about specific CPU and only think about ISA. That works for a lot of workloads but in recent years I have observed that there is more and more interest in how specific CPU can execute code as efficiently as possible.
If you're interested in the kind of optimizations performed in the example you should check out polyhedral compilation (https://polyhedral.info/) and halide (https://halide-lang.org/). Both can be used to speed up certain workloads significantly.
double A4 = A+A+A+A;
double Z = 3A+B;
double Y1 = C;
double Y2 = A+B+C;
int i;
// ... setup unroll when LEN is odd...
for(i=0; i<LEN; i++) {
data[i] = Y1;
data[++i] = Y2;
Y1 += Z;
Y2 += Z;
Z += A4;
}
Probably not entirely functional as written, but you get the idea: unroll the loop so that the data dependent paths can each be done in parallel. For the machine being considered, a 4 step unroll should achieve maximum performance, but of course, you get all the fun things that come with hard-coding the architecture in your software.Just because the algebra works it doesn't guarantee that the code does, too.
E.g. if you use doubles for the range of 32bit integers even adding 1.0 2 billion times to 2 billion still ends up at 4 billion. Even adding 1.0 20 billion times to 20 billion ends up at 40 billion. Now adding 0.1 20 billion times to 2 billion ends up 1908 short on my CPU, i.e. about 19080 absorptions/rounding errors occured. You need some serious differences and amount of operations to actually trigger errors.
As a performance rule on modern processors: avoid using the result of a calculation as long as you reasonably can (in tight loops... You don't want to be out of cache.).
Have fun threading the needle!
The power consumption is a good question.
You'd need to parallelize it explicitly (which can be done by just unrolling the loop).
Second: why are so many people insisting that the loop is auto-vectorised? Is there any evidence to that? Data dependencies alone explain the observed performance delta. Auto-vectorization would have resulted in a higher speedup.
The suggested "slower" optimisation does fundamentally use less instructions. Chucking more hardware at parallelisable problems makes it run faster but does not necessarily reduce the power requirements much because there are fundamentally the same number of instructions, it's just gobbling up the same power over a shorter period of time - The "slower" serial algorithm uses less instructions and in theory less power in total. Disclaimer: I mean power vs speed fundamentally in theory, in practice there may be no measurable difference depending on the particular CPU and code.
And how far would an optimiser typically take its analysis? For example, if B was defined inside the loop as (non -const) A * 2 ? Or as A * f() , where f() always returns 2, maybe using a constant or maybe a calculation etc.
Seems like a very hard problem,
Is anyone aware of good tooling for automatically catching this sort of thing even some of the time?
And... the two codes runs with exactly the same speed, on M1 Mac, with go.
edit: of course, with go, you can manually parallelize the faster option with goroutines... but that does something else, doesn't it. (and it's 500x faster.)
(Super low energy processors, battery powered, etc?
This isn't actually taking advantage of speculative execution that much. The only speculation here would be in the predicting the loop repeats, which loop unrolling would mostly negate for CPUs that don't do speculative execution.
The data dependency issue, however, would still be a punishing factor. You'd need a CPU that isn't superscalar, which does exist but is increasingly less common (even 2014's Cortex-M7 was superscalar, although it kinda sounds like ARM backed off on that for later Cortex M's?)
Also many low-end / embedded CPUs that are in-order will still do branch prediction.
The general advice when I worked with CG was to use deltas in iterations.
I'm taking the definition of simple and complex (or complected) from this talk[0]. In short: Simple means individual pieces are standing on their own, not necessarily easy or minimal. Complex means individual pieces are braided together into a whole.
The first code is simpler than the second. Just have a look at the two solutions and you see what I mean: The individual instructions in the first stand on their own and clearly declare what they mean as well. The second example is more complex, because each line/subexpression cannot be understood in isolation, there is an implicit coupling that requires the programmer _and_ the machine code interpreter to understand the whole thing for it to make sense. The CPU apparently fails at that and cannot optimize the machine instructions into more efficient microcode.
The example illustrates performance benefits of simpler code, but there are others too:
For example a type checker might be able to infer things from simpler code, but not from complex code. Again, complexity is about coupling, which can for example arise from making assumptions about things that are out of bounds or logic that happens in some other part of your program, which _this_ part relies on.
Things like these can either be outright rejected by a type checker or simply ignored and sometimes we can provide additional ceremony to make it happy. But there is always an underlying question when these issues arise: Can I express this in a simpler way? It's sometimes possible to describe rich semantics with simpler, more primitive types and explicit coordination.
Another thing that comes to mind are rich ORMs. They can be very convenient for common cases. But they have footguns for the general case, because they _complect_ so many things, such as validation, storage, domain rules, caching etc. And they are leaky abstractions because we have to do certain things in certain ways to avoid bloated SQL, N+1 etc.
They are not simple (and certainly not declarative) so we program against a black box. There are simpler "ORMs" that I very much like, but they typically only provide a minimal set of orthogonal features, such as query builders, and mapping results into plain data structures. It is simpler to use these kind of orthogonal tools.
Last but not least: Simple code can be pulled apart and changed more easily without having to worry about introducing problems. The substitution rule or referential transparency enables this by guaranteeing that each expression can be replaced by the value it will evaluate to. This also implies that other expressions that evaluate to the same value can be substituted freely.
[0] Simple Made Easy - Rich Hickey at Strangeloop 2011 https://www.youtube.com/watch?v=LKtk3HCgTa8
EDIT: Solvable by a human. I dunno much about compilers though, so I dunno if there's some kind of language issue that would prevent the unrolling of this dependency.
Usually, for floating point operations the compiler simply has no chance to do anything clever. NaNs, infinities and signed zero mean that even the most "obvious" identities don't actually hold. For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.
But when operating on floating point numbers, == is not bitwise, and -0 == +0.
Check out data oriented design if you aren’t already familiar.
If the compiler says autovectorization failed, you rewrite the code until the autovectorizer works.
https://docs.microsoft.com/en-us/cpp/build/reference/qvec-re...
You can write code using explicit vectors and it’s not too bad, though less cross platform than you’d like.
An actual answer involves things like restrict and alignment keywords.
Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Or in other words: i+7 can depend on i-1 no problems.
There is also "parallel" adds (addpd).
Carefully look at the assembly language, the 1st version uses parallel adds (addpd) and parallel multiplies. The 2nd version uses scalar adds (addsd)
The other major point is that the 2nd version uses a singular move qword (64-bit) per loop iteration, while the 1st version is using the full 128-bit move per loop iteration.
---------
SSE is used for scalar double-precision these days, because scalar-SSE is faster than x87 instructions... and better matches the standards (x87 had "higher precision" than the IEEE specs, so it has different results compared to other computers. SSE is closer to the specs)
That's not the tldr. It's actually more wrong than right.
The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler, the more important part here is the CPU though.
Long explanation:
Let's start with the compiler. Yes, the first version can be vectorized and you can see that in the included screenshots of the assembly output. The use of addpd [0] (add 2 pairs of 64bit doubles in simultaneously) instead of addsd (add 1 pair of 64bit doubles). Both operations have identical throughput/latency on any architecture not too ancient (e.g. check information of the corresponding intrinsic here [3]. Unfortunately Agner instruction_tables doesn't include the ADDSD for most architectures.) So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. The multiplications and addition have the same throughput at least on Skylake. So we can use simple math and 5/2 is still more than 2!
Now to the CPU part. The loop dependency prohibits the compiler to properly utilize instruction pipelining [4] and now differences in throughput vs latency come into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, then decoding it, to executing the operation and eventually writing the result back to the register) and thus achieve higher throughput. However, if your next instruction depends on the instruction before the CPU has to finish all the steps of a pipeline from instruction start to finish before the next instruction can start. This is the latency of an instruction. For example mulsd/mulpd have 8 times higher latency than throughput on Intel Skylake.
So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.
Peter Cordes also correctly points this out in his comments and answer to the question on stackoverflow.
[1] https://www.felixcloutier.com/x86/addpd
[2] https://www.felixcloutier.com/x86/addsd
[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
While it might sound intuitive that SIMD instructions consume more power, I don't think that's necessarily true to a relevant degree in practice. My understanding is CPU power consumption is mostly tied to inefficiences that cause energy loss via heat, while the actual computation doesn't consume any energy per se. So electrons traveling a more complex path probably cause somehwat more energy loss as there is more wire/transistors to pass. But most of the total loss doesn't actually occure in the ALU. Empirically from what you can see operating systems do, the most effective way of consuming less power is actually running on a slower clock cycle and the most effective way to achieve that is getting work done faster and that's not tied to the number of instructions.
The Stackoverflow question here [1] seems to suggest that SIMD vs no SIMD has a neglectable overhead compared to entering a lower power state sooner.
[1] https://stackoverflow.com/questions/19722950/do-sse-instruct...
> SIMD vs no SIMD has a neglectable overhead compared to entering a lower power state sooner.
Yes on the same CPU as I suggested, real world difference may be unmeasurable. However note that this particular case is interesting because it's not comparing fewer serial multiplies to more SIMD multiplies, it's comparing SIMD multiplies to no multiplies but with a serial constraint due to variable dependence... i.e it's SIMD vs no multiply without any other difference in number or type of ALU ops... which again could make no difference on a big x86 in practice, but it would be interesting to know.
All of this changes if you are coding for a lower power device and have choice of hardware.
Even in this example, which apparently has 4x vector instructions vs 1x scalar, AVX-512 (and probably even AVX2) would reduce energy usage because they can do 8 (or 4 for AVX2) 64-bit elements per cycle.
Good point, decoding and scheduling are expensive and SIMD certainly eliminates a lot of that. However the alternative algorithm has even less decoding and scheduling to do, since it completely eliminates the multiplication operations without increasing the number of additions. But even then I wouldn't be surprised that it makes no difference to energy on any x86, as I said it was more of a fundamental observation - for a different application this is useful when selecting the actual hardware if you are not already constrained to a particular chip.
The whole point of this thread is realization of opposite. Slower algo executes only 2 instructions in a loop, but second one directly depends on the result of the first (induces pipeline stall) while the fast version brute forces a ton of instructions at full CPU IPC.
I.e. the CPU doesn't use much less power if most ALUs are idle, so you might as well use all of them to compute redundant results -> and then turn ~everything off sooner.
>> this is a perfect example of how simpler code should be preferred _by default_ over complex code.
Extra emphasis on *by default*.
You are both on the same side of the coin. This is the essence of "premature optimization is the root of all evil". Simplicity should be the default. Once you know your implementation is solid, then you can go back and optimize.
TFA is quite contrived, but imagine a much more complex process. You see some code:
for i in loop:
foo[i] = myfunc(foo[i], bar[i])
You have to go in and dig into why exactly foo is being fed back in at every step. But if instead you saw: for i in loop:
foo[i] = myfunc(bar[i])
You immediately know you have more options at your disposal to optimize that loop. Further, the compiler immediately knows it has more options. Maybe this is a good option for automatic unrolling or Duff's device. Maybe you wanna send this to a gpu shader, or use a macro to parallelize the loop. But it's a lot harder to get to that point if your starting point is the complected, stateful one.I recall that Scala was the first language that introduced FP to me, and they made the developer explicitly tell that a functional operation could be done in parallel, after which the compiler and runtime would take care of the rest.
This was implemented in the extreme in a project that allowed you to write a functional style series of operations, which were translated to Hadoop map and reduce jobs, each of which could be executed in parallel at large scale. So very compact and succinct code, but very powerful.
The relative novelty and paucity of functional languages in the ecosystem means the incentive story hasn't happened to get the end-to-end from compiler through to execution so tight yet. I expect it'll happen, and when it does I expect the result will trigger fewer questions like this one because a newer generation of programmers well be less inclined to assume serial execution.
[1] "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", Lattner and Adve. https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf
[2] “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation”, Lattner et al. https://ieeexplore.ieee.org/document/9370308
I asked around at the time and someone mentioned that I might have overtaxed certain execution ports or something like that, but yeah that just cemented my belief that x86 optimization is not my cup of tea anymore. Better to spend time learning how to write code the compiler can optimize well.
These things are knowable if you have enough curiosity and maybe masochism :)
Nowadays, counting cycles is a game for monks who don't actually have to get anything done.
(But CPUs are usually better than you think they are.)
I'm wondering how the incentives play out to keep this stuff private?
Loop-carried dependency is the big culprit here. I wish we had a culture of writing for loops with the index a constant inside the loop, as in the Ada for statement, and not the clever C while loops or for with two running variables... Simpler loop syntax makes so many static analyses 'easier' and kind of forces the brain to think in bounded independant steps, or to reach for higher level constructs (e.g. reduce).
Edit: And many modern compilers have been doing that particular optimization for a few years anyway, but it's still an important idea to keep in mind for any non-trivial graph of operations.
Edit: btw, while the fast version has bo loop carried dependencies and it uses 4x SIMD, it us only twice as fast
I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength reduced one could be faster still.
The compiler won't do it without fast math.
And yes unrolling the loop carried dependency on the strength reduced version will certainly make it faster as it's the only reason it's slower to begin with.
I encourage you to read my other comment here https://news.ycombinator.com/item?id=31551375
edit: to be clear, I'm only arguing about two things that the original parent quitestioned: whether vectorization is not free (it is because wider ALUs require less instructions) and whether the second loop used more instructions (it does not as it is unrolled by 4).
And of course one reason we say premature is that most likely until the project is mostly finished you can't really measure because you don't have anything to measure.
That’s only part secrecy and part to give them freedom to change it. It is of course somewhat described in their patents.
I honestly don't know if it's worth it to try to optimize branch prediction in compilers these days, beyond the obvious step of putting the highest probability target next (for fallthrough prediction) and generally laying out hot parts of the code together. TurboFan and most other dynamically-optimizing compilers put rare code at the end of functions, and that's a huge boost.
Well, you were and I wasn't, and it wasn't clear whether you meant to contradict my point. I guess I could've made it more clear that I meant "using == to denote bitwise equality here only" and not "btw, == is bitwise for floats", but either way it should be clear now.
The compiler would have to test for the x == -0 case, and doing that ¿rarely/never? is faster than computing x + y.
> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.
void compute()
{
const double A = 1.1, B = 2.2, C = 3.3;
const double A128 = 128*A;
double Y[8], Z[8];
Y[0] = C;
Y[1] = A + B + C;
Y[2] = 4*A + 2*B + C;
Y[3] = 9*A + 3*B + C;
Y[4] = 16*A + 4*B + C;
Y[5] = 25*A + 5*B + C;
Y[6] = 36*A + 6*B + C;
Y[7] = 49*A + 7*B + C;
Z[0] = 64*A + 8*B;
Z[1] = 80*A + 8*B;
Z[2] = 96*A + 8*B;
Z[3] = 112*A + 8*B;
Z[4] = 128*A + 8*B;
Z[5] = 144*A + 8*B;
Z[6] = 160*A + 8*B;
Z[7] = 176*A + 8*B;
int i;
for(i=0; i<LEN; i+=8) {
data[i ] = Y[0];
data[i+1] = Y[1];
data[i+2] = Y[2];
data[i+3] = Y[3];
data[i+4] = Y[4];
data[i+5] = Y[5];
data[i+6] = Y[6];
data[i+7] = Y[7];
Y[0] += Z[0];
Y[1] += Z[1];
Y[2] += Z[2];
Y[3] += Z[3];
Y[4] += Z[4];
Y[5] += Z[5];
Y[6] += Z[6];
Y[7] += Z[7];
Z[0] += A128;
Z[1] += A128;
Z[2] += A128;
Z[3] += A128;
Z[4] += A128;
Z[5] += A128;
Z[6] += A128;
Z[7] += A128;
}
}Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.
I'm always "surprised" by the fact that CPUs care more about bandwidth rather than latency these days. A lot of CPUs (Intel, AMD, ARM, etc. etc.) support 1x or even 2x SIMD-multiplications per clock tick, even though they take 5 clock ticks to execute.
I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput wise), even though its a big-deal latency wise.
So getting rid of those multiplies and cutting down the latency (ie: using only add statements) barely helps at all, maybe with no measurable difference.
One of these days, I'll actually remember that fact, lol.
4x 64-bit is 256-bit, which requires special compiler flags for 256-bit AVX2, but most x86 CPUs should support them these days.
2x64-bit is 128-bit, which fits in default SSE 128-bit SIMD with default GCC / Visual Studio compiler flags.
I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?
I really don't see how that works in improving this.
You can only calculate i+8, for calculating i+9 you depend on 8. And you can't go in strides either since i+16 depends on i+15 which you've not calculated so far unless you want to intermix the stateful and non-stateful code. I'd rather not go there.
I somehow got tripped up by the 4xSIMD. I was assuming you meant it's using 4x 64bit SIMD there which it doesn't. mulpd and addpd are 2x 64bit, also visible by the xmm instead of ymm registers.
I got sloppy on the difference between all instructions including the loop logic vs just the instructions necessary to do the main computation. Obviously the first is the correct measure and I was sloppy.
https://www.intel.com/content/www/us/en/developer/articles/n...
I think software is not a huge profit center for them.
The original comment presents:
> The actual details there are [1] too secret for Intel to want to accurately describe them in gcc, [2] they’re different across different CPUs, [3] and compilers just aren’t as good as you think they are.
2 and 3 could just be the whole story. Although we haven't actually accumulated any evidence here for 3, given that the original story was about surprisingly getting beat by a compiler, despite performing a seemingly obvious optimization.
Actually, compiler optimizations like scheduling tend to be neutral to negative on x86 because they increase register pressure. You’d probably want to do “anti-scheduling” and hope the CPU decoder takes care of it if anything.
Interesting that it generates suboptimal code for non-Intel processors.