Replacing 32-bit loop variable with 64-bit introduces performance deviations

Replacing 32-bit loop variable with 64-bit introduces performance deviations (stackoverflow.com)

144 points by thisisnotmyname 11 years ago | 22 comments

mgraczyk 11 years ago |

To elaborate on the justification for the answer:

    So Intel probably shoved popcnt into the same category to keep the processor design simple

In the processor design I work on, we do register dependency checks by partitioning all instructions into a set of "timing classes" and checking the dispatch delay needed between dependent register producers and consumers across all possible timing class pairs. The delays vary depending on available forwarding networks, resource conflicts, etc. Oftentimes we group instructions into suboptimal timing classes to simplify other parts of the design or just to make the dispatch logic simpler.

Intel's x86 core is waaaaay more complicated than the core I work on and has far more instructions, so it's probably safe to say that they make these suboptimal classifications often. I strongly suspect that the false dependency was intentional and not a "hardware bug" as some of the StackOverflow comments seem to suggest.
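A minimal sketch of the timing-class idea described above, in C++ for illustration only. The operation names, the three classes, and every cycle count here are made up; nothing below reflects a real core. It only shows how a per-class-pair delay table works, and how lumping an instruction into an existing class makes it inherit that class's delays.

```cpp
#include <cstddef>

// Hypothetical operations and timing classes (illustrative only).
enum class Op { Add, Sub, Mul, BitCount, Load };
enum class TimingClass : std::size_t { SimpleAlu = 0, Multiply = 1, Load = 2 };

// kDelay[producer][consumer]: made-up dispatch delays in cycles between a
// register producer of one class and a dependent consumer of another.
constexpr int kDelay[3][3] = {
    /* SimpleAlu producer */ {1, 1, 1},
    /* Multiply producer  */ {3, 3, 4},
    /* Load producer      */ {4, 5, 4},
};

// Map each operation to a timing class. BitCount is deliberately lumped in
// with Multiply here to keep the table small, even if a dedicated class
// could have described it more precisely.
inline TimingClass class_of(Op op) {
    switch (op) {
        case Op::Add:
        case Op::Sub:
            return TimingClass::SimpleAlu;
        case Op::Mul:
        case Op::BitCount:
            return TimingClass::Multiply;
        case Op::Load:
            return TimingClass::Load;
    }
    return TimingClass::SimpleAlu;  // unreachable for valid Op values
}

// The dispatch logic only ever looks at class pairs, never at individual ops.
inline int dispatch_delay(Op producer, Op consumer) {
    return kDelay[static_cast<std::size_t>(class_of(producer))]
                 [static_cast<std::size_t>(class_of(consumer))];
}
```

In this toy model `dispatch_delay(Op::BitCount, Op::Add)` is by construction the same as `dispatch_delay(Op::Mul, Op::Add)`: the simplification is cheap for the dispatch logic, and any imprecision is paid for by the lumped-in instruction.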

userbinator 11 years ago | |

I wouldn't classify it as intentional nor a "bug"; probably it's more of an oversight, as it's mentioned in the article that AMD's CPUs don't have this issue. Intel should definitely be made aware of this.

We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add, sub take two operands both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.

On the other hand, MOV doesn't read both operands either.

caf 11 years ago | | |

Reg-Reg MOV doesn't use an ALU, though.

It would be interesting to see if the Intel C Compiler knows about this false dependency.

seanmcdirmid 11 years ago | |

All X86 ops are translated into very simple (RISCy) micro-ops before being scheduled, so the problem probably lies in that part of the processor.

asuffield 11 years ago | | |

Even if the problem isn't there, it's really easy to fix in that layer: just insert an instruction before popcnt that kills the value in the destination register, and there won't be anything to wait for. Intel does regular microcode updates to fix this sort of thing, so I would anticipate seeing this one get fixed in the not-too-distant future.
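The same "kill the destination first" idea can also be tried from software rather than microcode. The following is only a sketch, assuming GCC or Clang with the default AT&T inline-assembly syntax on x86-64; the function name `popcnt_no_false_dep` is hypothetical, and on other targets it simply falls back to the compiler builtin.

```cpp
#include <cstdint>

// Hypothetical helper: zero the destination register before popcnt so the
// instruction has no stale value to wait on.
inline std::uint64_t popcnt_no_false_dep(std::uint64_t x) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uint64_t result;
    __asm__("xorl %k0, %k0\n\t"   // zero the destination (32-bit xor clears all 64 bits)
            "popcntq %1, %0"      // count set bits of x into the destination
            : "=&r"(result)       // early-clobber: must not share a register with x
            : "r"(x)
            : "cc");
    return result;
#else
    // Fallback for non-x86-64 targets (GCC/Clang builtin).
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}
```

Whether this actually helps would need measuring on the CPU in question; the x86-64 path also assumes the machine supports the popcnt instruction.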

tofof 11 years ago |

TLDR: Headline (and indeed bulk of article) is phantom symptom. True cause is register allocator behavior.

Specifically, allocator's handling of an instruction with a false dependency on register that's written to, coupled with multiple compilers being unaware of the false dependency.
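For anyone who has not opened the linked question, the kind of loop being discussed can be sketched roughly as follows. This is not the original benchmark, just a minimal illustration: the function name `count_bits` is hypothetical, and `__builtin_popcountll` is a GCC/Clang builtin that only becomes a single popcnt instruction when the target allows it (for example with -march=native, as in the command lines quoted elsewhere in this thread).

```cpp
#include <cstdint>

// Sum the population counts of n 64-bit words. The Index template parameter
// stands in for the 32-bit vs 64-bit loop variable from the headline.
template <typename Index>
std::uint64_t count_bits(const std::uint64_t* buf, Index n) {
    std::uint64_t total = 0;
    for (Index i = 0; i < n; ++i) {
        // Each popcount result goes to some destination register chosen by
        // the compiler's register allocator; which one it picks is what
        // changes between otherwise equivalent versions of the loop.
        total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i]));
    }
    return total;
}
```

Both `count_bits<unsigned>` and `count_bits<std::uint64_t>` compute the same sum; any speed difference comes from the generated code, not from the source-level logic.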

PythonicAlpha 11 years ago | |

Maybe one should add that (as far as I understood) the problem lies in how the processor handles one specific (and rare) instruction: it assumes a register dependency that does not exist. It was shown that AMD does not have this behavior. It also shows that today's processors are enormously complex beasts.

The problem with the compilers was that they were not aware of this behavior and thus generated sub-optimal code for this situation ... but compiler builders are also mere humans.

shmerl 11 years ago | | |

Did they file a bug for gcc and clang?

jbondeson 11 years ago |

This is why micro-benchmarking is Russian roulette.

When you distill a loop until you're finding the exact bottleneck in the system (pipelining, branch prediction, etc) you need to be very very careful you're measuring what you think you are. Otherwise you'll end up in this situation where you're benchmarking a compiler...

byuu 11 years ago |

I suppose similarly related to this, when I was keeping track of synchronization between two cooperative simulation threads running at different frequencies, I had a 64-bit signed integer: chip A would add chip_B_frequency * chip_A_cycles_executed; and chip B would subtract chip_A_frequency * chip_B_cycles_executed. If the value was >=0, chip A was ahead and would switch to B; and if the value was <0, chip B was ahead and would switch to A.

I ended up getting a noticeable speed boost just by using sync += (uint32_t)clocks * (uint64_t)frequency; ... just a simple 32-bit x 64-bit multiply was quite a bit faster than a 64-bit x 64-bit multiply. (One had to be 64-bit to prevent the multiplication from overflowing, as one value was in the MHz range and the other could be up to ~2000 or so.)

I've observed this on both AMD and Intel amd64 CPUs. Not sure how that'd hold up on other CPUs. As always though, profile your code first, and only consider these types of tricks in hot code areas.
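A rough sketch of the synchronization counter described above, under stated assumptions: the struct and member names are hypothetical, and only the sign convention and the 32-bit x 64-bit multiply come from the comment. Whether the narrower multiply is really faster depends on the compiler and CPU, so treat it as something to profile rather than a guarantee.

```cpp
#include <cstdint>

// Hypothetical scheduler state for two cooperatively simulated chips.
struct SyncCounter {
    std::int64_t sync;          // >= 0: chip A is ahead; < 0: chip B is ahead
    std::uint64_t frequency_a;  // chip A clock rate in Hz
    std::uint64_t frequency_b;  // chip B clock rate in Hz

    // Chip A executed `clocks` cycles: scale by B's frequency and add.
    // clocks stays 32-bit; the 64-bit frequency keeps the product from
    // overflowing.
    void run_a(std::uint32_t clocks) {
        sync += static_cast<std::int64_t>(clocks * frequency_b);
    }

    // Chip B executed `clocks` cycles: scale by A's frequency and subtract.
    void run_b(std::uint32_t clocks) {
        sync -= static_cast<std::int64_t>(clocks * frequency_a);
    }

    bool a_is_ahead() const { return sync >= 0; }
};
```

With made-up frequencies of 3 MHz for A and 2 MHz for B, two cycles of A followed by two cycles of B leave `sync` negative, which matches real time: A has covered about 0.67 microseconds and B a full microsecond.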

userbinator 11 years ago |

It should be noted that using 64-bit operands, even in 64-bit mode, incurs an extra penalty of 1 byte per instruction, for the REX prefix. The same applies to using the extended registers (the uncreatively-named "r8" through "r15".) This is very much not noticeable for microbenchmarks, where all the code of a loop fits in the cache, but for bigger ones, the effects of icache misses can become quite significant. A smaller instruction sequence that is slower than a larger one when microbenchmarked can become much faster once that code is benchmarked as part of a whole system.
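To make the one-byte cost concrete, here are the machine-code bytes for three register-to-register adds, written as C++ byte arrays so the sizes can be checked at compile time. The array names are hypothetical; the encodings follow the standard `01 /r` form of ADD with an optional REX prefix.

```cpp
#include <cstddef>

// add eax, ebx : opcode 01, ModRM D8 (no prefix needed)
constexpr unsigned char kAddEaxEbx[] = {0x01, 0xD8};

// add rax, rbx : same opcode and ModRM, plus REX.W (0x48) for 64-bit operands
constexpr unsigned char kAddRaxRbx[] = {0x48, 0x01, 0xD8};

// add r8d, r9d : 32-bit operands, but REX.R and REX.B (0x45) are needed to
// reach the extended registers
constexpr unsigned char kAddR8dR9d[] = {0x45, 0x01, 0xC8};

static_assert(sizeof(kAddRaxRbx) == sizeof(kAddEaxEbx) + 1,
              "64-bit operand size costs one REX byte");
static_assert(sizeof(kAddR8dR9d) == sizeof(kAddEaxEbx) + 1,
              "extended registers cost one REX byte");
```

Two bytes versus three is a 50% increase for these instructions, which is invisible in a loop that fits in cache and adds up across a large binary.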

nitrogen 11 years ago | |

(the uncreatively-named "r8" through "r15".)

I'd much rather have numbered registers that can be used for anything than named registers that have usage limitations.

colanderman 11 years ago | | |

I suspect that aside was written tongue-in-cheek.

gioele 11 years ago | |

> incurs an extra penalty of 1 byte per instruction, for the REX prefix.

Any hope to see a Thumb mode for x86-64?

frozenport 11 years ago |

How can you fix this in VS, where there is no way to finely target a CPU?

    nate@sandybridge:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
    nate@sandybridge:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.608615 sec 17.2289 GB/s
    uint64_t 41959360000 0.82312 sec 12.739 GB/s

    nate@sandybridge:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
    nate@sandybridge:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.182781 sec 57.3679 GB/s
    uint64_t 41959360000 0.182638 sec 57.4128 GB/s

    nate@haswell:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
    nate@haswell:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.401225 sec 26.1343 GB/s
    uint64_t 41959360000 0.75841 sec 13.826 GB/s

    nate@haswell:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
    nate@haswell:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.0843861 sec 124.259 GB/s
    uint64_t 41959360000 0.0842836 sec 124.41 GB/s

    nate@sandybridge:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.517827 sec 20.2495 GB/s
    uint64_t 41959360000 0.518041 sec 20.2412 GB/s

    nate@haswell:~/tmp$ popcnt-dependency 1
    unsigned 41959360000 0.351273 sec 29.8507 GB/s
    uint64_t 41959360000 0.352914 sec 29.712 GB/s