Apple’s M1 processor and the full 128-bit integer product

Apple’s M1 processor and the full 128-bit integer product (lemire.me)

230 points by tgymnich 5 years ago | 175 comments

titzer 5 years ago |

Did anyone actually look at the machine code generated here? 0.30ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle on average. 1 cycle on average for any kind of loop is just flat out suspicious.

This requires a lot more digging to understand.

Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
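
For anyone who wants the "0.30ns is basically 1 cycle" arithmetic spelled out, here is a minimal sketch. The 3.2 GHz clock is an assumption (the commonly reported peak frequency of the M1 performance cores), and the function names are made up for illustration.

```c
#include <stdio.h>

/* Convert a per-value time in nanoseconds into clock cycles.
   cycles = nanoseconds * (cycles per nanosecond) = ns * GHz. */
double ns_to_cycles(double ns_per_value, double clock_ghz) {
    return ns_per_value * clock_ghz;
}

/* Print the estimate for one measurement. */
void report(double ns_per_value, double clock_ghz) {
    printf("%.2f ns/value at %.1f GHz = %.2f cycles/value\n",
           ns_per_value, clock_ghz,
           ns_to_cycles(ns_per_value, clock_ghz));
}
```

Calling `report(0.30, 3.2)` gives a figure just under one cycle per value, which is why the number looks suspicious for a loop with dependent multiplies.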

ascar 5 years ago | |

This is the benchmarking loop:

  for (size_t i = 0; i < N; i++) {
    out[i++] = g();
  }

N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers.

This is also visible in the assembly

  add     x8, x8, #2

So if I see this correctly the results are off by a factor of 2.

[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
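
A minimal sketch of the effect, using a hypothetical stand-in `g()` that simply counts its own calls. The helper names are not from the linked repository.

```c
#include <stddef.h>
#include <stdint.h>

static size_t calls;                          /* values actually produced */
static uint64_t g(void) { return ++calls; }   /* hypothetical stand-in generator */

/* Loop as posted: i is incremented in the body and again in the for header. */
size_t fill_as_posted(uint64_t *out, size_t N) {
    calls = 0;
    for (size_t i = 0; i < N; i++) {
        out[i++] = g();
    }
    return calls;                             /* N / 2 for even N */
}

/* Loop with a single increment: one value per slot. */
size_t fill_fixed(uint64_t *out, size_t N) {
    calls = 0;
    for (size_t i = 0; i < N; i++) {
        out[i] = g();
    }
    return calls;                             /* N */
}
```

With N = 20000 the first version calls `g()` 10000 times, so dividing the elapsed time by N reports half the real per-value cost.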

gpderetta 5 years ago | | |

Yes, the i++ seems an oversight.

The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.

ascar 5 years ago | | |

> Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

The article got updated by now :)

HelloNurse 5 years ago | | |

A C for statement is a "benchmarking loop" in the same sense that "slice a sponge cake into two layers and place custard in the middle" is an actionable dessert recipe.

pjc50 5 years ago | |

Failing to post disassembly for a micro benchmark is annoying.

It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.

Op fusion is very likely, as is register renaming: I suspect that "mul" always computes both products, and the upper one is left in a register which isn't visible to the programmer until they use "mulh" with the same argument. At which point it's just renamed into the target register.
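
For reference, a minimal C sketch of the two halves being discussed. It assumes a compiler with the `__uint128_t` extension (GCC, Clang). On AArch64 such compilers typically lower the first two functions to `mul` and `umulh` (the unsigned form of the "mulh" above), though the exact output depends on compiler and flags. Function names are illustrative, and the portable version is only there as a cross-check.

```c
#include <stdint.h>

/* Low 64 bits of the 128-bit product. */
uint64_t mul_lo(uint64_t a, uint64_t b) {
    return a * b;
}

/* High 64 bits of the 128-bit product. */
uint64_t mul_hi(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

/* Same high half built from four 32x32 -> 64 partial products. */
uint64_t mul_hi_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Carry out of the middle 32-bit column. */
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}
```

Whether the hardware shares work between the two instructions is exactly the open question here; the C source only shows that both halves come from the same product.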

brigade 5 years ago | |

The dependency chain is state += 0x60bee2bee120fc15ull or (state += UINT64_C(0x9E3779B97F4A7C15)); the rest of the calculations are independent per iteration.

Anyway, the more important fact is that a 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops, because modern CPUs generally aren't designed around µops that can write two registers in the same set.
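
A rough sketch of the two generators, to show where the loop-carried dependency sits. The two increment constants are the ones quoted above; the multiplier constants and shift amounts are recalled from commonly circulated wyhash-style and splitmix64 implementations and should be treated as an assumption, not as a copy of the benchmarked code. Passing the state by pointer is also my choice for the example. Assumes `__uint128_t`.

```c
#include <stdint.h>

/* wyhash-style step: one add on the state, then two full 64x64 -> 128 products. */
uint64_t wyhash64_next(uint64_t *state) {
    *state += UINT64_C(0x60bee2bee120fc15);           /* only loop-carried dependency */
    __uint128_t t = (__uint128_t)*state * UINT64_C(0xa3b195354a39b70d);
    uint64_t m1 = (uint64_t)(t >> 64) ^ (uint64_t)t;  /* fold high and low halves */
    t = (__uint128_t)m1 * UINT64_C(0x1b03738712fad5c9);
    return (uint64_t)(t >> 64) ^ (uint64_t)t;
}

/* splitmix64 step: one add on the state, then two 64-bit (low half) multiplies. */
uint64_t splitmix64_next(uint64_t *state) {
    uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));  /* only loop-carried dependency */
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}
```

In both functions everything after the first line depends only on the current state value, so an out-of-order core can overlap the mixing work of several iterations.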

titzer 5 years ago | | |

It's a shame we can't see the rest of the code. What is happening to the result value? Is it being compared to something? Put into an array, or what? All of that code probably totally outweighs what you pointed out here. Or, at least it should. I have a bad feeling it might be being dead-code eliminated, since compilers are super aggressive about that nowadays, but I hope he's somehow controlled for that.
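
One common way to guard against that, sketched under my own assumptions (this is not the code from the post): store every result and fold the results into a checksum that is written to a `volatile` variable, so the compiler has to keep the computation. `counter_gen` is a hypothetical stand-in for the real generator.

```c
#include <stddef.h>
#include <stdint.h>

static volatile uint64_t sink;   /* volatile store the compiler must perform */

/* Fill out[0..n) from gen() and publish a checksum so the work is observable. */
uint64_t fill_and_consume(uint64_t *out, size_t n, uint64_t (*gen)(void)) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = gen();
        acc ^= out[i];           /* every value feeds the final result */
    }
    sink = acc;
    return acc;
}

/* Hypothetical stand-in generator: returns 1, 2, 3, ... */
static uint64_t counter_state;
static uint64_t counter_gen(void) { return ++counter_state; }
```

The checksum and the indirect call add work of their own, so this trades a little measurement noise for confidence that the loop body was not removed.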

echlebek 5 years ago | |

The conclusion seems based on the relative execution times for the two benchmarks. Since the benchmarks are measured in the same way, their error bars should be basically the same as well. This analysis is not an analysis of the absolute execution time of these algorithms, but the difference between them.

I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".

titzer 5 years ago | | |

> Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".

But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: if the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops for the loop, how many execution ports there are in the processor, and what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.

mhh__ 5 years ago | |

At that throughput the CPU is speculating and exploiting the access pattern.

It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)

baryphonic 5 years ago | |

Totally agree. I was thinking he'd get there, and then the post abruptly ended.

p1mrx 5 years ago |

RISC-V does this too: https://five-embeddev.com/riscv-isa-manual/latest/m.html

"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."

namibj 5 years ago |

For the interested, LLVM-MCA says this:

    Iterations:        10000
    Instructions:      100000
    Total Cycles:      25011
    Total uOps:        100000

    Dispatch Width:    4
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 2.5

    No resource or data dependency bottlenecks discovered.

To me this looks like 2.5 cycles per iteration (on Zen3). Tigerlake is a bit worse, at about 3 cycles per iteration, apparently because it runs more uOps per iteration.

For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):

    .LBB5_2:                                # =>This Inner Loop Header: Depth=1
    mov     rdx, r11
    add     r11, r8
    mulx    rdx, rax, r9
    xor     rdx, rax
    mulx    rdx, rax, r10
    xor     rdx, rax
    mov     qword ptr [rdi + 8*rcx], rdx
    add     rcx, 2
    cmp     rcx, rsi
    jb      .LBB5_2
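
The 2.5 figure can be read off the summary in two ways; a small sketch of the arithmetic (function names are illustrative):

```c
#include <stdint.h>

/* Average cycles per loop iteration from the simulated totals. */
double cycles_per_iteration(uint64_t total_cycles, uint64_t iterations) {
    return (double)total_cycles / (double)iterations;
}

/* Lower bound imposed by dispatch width alone: uops per iteration / width. */
double dispatch_bound(uint64_t uops_per_iteration, uint64_t dispatch_width) {
    return (double)uops_per_iteration / (double)dispatch_width;
}
```

`cycles_per_iteration(25011, 10000)` is about 2.5, and `dispatch_bound(10, 4)` is exactly 2.5 for the 10-instruction loop body above, so in this model the loop is limited by dispatch width rather than by the multiplies.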

thesz 5 years ago |

The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.

Instruction recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.

tzs 5 years ago |

I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?

How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is if the processor recognizes the specific two instruction sequence, or if it is something more general like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere so that if there is a mulh before something overwrites that cache it can use it).

sgtnoodle 5 years ago | |

It seems like something that would be arbitrary depending on how the optimization was implemented. There wouldn't be an inherent need for that amount of generalization. Apple can tightly control their compiler to follow the rules, and there seemingly wouldn't be any compelling reason not to stick those two instructions back to back in a consistent order, since the second instruction is effectively free.

It would be fun to experiment with, for someone that has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would be fun to try and consistently interrupt the CPU between the two instructions as well somehow, to see if that short-circuits the optimization.

thefourthchime 5 years ago |

I love my M1, but does anyone else have horrific performance when resuming from wake? It’s like it swaps everything to disk and takes a full minute to come back to life.

acje 5 years ago |

So this is why the integer multiply accumulate instruction mullah, only delivers the most significant bits? Ironic if you aren't religious about these things.

yuhong 5 years ago |

I believe that the ARMv8 NEON crypto extensions have a special instruction for 64-bit multiply to 128-bit product, which is useful for Monero mining for example.

Daho0n 5 years ago |

The amount of bugs in the M1 and MacOS posted on HN in a week could keep developers working for months at Apple.

mmaunder 5 years ago |

A common misconception about RISC processors.

chrisseaton 5 years ago | |

It’s not a RISC thing - CISC implementations do exactly the same kind of fusion for similar pairs of operations.

userbinator 5 years ago | | |

It has a bit less gain on a RISC due to the code density (or lack thereof), since it requires more fetch bandwidth. Apple works around this by using a very wide front-end: https://news.ycombinator.com/item?id=25257932

LAMike 5 years ago |

Anyone want to take a guess at how long it will be until Apple has their own fab in the US making M1 chips?

ben_bai 5 years ago |

That's great if your App is compute bound. "May all your Processes be compute bound." Back in the real world most of the time your Process will be io bound. I think that's the real innovation of the M1 chip.

gpderetta 5 years ago | |

Exactly because of the "real world" argument, it turns out that a lot of actual real world loads are CPU bound because they are so wastefully implemented. IO of all kinds has extremely high bandwidth these days and OoO helps hide the latency.

isitdopamine 5 years ago | |

Explain please. What does the M1 do to IO loads?

1_player 5 years ago | | |

Nothing. GP's point is that compute speed isn't that important if you're waiting on IO.

K0balt 5 years ago | | |

On die memory and storage. No bottlenecks, very little latency.

Ar-Curunir 5 years ago | |

128-bit muls really help speed up finite field impl, which speed up elliptic curve crypto. That’s one crucial place where faster code helps.
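
As one illustration, here is a minimal sketch of multiplication modulo the Mersenne prime 2^61 - 1 using a single full product. This prime is chosen only because the reduction is short; fields used in real elliptic curve code are larger and need multi-limb arithmetic, where the same full product is the basic building block. Assumes `__uint128_t`; the names are illustrative.

```c
#include <stdint.h>

#define M61 ((UINT64_C(1) << 61) - 1)   /* the prime 2^61 - 1 */

/* Multiply a and b modulo 2^61 - 1. Requires a < M61 and b < M61. */
uint64_t mulmod_m61(uint64_t a, uint64_t b) {
    __uint128_t t = (__uint128_t)a * b;   /* full 128-bit product */
    uint64_t lo = (uint64_t)t & M61;      /* low 61 bits */
    uint64_t hi = (uint64_t)(t >> 61);    /* remaining high bits */
    /* 2^61 is congruent to 1 mod M61, so t is congruent to hi + lo. */
    uint64_t s = lo + hi;                 /* at most 2 * M61 - 2 */
    if (s >= M61) {
        s -= M61;                         /* one conditional subtract suffices */
    }
    return s;
}
```

Without a fast 64x64 -> 128 product, the same operation has to be assembled from several narrower multiplies and carries.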

zelon88 5 years ago |

You mean to tell me that a $2000 Macbook is almost as performant as a $1000 PC? Tell me more!

neogodless 5 years ago | |

Based on U.S. prices, it's more like $999 vs $609 for similar specs (but no doubt a nicer machine and much better screen/touchpad with the Air.)

https://www.apple.com/shop/buy-mac/macbook-air

https://www.amazon.com/Lenovo-IdeaPad-Laptop-Newest-Display/...

zelon88 5 years ago | | |

The <10 second comparison I did was between an ASUS A15 and a 16" MacBook. $1000 vs $2500.

mpweiher 5 years ago | |

Both Minis and Airs start at under $1000, and they're all the same speed.

immigrantsheep 5 years ago | | |

Air starts at $1500 if you're not in the USA