This requires a lot more digging to understand.
Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
for (size_t i = 0; i < N; i++) {
out[i++] = g();
}
N is 20000 and the time measured is divided by N. [1] However, that loop has two increments and only computes 10000 numbers. This is also visible in the assembly:
add x8, x8, #2
So if I see this correctly the results are off by a factor of 2.
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
The relative speed between the two hashes is still the same, but it is no longer one iteration per cycle.
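To make the factor of 2 concrete, here is a minimal sketch of the quoted loop next to a single-increment version. The generator `g` is a hypothetical stand-in (a plain counter), and the helper names are made up for illustration; only the loop shape comes from the benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the generator under test: a plain counter. */
static uint64_t counter_state = 0;
static uint64_t g(void) { return ++counter_state; }

/* Loop shape as quoted: i is incremented in the body and again in the
   for header, so only every other slot is written.
   Returns how many times g() was actually called. */
size_t fill_double_increment(uint64_t *out, size_t n) {
    size_t calls = 0;
    for (size_t i = 0; i < n; i++) {
        out[i++] = g();
        calls++;
    }
    return calls;
}

/* Same loop with a single increment: one call to g() per slot. */
size_t fill_single_increment(uint64_t *out, size_t n) {
    size_t calls = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = g();
        calls++;
    }
    return calls;
}

/* Time per generated value: divide by the values actually produced,
   not by the array length. */
double time_per_value(double total_ns, size_t values_produced) {
    return total_ns / (double)values_produced;
}
```

With N = 20000 the first loop calls `g()` 10000 times, so dividing the measured time by N understates the cost per value by half.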
The article got updated by now :)
It is of course speculating all the way through the loop; a short backwards conditional branch will be speculated as "taken" by even very simple predictors.
Op fusion is very likely, as is register renaming: I suspect that "mul" always computes both products, and the upper one is left in a register which isn't visible to the programmer until they use "mulh" with the same argument. At which point it's just renamed into the target register.
Anyway, the more important fact is that 64x64b -> 128b mul might be one instruction on x86, but it's broken into 2 µops, because modern CPUs are generally not designed around µops being able to write two registers in the same set.
I don't think the conclusion is hasty. Lemire is saying: "look, if the M1 full multiplication was slow, we'd expect wyrng to be worse than splitmix, but it isn't".
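For readers who haven't seen the two generators, here is a rough C sketch of what is being compared. The constants are the ones I recall from commonly circulated versions of wyrng and splitmix64, so treat them as an assumption rather than a copy of the benchmark source; `unsigned __int128` is a GCC/Clang extension, and the function names are made up for this sketch.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product, split into halves.
   On AArch64 this typically becomes a mul/umulh pair;
   on x86-64 a single mul or mulx. */
static inline void full_mul(uint64_t a, uint64_t b,
                            uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* wyrng-style step: two full multiplies, each folded with xor.
   Constants are assumed, not taken from the thread. */
uint64_t wyrng_next(uint64_t *state) {
    uint64_t hi, lo;
    *state += UINT64_C(0x60bee2bee120fc15);
    full_mul(*state, UINT64_C(0xa3b195354a39b70d), &hi, &lo);
    uint64_t m1 = hi ^ lo;
    full_mul(m1, UINT64_C(0x1b03738712fad5c9), &hi, &lo);
    return hi ^ lo;
}

/* splitmix64-style step: two low-half multiplies with xor-shifts.
   The shift amounts (30, 27, 31) match the assembly quoted in the thread;
   the constants are assumed. */
uint64_t splitmix64_next(uint64_t *state) {
    uint64_t z = (*state += UINT64_C(0x9e3779b97f4a7c15));
    z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
    return z ^ (z >> 31);
}
```

The point of the comparison is that the first function needs the high half of each product and the second one does not.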
But that doesn't follow either. Only by inspecting the machine code do we get to see what's really going on in a loop, and the ultimate result is dependent on a lot of factors: whether the compiler unrolled the loop (here: no), whether there were any spills in the loop (here: no), what the length of the longest dependency chain in the loop is, how many micro-ops the loop has, how many execution ports there are in the processor and of what type, the frontend decode bandwidth (M1: seems up to 5 ins/cycle), whether there is a loop stream buffer (M1: seems no, but most Intel processors, yes), the latency of L1 cache, how many loads/stores can be in-flight, etc, etc. These are the things you gotta look at to know the real answer.
It's also worth saying that if Apple were dead set on throughput in this area they could've implemented some non-trivial fusion to improve performance. I don't have an M1 so I can't find out for you (and Apple are steadfast on not documenting anything about the microarchitecture...)
"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
Iterations: 10000
Instructions: 100000
Total Cycles: 25011
Total uOps: 100000
Dispatch Width: 4
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 2.5
No resource or data dependency bottlenecks discovered.
To me, that looks like 2.5 cycles per iteration (on Zen3).
Tigerlake is a bit worse, at about 3 cycles per iteration, due to running more uOPs per iteration, by the looks of it.
For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):
.LBB5_2: # =>This Inner Loop Header: Depth=1
mov rdx, r11
add r11, r8
mulx rdx, rax, r9
xor rdx, rax
mulx rdx, rax, r10
xor rdx, rax
mov qword ptr [rdi + 8*rcx], rdx
add rcx, 2
cmp rcx, rsi
jb .LBB5_2
Instruction recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is if the processor recognizes the specific two-instruction sequence, or if it is something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if there is a mulh before something overwrites that cache it can use it.)
It would be fun to experiment with, for someone that has the hardware. My guess is that swapping the order will make it slower, but adding an independent instruction or two between them probably won't have a measurable effect. It would be fun to try and consistently interrupt the CPU between the two instructions as well somehow, to see if that short-circuits the optimization.
https://www.apple.com/shop/buy-mac/macbook-air
https://www.amazon.com/Lenovo-IdeaPad-Laptop-Newest-Display/...
Worse, for some of us when it does finally wake up the monitor, sometimes it wakes it up with all the wrong colors, and rebooting is the only reliable fix. (and before anyone asks, yes, I tried a different HDMI cable)
It's much faster if the monitor has been used recently, though, so I always figured it was the monitor that was causing the delay by going into some deep sleep state?
I don't have any performance issues waking up though.
Have you tried a USB-C/thunderbolt cable/controller tho?
Sometimes this will happen multiple times per page load if I deselect and reselect the password field.
I suspect it’s because I have five monitors and 20 million pixels (actually more as that’s the post-retina resolution).
Rendering a FPS game at 1080p is 2 million pixels per frame. At 60fps, that's rendering 120 million pixels per second.
What am I missing?
Of course for general tasks it was slower, but I really remember that thing waking up instantly when I raised the lid, every time.
(It's still way faster than the same set of apps on an Intel Mac laptop, where it could sometimes take on the order of 30 seconds to get to a usable desktop after a long sleep. On Intel Macs it seemed more obvious that the GPU was the bottleneck)
I have buggy apps (like Facebook Messenger) locking up, but I guess that's normal, I just uninstall them.
Maybe desktop platforms sleep differently than laptops?
I do occasionally have an issue where the brightness on the built in display is borked and won’t adjust back to the correct level for anywhere between 30s to a few minutes.
And then I don’t know if it’s my monitor or the M1, but sometimes there will be a messed up run of consecutive pixel columns about 1/10th of the screen wide starting about 30% from the left of the display. The entire screen in that region is shifted a few pixels upwards. Sometimes it’s hard to notice it but once you do it can’t be unseen. Replugging the monitor into the M1 resolves the issue.
Because Apple has a lot of capital and they wouldn’t need to compete as hard for their share of tsmc production capacity.
Even then, does Apple use enough chips to justify running a fab, let alone one that would be locked into the node of the time? I really don't see it happening, for many reasons. The only reason they would is some tax break incentive to onshore some of the money they have offshore, such that it pays for itself, win or fail.
I don't think that's relevant anymore. My understanding is that the 2017 TCJA required prior unrepatriated earnings to be recognized and taxed over eight years (so still ongoing) and future foreign earnings not subject to US tax (except if the foreign tax is below the corporate alternative minimum tax rate). As a result of those changes, there's no need to hold cash offshore.
The 16" MacBook Pro is not an M1.
The Air starts at $999, the mini at $699 (official list from Apple itself), $899/$679 education.
I guess the only drawback compared to MB Air is that it's a bit heavier.
edit: I love how I keep getting downvoted on HN if I dare say anything about the M1. Even if it's true, like the price of the machine.
I just checked and it appears that notebooksbilliger.de sells the Air M1 starting at 1057 EUR and is willing to ship to Croatia for 30 EUR. If you were interested, maybe that's a better alternative.
Edit: Amazon.de charges 1079 EUR and seems happy enough to ship to consumers in Croatia as well for around 14 EUR. I haven't tried completing an order, obviously, but there are no relevant restrictions listed.
My experience maintaining projects is actually not that people don't provide enough info in their bugs (certainly true for random forum rants though), but when they try they try too hard and end up spending a long time writing a bunch of stuff I don't even read, because the log speaks for itself.
In this case you're not necessarily reaching an engineer, it could go to someone who combines reports together, or the fix is because of your report but it doesn't get communicated back properly, but it's still letting someone know it's a problem.
This may prompt me to upgrade prematurely, if/when the next M1 MBP comes out with more than two ports.
They are by far the biggest customer and have a multi-faceted relationship e.g. OLEDs, Modems.
It is my only issue with M1.
And they dominate the competition in performance/power.
godbolt clang compiles it to:
.LBB5_2: // =>This Inner Loop Header: Depth=1
mul x13, x11, x10
umulh x14, x11, x10
eor x13, x14, x13
mul x14, x13, x12
umulh x13, x13, x12
eor x13, x13, x14
str x13, [x0, x8, lsl #3]
add x8, x8, #2 // =2
cmp x8, x1
add x11, x11, x9
b.lo .LBB5_2
[1] https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
Just staring at the machine code, it looks like the hottest loop for wyrng is about 10 instructions with a store in it. If the processor can do that loop in 1 cycle on average then...holy fuck.
edit: I was looking at similar code generated by clang on my machine. Again, holy fuck.
I don't think the story here is that 64x64=128 multiply is fast, honestly. The real story is the insane level of speculation and huge ROB that is necessary to make that non-unrolled loop go so fast. Everything has to go pretty much perfect to achieve that throughput.
.LBB0_2:
eor x13, x9, x9, lsr #30 # 2 * p1-6
mul x13, x13, x11 # 1 * p5-6
eor x13, x13, x13, lsr #27 # 2 * p1-6
mul x13, x13, x12 # 1 * p5-6
eor x13, x13, x13, lsr #31 # 2 * p1-6
str x13, [x0, x10, lsl #3] # 1 * p7-8
add x13, x10, #2 # 1 * p1-6
add x9, x9, x8 # 1 * p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB0_2 # Fused into 1 * p1-3
# Total: 11 uops
.LBB1_2:
mul x13, x9, x11 # 1 * p5-6
umulh x14, x9, x11 # 1 * p5-6
eor x13, x14, x13 # 1 * p1-6
mul x14, x13, x12 # 1 * p5-6
umulh x13, x13, x12 # 1 * p5-6
eor x13, x13, x14 # 1 * p1-6
str x13, [x0, x10, lsl #3] # 1 * p7-8
add x13, x10, #2 # 1 * p1-6
add x9, x9, x8 # 1 * p1-6
mov x10, x13 # none
cmp x13, x1 #
b.lo .LBB1_2 # Fused into 1 * p1-3
# Total: 10 uops
Purely based on number of uops, there's a slight win for wyhash, all other things being equal. However, I doubt that you're really getting one iteration per cycle here; there are 6 integer units, and even if you perfectly exploited instruction-level parallelism you'd be limited to 6 ALU uops per cycle, which is fewer than either loop contains. It would be possible if the mul-umulh pairs are getting fused, which would bring it down to 8 uops per iteration.
Taking into account the port distribution, each iteration of wyhash involves 4 uops being dispatched to ports 5 and 6, which means you should be getting at least 2 cycles/iteration purely for the multiplications. If it's much lower than that, the whole multiplication being fused into a single port 5-6 uop might be right.
However I can neither confirm nor deny that the loops behave like that on the M1, as I don't have one.
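The port argument above boils down to a simple lower bound: for each group of ports, divide the uops that must go there by the number of ports, and take the worst case. A small sketch, using the port model assumed in the annotated listings (p1-6 integer, p5-6 multiply, p7-8 store), which is a guess about the M1 and not documented behavior; the struct and function names are made up:

```c
#include <stddef.h>

/* One group of execution ports, and how many uops per loop iteration
   can only be issued to that group. */
struct port_group {
    unsigned uops;   /* uops per iteration restricted to this group */
    unsigned ports;  /* number of ports in the group */
};

/* Lower bound on cycles per iteration from port pressure alone:
   the most heavily loaded group sets the pace. Ignores latency,
   frontend width, and everything else. */
double min_cycles_per_iteration(const struct port_group *groups, size_t n) {
    double bound = 0.0;
    for (size_t i = 0; i < n; i++) {
        double c = (double)groups[i].uops / (double)groups[i].ports;
        if (c > bound) {
            bound = c;
        }
    }
    return bound;
}
```

For the wyhash loop as annotated, the groups are {4 uops, 2 ports} for multiplies, {9, 6} for all integer uops, and {1, 2} for the store, which gives 2 cycles per iteration. If each mul/umulh pair were fused into one uop, the groups would become {2, 2}, {7, 6}, and {1, 2}, and the bound would drop to 7/6, or about 1.17.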
The very wide execution is though.
The benchmark probably gets rid of that by doing it 40,000 times in quick succession, but why not measure the time of all 40,000 iterations in one go, and decrease the risk (and the overhead of calling gettimeofday 39,999 times)?
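Something like the following is what I mean by timing everything in one go: two clock reads around the whole batch, then divide. This is only a sketch; it uses C11 `timespec_get`, which reads the wall clock (a monotonic clock would be a better choice where available), and `sample_work` is a hypothetical placeholder for the real kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds (C11 timespec_get). */
static double wall_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Hypothetical placeholder workload; counts how often it ran. */
static volatile uint64_t sample_calls = 0;
void sample_work(void) { sample_calls = sample_calls + 1; }

/* Time `reps` back-to-back calls with only two clock reads,
   and return the average nanoseconds per call. */
double batch_ns_per_rep(void (*work)(void), size_t reps) {
    double start = wall_ns();
    for (size_t r = 0; r < reps; r++) {
        work();
    }
    double stop = wall_ns();
    return (stop - start) / (double)reps;
}
```

Calling `batch_ns_per_rep(sample_work, 40000)` reads the clock twice instead of once per repetition, so the timer overhead is amortized over the whole batch.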
What do you mean by this? Apple does very little manufacturing, they're famous for it.
Only in that they design it, not in that they build it.
In that area they have "vertically integrated" nothing.
Why would they want to get into the low-margin, high-risk part of their supply chain, the bit where you can sink billions of dollars and have the value wiped out by a poor choice?
Apple would be insane to get into the middle of it.
This makes me realise how much of a pain it has been to use in the past weeks or so. Now that it is back to normal snappy it is such a pleasure to work with again.
China is engaging in ever more aggressive saber rattling, and the total lack of any measurable reaction to their takeover of Hong Kong has only emboldened them. Who can guarantee Taiwan won't meet the same fate?
Taiwan is completely self-governed at the moment and sees itself as an independent nation.
It's just "cheap speakers" bad, though, not anything that would suggest an issue with the sound output from the M1. I've used the LG with a few different Macs, and sound quality is the same from any of them.
edit: but see the comment elsewhere in the thread about the loop iteration time being off by a factor of 2.
It would be easier to test this explicitly instead of inside some unrelated RNG.
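A dedicated kernel for that could look roughly like the sketch below: one dependent chain that needs both halves of every product, and one that needs only the low half, so the two can be timed against each other. The function names are made up, `unsigned __int128` is a GCC/Clang extension, and the compiler decides whether mul and umulh end up adjacent, so the generated assembly still has to be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Dependent chain of full 64x64 -> 128 multiplies: each step needs both
   the low and the high half of the product (mul + umulh on AArch64). */
uint64_t full_mul_chain(uint64_t x, uint64_t k, size_t iters) {
    for (size_t i = 0; i < iters; i++) {
        unsigned __int128 p = (unsigned __int128)x * k;
        x = (uint64_t)p ^ (uint64_t)(p >> 64);
    }
    return x;
}

/* Same chain shape, but only the low half of each product is used
   (a single mul plus one xor per step). */
uint64_t low_mul_chain(uint64_t x, uint64_t k, size_t iters) {
    for (size_t i = 0; i < iters; i++) {
        x = (x * k) ^ k;
    }
    return x;
}
```

Both loops are latency-bound by construction, so this measures the latency of the multiply pair and not its throughput; running several independent chains side by side would be needed to probe throughput.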