Print(“lol”) doubled the speed of my Go function (medium.com)

255 points by ludiludi 2 years ago | 126 comments

Syrail_ 2 years ago |

I think this is a case of mis-assigned blame (on the tool’s part, not the author’s). My semi-educated guesswork follows:

Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn’t appear to have decreased by very much. “Time per op” has decreased by half in max_lol(), but the total number of ops being performed has likely increased, too. Specifically, extra work was done “for free” (as was shown in min_max).

This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!

In this experiment, maxV is potentially being written to with each iteration of the loop. Valid execution of the next iteration requires us to wait on that updated value of maxV. This comparison and write takes longer than just running an instruction - it’s a stall point.

In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there’s nothing else it can be doing at the time.

In the other profiles, we see a more “honest” picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON’T rely on knowing maxV - printing “lol”, or doing the other compare and write for minV.

I propose that the processor is doing our added work on every iteration of the loop, no matter what (And why not? Nothing else would be happening in that time). That BLT instruction isn’t making things faster, it’s just deciding whether to throw away the results of our extra work or keep it.

Throughput is important, but not always the same thing as speed. It’s good to keep that in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame stuff like cache misses or other stalls on a (relatively) innocent compare instruction!

dmurray 2 years ago | |

Yes, this part seems wrong

> Following standard practice, I use the benchstat tool to compare their speeds

That tool (at least as used here) would be suitable for comparing execution of the same code between two processors with the same architecture. For comparing different programs on the same architecture, you need a different tool that focuses on total execution time.

dataflow 2 years ago |

I read it and I still don't get it, can someone (re-)explain what the presence of the print() is doing that is helpful for branch prediction (or any other aspect of the CPU)?

Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325

Calavar 2 years ago | |

I read it three or four times. It's never explained. If the print("lol") version has a branch-less-than, what does the regular version have? It must either be a branch or a conditional move, but we aren't shown that part of the assembly. You can't reach any conclusions about why one version is faster if you don't know what you're comparing to.

kevincox 2 years ago | |

The high-level view is that adding code that never gets executed causes the compiler to emit code that the CPU predicts better. IDK if this is the compiler assuming that the `print()` call is cold or the branch predictor getting luckier by chance but basically this tickles the CPU in the right way to get better performance.

It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.

trolan 2 years ago | |

I'm in school, so this may be oversimplified, but if the processor/assembly code is predicting the next result, it gets the result faster. The processor only does this prediction with conditional branches. The extra if for printing or finding the min invokes the prediction with the accuracies stated.

tallanvor 2 years ago | | |

No, this is not true.

There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.

Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.
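
A minimal sketch of that suggestion, assuming an append-only collection: a hypothetical `Tracked` type (the name and methods are illustrative, not from the article) updates its running min and max on every insert, so reading either is O(1). Supporting deletions would need something heavier, such as a heap or a balanced tree.

```go
package main

import "fmt"

// Tracked is a hypothetical append-only collection that keeps a running
// min and max, so neither requires a scan at read time.
type Tracked struct {
	vals   []int
	lo, hi int
}

// Add appends v and updates the running extremes.
func (t *Tracked) Add(v int) {
	if len(t.vals) == 0 || v < t.lo {
		t.lo = v
	}
	if len(t.vals) == 0 || v > t.hi {
		t.hi = v
	}
	t.vals = append(t.vals, v)
}

// Max returns the largest value added so far; ok is false when empty.
func (t *Tracked) Max() (v int, ok bool) { return t.hi, len(t.vals) > 0 }

// Min returns the smallest value added so far; ok is false when empty.
func (t *Tracked) Min() (v int, ok bool) { return t.lo, len(t.vals) > 0 }

func main() {
	var tr Tracked
	for _, v := range []int{3, -1, 7, 4} {
		tr.Add(v)
	}
	hi, _ := tr.Max()
	lo, _ := tr.Min()
	fmt.Println(lo, hi)
}
```

The trade-off is a couple of comparisons on every insert in exchange for never scanning the slice.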

dataflow 2 years ago | | |

> The processor only does this prediction with conditional branches

This sounds... wrong? Unless ARM64 is designed in an absurd way?

I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.

EspressoGPT 2 years ago | |

As to branch prediction, for anyone interested: https://stackoverflow.com/a/11227902

bakul 2 years ago |

Processor "optimizations" can produce surprising effects. The problem is these optimizations are not programmatically accessible to C (or most modern programming languages) given their simple memory model. Deterministic performance is not easy to obtain. My view is to not bother with such tricks unless absolutely necessary (and be prepared that your changes may actually pessimize performance on a future processor or a compatible processor by a different vendor).

If you are interested in this sort of thing, check out comp.arch!

aeonik 2 years ago | |

I've been researching this pretty deeply for the last few years, and I've come to the conclusion that, without a complete redesign, most popular programming languages cannot have direct control of these optimizations in an ergonomic manner.

The reason I think this is that most languages target C or LLVM, and C and LLVM have fundamentally lossy compilation processes.

To get around this, you'd need a hodgepodge of precompiler directives, or take a completely different approach.

I found a cool project that uses a "Tower of IRs" that can re-establish source-to-binary provenance, which seems to me to be on the right track:

https://github.com/trailofbits/vast

I'd definitely like to see the compilation processes be more transparent and easy to work with.

mcv 2 years ago | |

I would agree, but it's hard to argue with a factor 2 performance boost.

But these kinds of tricks feel like we need to con the compiler into optimising this correctly, which is of course ridiculous. What we probably need instead is if-statements for which we can tell the compiler what the most likely outcome is.

Something like:

  if v > maxV predict true
    maxV = v
    continue

Zinu 2 years ago | | |

It's only factor 2 with an increasing array though. At which point you can just take the last element, that's way faster.

So really you end up having to make assumptions about the input to get the performance boost.

cyphar 2 years ago |

In the Linux kernel, there are unlikely() and likely() macros which indicate to the compiler whether or not a condition is likely using __builtin_expect (which then influences the output assembly into producing code that should make the branch predictor do the right thing more of the time).

Unfortunately, the issue here is that the performance depends on the input and so such hints wouldn't help (unless you knew a-priori you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
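
One way to probe that question is to time the same loop on ascending and descending inputs and compare total wall time. A rough sketch, assuming hypothetical helpers `maxBranchy`, `cold`, `ascending`, `descending`, and `timeIt` (none of them from the article); a real measurement should use `go test -bench` with repeated runs, and the numbers will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"time"
)

var coldCalls int

// cold stands in for the print call; noinline keeps it a real call.
//
//go:noinline
func cold() { coldCalls++ }

// maxBranchy mirrors the "lol" shape: the continue skips a call, which
// should keep the comparison a real branch (an assumption; check the
// disassembly for your toolchain).
func maxBranchy(nums []int) int {
	maxV := nums[0]
	for _, v := range nums[1:] {
		if v > maxV {
			maxV = v
			continue
		}
		cold()
	}
	return maxV
}

// ascending returns 0, 1, ..., n-1.
func ascending(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = i
	}
	return s
}

// descending returns n-1, ..., 1, 0.
func descending(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = n - 1 - i
	}
	return s
}

// timeIt runs maxBranchy reps times and reports the total elapsed time.
func timeIt(nums []int, reps int) (int, time.Duration) {
	start := time.Now()
	result := 0
	for i := 0; i < reps; i++ {
		result = maxBranchy(nums)
	}
	return result, time.Since(start)
}

func main() {
	const n, reps = 1 << 16, 200
	for _, in := range []struct {
		name string
		nums []int
	}{
		{"ascending", ascending(n)},
		{"descending", descending(n)},
	} {
		got, d := timeIt(in.nums, reps)
		fmt.Printf("%-10s max=%d total=%v\n", in.name, got, d)
	}
}
```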

nomel 2 years ago | |

It's nice using these to mark less-likely, but latency sensitive, paths, which is something that profiler guided optimization can't do.

zerr 2 years ago |

No explanation whatsoever. Why is the branch predictor not "invoked" in the first version of the function?

MauranKilom 2 years ago | |

Because it's most likely using a conditional move.

distcs 2 years ago | | |

But I don't see the post investigating this at all. Yes, most likely that is what is going on, but I don't understand what the point of the OP's post is if the real reason for the difference in branch predictor behavior is not explained.

perryizgr8 2 years ago |

Why would an unconditional print have any effect on whether the branch predictor is invoked or not? The if statement is there in both cases, so branch prediction should kick in for both. I didn't find an explanation for this behaviour in the article.

zhzy0077 2 years ago |

I'm a noob. Looking at the disasm: https://godbolt.org/z/766aPTPc3 It turns a CMOVQLT into a JLT. Is the blog saying CMOVQLT doesn't have branch prediction? I don't get it.

ludiludi 2 years ago | |

Your disasm is for x86-64. The benchmarks in the blog were run on an M1 MacBook Pro, which is an ARM64.

zhzy0077 2 years ago | | |

Sorry. My bad. But looking at ARM64: https://godbolt.org/z/YEjGKce1Y The difference is CSEL vs. BLT. The question still stands. Does CSEL have no branch prediction?

schemescape 2 years ago |

Why is there a "continue" at all in the first code sample?

Edit to add: does removing it make any difference?

nemetroid 2 years ago | |

The inclusion of "continue" in the non-lol version is pointless and obscures the actual reason for the difference: the addition of the non-pointless "continue" in the lol version.

As other comments point out, this construct can be replaced by a cmov instruction:

  if a > b:
    b = a

The following construct however, cannot be replaced by cmov:

  if a > b:
    b = a
    continue

Only by first eliminating the pointless "continue" is this replacement valid. But by including it, you can make it look like the 'print("lol")' is what makes the difference, which is only true lexically.
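
In Go, the two shapes look roughly like the sketch below. This is a reconstruction for illustration, assuming the hypothetical function names `maxPlain` and `maxLol` and a non-empty `[]int` input; it is not copied from the article. Whether the first loop actually compiles to a conditional select and the second to a branch depends on the Go version and target architecture, so check the disassembly before drawing conclusions.

```go
package main

import "fmt"

// maxPlain: the if-body is a single assignment, so the compiler is free to
// lower it to a conditional move/select instead of a branch.
func maxPlain(nums []int) int {
	maxV := nums[0]
	for _, v := range nums[1:] {
		if v > maxV {
			maxV = v
		}
	}
	return maxV
}

// maxLol: the continue now skips real work (the print), so the if has to
// stay a genuine branch around that call.
func maxLol(nums []int) int {
	maxV := nums[0]
	for _, v := range nums[1:] {
		if v > maxV {
			maxV = v
			continue
		}
		print("lol") // never reached on strictly increasing input
	}
	return maxV
}

func main() {
	nums := []int{1, 2, 3, 5, 8}
	fmt.Println(maxPlain(nums), maxLol(nums))
}
```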

fmstephe 2 years ago | |

According to the Godbolt Compiler Explorer, removing the `continue` makes no difference to the generated assembly.

https://godbolt.org/z/ds1raTYc9

https://godbolt.org/z/rbWsxM83b

The `print("lol")` output looks remarkably different.

https://godbolt.org/z/c3afrb6bG

ludiludi 2 years ago | |

Good question. As you can see in the comment in the github repo, it has no effect. https://github.com/ludi317/max/blob/master/blog/max_test.go#...

It is there only to match the continue in the second code sample, where it is needed.

schemescape 2 years ago | | |

Thanks! In that case, I have to say I'm surprised. I assumed the code generated for the loop would have an instruction that branches, so adding another branching instruction could only hurt (edit: not necessarily a lot), but apparently my intuition is wrong.

I'm curious if the performance difference noted in the article happens on Intel/AMD as well...

Liquid_Fire 2 years ago |

I was curious what this strange assembly language was, as it looked like neither Arm nor x86.

Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm

I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences will negate most usefulness of the abstraction.

rob74 2 years ago | |

I guess it's for historical reasons. As the document you linked states, "The assembler is based on the input style of the Plan 9 assemblers". It's important to know that at least two of the "founding fathers" of Go (Rob Pike and Ken Thompson) are ex-Bell Labs guys and were involved with Plan 9. The Plan 9 compiler toolchain was available, they were familiar with it, so that's what they used for Go. Some parts of the toolchain (the linker, I think) have been swapped out in the meantime, but the assembly format has stayed.

EDIT: found the document talking about changing the linker: https://docs.google.com/document/d/1D13QhciikbdLtaI67U6Ble5d... . Favorite quote:

> The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.

...which is referring to Ken Thompson I guess.

yencabulator 2 years ago | | |

It's not just historical, it's more "the same justification as back then".

yencabulator 2 years ago | |

Rob Pike's talk The Design of the Go Assembler from GopherCon 2016: https://www.youtube.com/watch?v=KINIAgRpkDA

deschutes 2 years ago |

The explanation is not convincing.

My guess is some kind of measurement error or one of the "load bearing nop" phenomena. By that I mean the alignment of instructions (esp branch targets?) can dramatically affect performance and compilers apparently have rather simplistic models for this or don't consider it at all.

smcl 2 years ago |

Does Go have any facility for providing hints to the optimiser (like how some C compilers support #pragmas) that could cause the branch-predicted instruction to be used rather than the slower one?

grose 2 years ago | |

Seems like the answer is no[1] and profile-guided optimization is recommended instead, https://go.dev/doc/pgo. I would be curious to see if pgo helps with the author's use case.

[1] https://groups.google.com/g/golang-nuts/c/1erdKe3aV5k

smcl 2 years ago | | |

Ah thanks! That's interesting but a bit weird to me. That response sounds a little bit like someone who feels like they shouldn't do something and is thinking on-the-fly for reasons they can use to justify that feeling.

> We don't want to complicate the language

So I can understand if this complicates the implementation, but I don't know that totally optional pragmas or annotations complicate the language itself. C has this, but I don't think people say "Ah C is alright but the pragmas are a bit confusing and make things complicated".

> experience shows that programmers are often mistaken as to whether branches are likely or not

Your average programmer may mess that up, but those who would give optimisation hints aren't quite your average programmer. And insisting on introducing PGO to your build process (so build, run-with-profile, rebuild-with-profile) for some cases where someone isn't mistaken as to whether branches are likely (or whether some loops run minimum X times, etc) feels a bit needless.

Please remember though that I'm neither a Go programmer nor contributor so I'm really just an outsider looking in, it could be that this is a total non-issue or is really low-priority.

_cenw 2 years ago |

Go can have unexpected performance differences way higher up in the stack.

Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a loop transforming data that took ~30 minutes or so. The solution could've been to call Gosched, but optimizing the loop to a few seconds turned out to be easier.

I assume the inverse (the Go compiler adding too many Goscheds) can happen too. It's not that expensive, just testing a condition, but if you do that a few million times, things add up.

[1]: https://pkg.go.dev/runtime#Gosched
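
For reference, an explicit yield inside a long-running loop might look like the sketch below. The `transform` function and the `yieldEvery` interval are hypothetical; `runtime.Gosched` is the real call linked above. Go 1.14 added asynchronous preemption, so on current releases this is rarely needed.

```go
package main

import (
	"fmt"
	"runtime"
)

// yieldEvery is a hypothetical interval; tune it by measurement.
const yieldEvery = 1 << 16

// transform doubles every element in place, yielding the processor
// periodically so other goroutines get a chance to run.
func transform(data []int) {
	for i := range data {
		data[i] *= 2
		if i%yieldEvery == yieldEvery-1 {
			runtime.Gosched()
		}
	}
}

func main() {
	data := make([]int, 200000)
	for i := range data {
		data[i] = i
	}
	transform(data)
	fmt.Println(data[0], data[len(data)-1])
}
```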

morelisp 2 years ago | |

The Go scheduler is now (well, for years) preemptive.

Exuma 2 years ago |

There’s a Go course that was really good about this level of nuance. He talks a lot about mechanical sympathy and how to dig into this in detail. I think it’s called Ultimate Go?

romshark 2 years ago |

I tried to come up with the most efficient implementation of this rather simple function that I could think of with pure Go without going down to SIMD Assembly: https://go.dev/play/p/zHFxwvWOoeT

-32.31% geomean across the different tests looks rather great. Any ideas how to make it even faster?

AshleysBrain 2 years ago |

Most languages have a `max` function, so the core of the loop could be written with just something like: `maxV = max(maxV, v)`

That could be entirely branchless, right?
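
In Go that might look like the sketch below. Go 1.21 added a built-in `max`; the hypothetical `maxOf` helper stands in for it here so the snippet also builds on older toolchains. Whether the result is actually branchless depends on how the compiler lowers the comparison for the target architecture, so this is not a guarantee.

```go
package main

import "fmt"

// maxOf is a hypothetical stand-in for a library or built-in max function.
func maxOf(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// maxSlice folds maxOf over a non-empty slice.
func maxSlice(nums []int) int {
	maxV := nums[0]
	for _, v := range nums[1:] {
		maxV = maxOf(maxV, v)
	}
	return maxV
}

func main() {
	fmt.Println(maxSlice([]int{4, 9, 2, 7}))
}
```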

assbuttbuttass 2 years ago | |

A max function still compiles down to some kind of branch

AshleysBrain 2 years ago | | |

I thought there were specific assembly instructions for this kind of thing, such as MAXSS in x86 [1], plus vector variants like SSE4 PMAXSD. Presumably it's possible the CPU can handle those with special branchless logic, depending on the compiler and CPU implementation. I guess you'd have to know about the CPU internals to know if the instruction is truly branchless, but it is branchless in the sense there is no conditional jump made in the assembly instructions.

[1] https://stackoverflow.com/questions/40196817/what-is-the-ins...

vsnf 2 years ago |

Kind of tangential, but who are these people who are so comfortable with disassembling a high level language binary, reading assembly, and then making statements about branch prediction and other such low level esoterica? I've only ever met people like that maybe two or three times in my career, and yet it seems like in every other blog post I read in certain language circles everyone is some kind of ASM and Reverse Engineering expert.