Print(“lol”) doubled the speed of my Go function(medium.com) |
Print(“lol”) doubled the speed of my Go function(medium.com) |
Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn’t appear to have decreased by very much. “Time per op” has decreased by half in max_lol(), but the total number of ops being performed has likely increased, too - Specifically, extra work was done “for free” (As was shown in min_max).
This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!
In this experiment, maxV is potentially being written to with each iteration of the loop. Valid execution of the next iteration requires us to wait on that updated value of maxV. This comparison and write takes longer than just running an instruction - it’s a stall point.
In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there’s nothing else it can be doing at the time.
In the other profiles, we see a more “honest” picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON’T rely on knowing maxV - printing “lol”, or doing the other compare and write for minV.
I propose that the processor is doing our added work on every iteration of the loop, no matter what (And why not? Nothing else would be happening in that time). That BLT instruction isn’t making things faster, it’s just deciding whether to throw away the results of our extra work or keep it.
Throughput is important, but not always the same thing as speed. It’s good to keep that in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame stuff like cache misses or other stalls on a (relatively) innocent compare instruction!
> Following standard practice, I use the benchstat tool to compare their speeds
That tool (at least as used here) would be suitable for comparing execution of the same code between two processors with the same architecture. For comparing different programs on the same architecture, you need a different tool that focuses on total execution time.
Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325
It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.
There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.
Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.
This sounds... wrong? Unless ARM64 is designed in an absurd way?
I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.
If you are interested in this sort of thing, check out comp.arch!
The reasonI think this is because: Most languages target C or LLVM, and C and LLVM have a fundamentally lossy compilation processes.
To get around this, you'd need a hodge podge of pre compiler directives, or take a completely different approach.
I found a cool project that uses a "Tower of IRs" that can restablish source to binary provenance, which, seems to me, to be on the right track:
https://github.com/trailofbits/vast
I'd definitely like to see the compilation processes be more transparent and easy to work with.
But these kind of tricks feel like we need to con the compiler into optimising this correctly, which is of course ridiculous. What we probably need instead is if-statements that we can tell what's most likely the correct prediction.
Something like:
if v > maxV predict true
maxV = v
continueSo really you end up having to make assumptions about the input to get the performance boost.
Unfortunately, the issue here is that the performance depends on the input and so such hints wouldn't help (unless you knew a-priori you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?
Edit to add: does removing it make any difference?
As other comments point out, this construct can be replaced by a cmov instruction:
if a > b:
b = a
The following construct however, cannot be replaced by cmov: if a > b:
b = a
continue
Only by first eliminating the pointless "continue" is this replacement valid. But by including it, you can make it look like it's the 'print("lol")' is what makes the difference, which is only true lexically.https://godbolt.org/z/ds1raTYc9
https://godbolt.org/z/rbWsxM83b
The `print("lol")` output looks remarkably different.
It is there only to match the continue in the second code sample, where it is needed.
I'm curious if the performance difference noted in the article happens on Intel/AMD as well...
Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm
I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences will negate most usefulness of the abstraction.
EDIT: found the document talking about changing the linker: https://docs.google.com/document/d/1D13QhciikbdLtaI67U6Ble5d... . Favorite quote:
> The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.
...which is referring to Ken Thompson I guess.
My guess is some kind of measurement error or one of the "load bearing nop" phenomena. By that I mean the alignment of instructions (esp branch targets?) can dramatically affect performance and compilers apparently have rather simplistic models for this or don't consider it at all.
> We don't want to complicate the language
So I can understand if this complicates the implementation but I don't know if totally optional pragmas or annotations complicates the language itself. Like C has this but I don't think people say "Ah C is alright but the pragmas are a bit confusing and make things complicated".
> experience shows that programmers are often mistaken as to whether branches are likely or not
Your average programmer may mess that up, but those who would give optimisation hints aren't quite your average programmer. And insisting on introducing PGO to your build process (so build, run-with-profile, rebuild-with-profile) for some cases where someone isn't mistaken as to whether branches are likely (or whether some loops run minimum X times, etc) feels a bit needless.
Please remember though that I'm neither a Go programmer nor contributor so I'm really just an outsider looking in, it could be that this is a total non-issue or is really low-priority.
Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a loop transforming data that took ~30 minutes or so. The solution could've been to call Gosched, but optimizing the loop to a few seconds turned out to be easier.
I assume the inverse - the go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.
-32.31% geomean across the different tests looks rather great. Any ideas how to make it even faster?
That could be entirely branchless, right?
[1] https://stackoverflow.com/questions/40196817/what-is-the-ins...
In the original, it essentially faces
If <cond>:
Mov
Else:
Noop
Considering these branches unpredictable, it generates a CMOV.With
If <cond>:
Mov
Else:
Print
It now considers the first branch hot and the second cold, and thus branch predication valuable, and generates a branch instead.Turns out for the use case choice (1) is a misfiring, as the branch is extremely predictable, so all the conditional move does is create an unnecessary data dependency.
It’s not necessarily the wrong choice in general, as for unpredictable branches cmov will usually be a huge gain, they’ll incur a cycle or two of latency but save from 15+ cycles of penalty on a mispredicted branch (which if the prediction only works half the time is an average 7.5 cycles per iteration).
You can find older posts which demonstrate that side of the coin e.g. https://owen.cafe/posts/six-times-faster-than-c/
Happens to be extremely predictable for this data. In general, over all possible inputs, it’s not extremely predictable.
If you assume all inputs are different (not something the compiler can assume, of course) the probability of having to update the max value goes down from 1 for the first iteration to 1/n for the last, so, possibly, the loop should be split into two halves. Go through the start of the sequence assuming the value needs updating more often than not, and switch to one where it doesn’t at some point.
For truly large inputs you could even add heuristics looking at how much room there is above the current maximum (in the limit, if you’ve found MAX_INT, you don’t have to look further)
Sorting programs used to have all kinds of such heuristics (and/or command line arguments) trying to detect whether input data already is mostly sorted/mostly reverse sorted, how many different keys there are, etc. to attempt avoiding hitting worst case behavior, but I think that’s somewhat of a lost art
Case in point, I'm slowly being replaced by Salesforce muppets for all my projects at work. They're little code monkeys with amazon ebook type knowledge, projects cost 20x more and I look like the mad scientist for speaking the truth. The products are worse in every possible metrics, I'm not crazy. The politics at play is the reason why I'm losing ground, not logic.
Cabinet designers are being replaced by Ikea flat pack artists in the software world. All we can do is stand by and watch.
And in regards to this blog, when Medium eventually go, that knowledge will go too. Blogs have died, personal websites as well, and their ability to be found in Google is almost non-existent.
Sorry I don't have anything more positive to add, except maybe that they're still there, slowly being alienated by the modern tech world!
Partly because that's often not what we're supposed to do; the stuff under the hood "just works" and we're meant to use it to write features, not worry about optimising the stuff that happens under the hood.
And partly it's because the stuff under the hood is increasingly weird and bizarre. Branch prediction is weird, and I still don't understand why that extra print statement changes the branch prediction. Why does it predict `v > maxV` is true when the alternative is to print something, but it doesn't predict that when the alternative is to do nothing?
Is it because printing is expensive, and therefore the branch predictor is going to strongly prefer avoiding that? It's weird that we'd basically have to deceive our code into compiling into a more performant form.
I don't want to have to second guess the compiler.
But under the hood there is a hood. And under that hood there is another hood. And under that hood there is a brand new car you don't know how to open the hood, and so on.
I cannot devote my life to know everything or I won't be able to provide for my family.
That analogy doesn't really work if the new projects "cost 20x more" upfront, unless you mean long-term associated costs.
My grouchy mother in law also laments that they teach typing in school and not handwriting.
Lamentable? I guess. But it’s not realistic to hope people just learn more every generation.
Despite the tradition of “kids these days with their frameworks and libraries!” complaining, you want to see this evolution.
Well, layering abstraction is the most effective way to deal with complexity. Our brain is too limited to know everything.
The people, who understand "under the hood", probably won't know much about what's "above the hood", like writing a website or mobile app. Therefore, we need experts on every layers.
You encounter far far more dead-ends than anyone ever says, and every unsolved mystery is a mild nerd snipe, an open case, that years from now you'll see someone else explain something you realise it answers that question from years prior.
For me, the hard bit is not over-indexing on this... you learn things, but biasing too much for them is a sure fire way to over-engineer or increase complexity to the point where something is now worse for you knowing something. But once in a while that tiny thing you learned years before is a 20% savings across the board with associated performance increase and everyone wondering how on Earth you could possibly have made those jumps.
Also related... incidents. "Why" and "I'm going to find out" is the best way these things don't recur in future. A high degree of observation and understanding is a happy engineer life as it can improve what can often be the most stressful parts of the work (on-call, etc).
That XKCD comic about everyone learning something for the first time factors too... there is stuff you know that others do not, share it.
For me that means that in our world of computers there is infinite curiosities to discover. (Not that the same isn't true for the natural world too)
And it drives me nuts dealing with people who don't think this way. I'm not a jerk about it, but my personality is 'lets what all we can figure out the why'.
These skills stay with you, and if you read articles like this then you can keep broadly up-to-date with the insanity that is current CPUs. Things like pipe-lines, branch prediction, and different levels of cacheing are optimisations that you can acquire as you go.
If you're an auto-didact web developer then you never have the opportunity to learn these skills, or the need to do so.
I know a lot of people who are comfortable with doing this, but in my case it's a generational thing. If you want to do it then you can. It's not hard, it's just a different skill from those you already have, though sufficiently related that you wouldn't be starting from scratch.
But starting with modern CPUs can be hard. Learning the basics from older, simpler CPUs can help. Doing some kind of embedded programming might be the way to get started, or working on an emulator.
As always, YMMV.
Not sure if it's still asked, but reminds me of a class interview question. "When a user presses a button on a web page, describe in as much detail as possible what happens."
That being said, of all the people at all of the tech companies I've worked at, maybe ~5% of them had this sort of mentality and drive to execute on it.
And remember Spectre and Meltdown? Security vulnerabilities caused by branch prediction. If I recall correctly, the pipeline was executing code it wasn't meant to execute because it's executing it before it knows the result of the check that decides if it has to execute it.
Programming is a lot easier if the actual control flow is as linear as I'm writing it.
My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days. I feel like I can't trust them anymore.
Seriously that threw me, and maybe it makes sense in this context but it seems strange for someone with such an apparent depth of technical knowledge leaning on an LLM for anything.
i don't see any assembly here. the analysis is done by using a profiler. a very common tool available for most programming languages.
https://timotijhof.net/wp-content/uploads/2020_profiling_fig...
People who tend to care tend to talk about what they care about. So, you wind up getting a lot of blogs who sort of self select themselves to do this kind of stuff. And everyone feeds off that energy and gets better :)
If you visit #c / #asm on any popular IRC network, you'll find a lot of skilled people that can do this routinely.
Also, if you read the other subthreads, there are a number of points to be criticized about this writeup.
The CPU's branch predictor won't be able to perform that kind of algorithmic analysis, but patterns like the above also work reasonably well for simpler heuristics like 'predict the same outcome as the last time the branch was taken'.
#include <algorithm>
using std::max;
int max_array_func(int values[], size_t values_count)
{
int max_value = values[0];
for (size_t j = 0; j < values_count; j++)
{
max_value = max(max_value, values[j]);
}
return max_value;
}
int max_array_bittwiddling(int values[], size_t values_count)
{
int max_value = values[0];
for (size_t j = 0; j < values_count; j++)
{
int x = max_value;
int y = values[j];
// http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
max_value = x ^ ((x ^ y) & -(x < y));
}
return max_value;
}
int max_array_branch(int values[], size_t values_count)
{
int max_value = values[0];
for (size_t j = 0; j < values_count; j++)
{
if (max_value > values[j])
{
max_value = values[j];
}
}
return max_value;
}> My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days.
That seems like an overreaction. For most applications, cache locality and using the right algorithms is more important, if performance is an issue at all.
Speculative/optimistic techniques are not limited to branch prediction, you encounter various forms of it pretty much everywhere, without knowing it.
* Your hard drives have a read ahead buffer, fetching in advance future data, just because most reads are immediately followed by an other one for the data just after.
* Your CPU instructions are pre fetched, because most instructions are _not_ jumps, so you will 99% of the time just execute the instruction right after it.
* Instruction reordering where your compiler will decide that future instructions not affected by previous instructions could be run in advance.
* Overall any kind of caching is some form of optimistic technique. You are preparing future results.
If you think about it, optimistic/speculative techniques are ubiquitous, and used even at _high_ abstraction levels.
The famous python mantra of "better ask for forgiveness than permission" embodies that spirit. It encourages a coding style of "try: do() except: nope()" rather than "if check: do()".
Standing "against optimistic/speculative techniques" is standing against transactions, rollbacks, caches. It's just not a viable line of thinking IMHO.
Such broad "optimization rules" usually don't make much sense on any compiler created in the last 25 years. An optimizing compiler will analyse and optimize your program's control flow at a very high level, not just naively insert a conditional branch for each if (and that's also true in "low level" languages like C)
If you actually care about such things, you really need to look at the compiler output and incrementally tweak your code (but then other compilers or even compile options might destroy such tweaks again), there hardly is a single optimal source code solution across CPU architectures or even just different CPU models of the same architecture (just don't do any obviously stupid things which make the life of the compiler harder than need be)
I see COTS products using ldap memberof queries without LDAP_MATCHING_RULE_IN_CHAIN and stating definitively in their documentation that nested groups are bad (despite decades of best practice).
I see product documentation recommending authenticating against LDAP instead of kerberos, despite the underlying libraries having full kerberos support.
I see sslverify: no, and flags to ignore SSH TOFU warnings, and recommendations to avoid SSH gssapi-keyex (WHY?????), and security approached by buying ever more products creating ever more complexity.
Yes, things "just work" in a horrible, 'youre stuck with your vendors forever' sort of way that results in lengthy outages every 6 months due to mounting, intractable technical debt. But things don't have to be this way, you just need people who are willing to ask "why" or "is that necessary" or "can it be better".
No one wants to have to second guess the compiler. And it's almost never the compiler, until it is. As a grad student, I remember someone else in our lab spending a long time on a particular issue until he tracked it down to a not yet reported bug in gcc.
99% of the time you can rely on the compiler (or the language runtime, or popular library X), but we also need to be able to identify the rare cases where the lower levels of the stack are broken and how to either fix or work around them.
no one is asking you to do this optimization all the time. But you need to know what and how, if you were to be tasked. Then as you get more senior, you will have to also be able to judge whether it is possible, or worth the effort to devote to the optimization, should a scenario arises.
> second guess the compiler.
you aint second guessing it, you're examining the output, and understanding it. Then potentially recognizing when or if the compiler is outputting bad stuff. That's not second guessing.
The article is under-informing you: they should have shown the full disassembly for both loops, not just the bit in question. I suspect the explanation as given is slightly wrong, but there's no way to disprove it without replicating the setup myself.
"So far, we have seen examples that use branches to handle decisions. The A64 instruction set also provides conditional select instructions. In many cases, these instructions can be used as an alternative to branches."
Seems CSEL is not a branch.
Here it turns out to be a very bad choice, because it creates unnecessary data dependencies. But a cmov would be a fine choice if the branch was impossible to predict (e.g. if the input was random, well even then I’d expect the limit to mostly creep up so it should be predicated as mostly cold).
The compiler was from a limited subset of TRS-80 BASIC to Z80. It was written in its own subset, so it could compile itself.
I had 16K of RAM, and that had to fit the BASIC source, the compiled version, and then that compiled version compiled the BASIC source to another compiled version in a different location, which could then be saved on tape.
Interesting times, and so foreign to today's experiences that it's hard for people to follow the details. I could write it up, but I'm pretty sure no one would derive any value from reading it. I sometimes wonder if the hand-written syntax flow charts and first draft are still somewhere in my piles of undiscarded papers. I doubt it ... it as 1979 when I was doing it, and I've moved continents since then.
Every high level library is an abstraction and abstractions are leaky.
I'm not saying you have to be able to write assembly by hand. I'm saying you should understand why your code runs 10x slower in docker.
(besides, learning to read a bit of assembly code never hurt anyone)
Knowing how things work fundamentally is what matters the most, as everything else can be inferred and deduced from that. This is universally true. When you know how something works, you can build up on that.
There's no value for a programmer below how processors work with registers, caches and bus accesses. The most fundamental interface between the programmer and the computer are the registers, caches and interrupts, which make everything else happen.
Understanding how these are being used practically and properly covers a LOT of ground to build up upon.
On the other side you have the people who are fully dependent on compilers to do their job for them. They don't write code for the machine, they write code for the tool.
They don't actually know what they're doing, they have no idea how it works and they believe that the compiler optimizes it all properly anyway, as if it's magic, with no fucking clue what the fuck they're actually doing, while patting themselves on the back for living in dependency.
Why do programmers need garbage collection? Because people have no idea how not to need it. Why are pointers considered bad? Because some assholes decided that telling everyone they're too complex to use was a smart idea instead of teaching them properly, which, of course, it fucking wasn't.
If people would actually know how to do proper memory management, we'd had far less security issues from the get go instead of requiring patches for software that was written in a stupid way, ignorant of memory mapping, pages, etc ... you know, fundamental knowledge and understanding of what's actually happening.
It is considered brilliant to use mmap for faster, more memory efficient file access ... which is absolutely ridiculous, because it exposes how little programmers actually know about the fundamentals!
Using mmap for sharing information between processes and reading and writing files has been my go-to since forever, simply because every other way is literally doing it wrong.
Anyone saying that not every programmer needs to know all of this is missing the point. Of course you don't need to, but the status quo is built upon exactly that line of thinking.
Are you better off being able to, at least moderately, repair your car yourself, or are you better off depending on someone else to do it? The answer is obvious, and the same applies to literally everything.
Less dependency should always be the top priority, because less dependency leads to better understanding and increased freedom ... but we're living in a world that's doing the exact opposite.
Gonna stop writing now, because there's no end to this.
I am confused by this behaviour, and although I definitely don't know what the answer is here; the non-lol version does have a CSEL (https://developer.arm.com/documentation/dui0802/b/CSEL) which is totally missing from the lol version.
Non-lol https://godbolt.org/z/ds1raTYc9
The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)
However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!
On the other hand, conditional moves would be much faster for a loop like:
bigs := make([]int, len(numsA))
for idx, v : range numsA {
if numsA[idx] > numsB[idx] {
bigs[idx] = numsA[idx]
} else {
bigs[idx] = numsB[idx]
}
}
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).
At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.
I don't know how Go codegen works, but I assume they do something similar.
Would you mind expanding? If the conditional move isn't needed, and given that it's costly in terms of perfs, how is that “totally sensible” to have one here?
My best guess is that it's because of the added cost to the other branch: printing is expensive, so the compiler really doesn't want to do that, and would prefer to incorrectly predict not-printing than to incorrectly predict printing.
If that tradeoff is what causes the difference in behaviour here, then I understand. But I don't like it, because it's ridiculous and we shouldn't have to con the compiler into doing what's right. If we're going to have to do this, I'd rather have the language provide a way for us to make explicit what we expect here.
if a > b
b = a
continue
else
continue
you can replace the entire if-else with a cmov. but if there is a function call in one of the prongs and not in the other, you cannot.Since the function-with-if version requires a branch, there's no advantage (regardless of predictability) in using a conditional move for the assignment to maximum.
Also my understanding based on reading this thread is that a conditional move is harder to speculate across because of the data dependency carried across iteration loops, preventing speculation. With a conditional branch, speculation can keep executing the next loop and most of the time the branch predictor will guess correctly how to execute the next iteration for this problem.
maxV = v
maxV = v
maxV = v
maxV = v
over and over (with tests running in parallel as well, but those matter less, because they are mostly just "checking" whether the branch predictor guessed correctly.).Then, in that above sequence, there is no /read/ of maxV after each write, so there is no true data dependency. (We don't need to wait for the last write to finish before we write again, unlike the read after write case.)
What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.
I don’t have enough compiler knowledge to be able to definitively answer your question, but I think it would be a reasonable heuristic to favor emitting a branch over a cmov in cases where the cmov participates in a loop-carried data dependency.
The other advantage of emitting a branch is that branch prediction can better adapt to runtime conditions. Branch prediction tends to struggle when data is perfectly random, but most data is not. So in general it’s also reasonable for compilers to default to emitting branches.