Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.
This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if its wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!
Clever.
Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?
If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.
He's just published "Finding Simple Rewrite Rules for the JIT with Z3":
https://www.pypy.org/posts/2024/07/finding-simple-rewrite-ru...
So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?
I'm sure you could contrive a language where this functionality is exposed, but I'm struggling to come with an example where this would be seriously beneficial across multiple platforms.
I strongly suspect that integrating editors of existing languages with tooling that informs programmers on how a given chunk of code performs with parallel execution units would be far more beneficial than inventing a language dedicated to such concerns at this time.
[1]: https://github.com/edwardtufte/tufte-css/blob/957e9c6dc3646a...
It's serving two purposes here:
1) Providing us with a correct "guess" of what the next node is. 2) Ensuring that in all cases we're running from the L1 cache.
In real world code, you'd be correct -- getting things to run out of L1/L2 is the most important attribute. This is specifically about a micro-optimization that allows you to beat the obvious code even when running completely from cache!
The memory layout isn't changing in the faster versions, and there are no additional cache misses. It's easy to convince yourself that the only difference between the naive linked list and assuming linear layout is the extra pointer load - but TFA shows this is false! The execution pipeline incurs extra costs, and you can influence it.
There have been a lot of comments about the example presented being quite artificial and I agree but it is simple enough to help the reader understand what's happening and why it's faster.
In fact, it would be fairly common for the nodes in linked lists to be sequential in ram anyway. For example this code shows that the next node is easy to guess. The nodes do end up exactly in sequence in memory:
#include <stdlib.h>
#include <stdio.h>
typedef struct Node {
int value;
struct Node *next;
} Node;
Node *head = NULL;
Node *tail = NULL;
int main(int argc, char **argv) {
// Allocate some nodes
for (int i = 0; i < 100; i++) {
Node *new_node = malloc(sizeof(Node));
new_node->value = rand();
new_node->next = NULL;
if (tail == NULL) {
head = tail = new_node;
} else {
tail->next = new_node;
tail = new_node;
}
}
// Print their locations in memory
for (Node *current = head; current->next != NULL; current = current-> next) {
printf("%p\n", current);
}
}And I think it only works if Spectre mitigations are disabled anyway?
What the trick does is replace sequential fetches (where each fetch address depends on the result of the previous fetch because, well, linked lists) with parallel fetches. It takes the minimum fetch-to-fetch latency from a L1 cache hit (roughly 3 cycles IIRC) to a cycle or less (most CPUs can do multiple parallel fetches per cycle).
If your data is stored in a vector or a B-tree, accesses are already parallel by default and you'll never need this trick.
I don’t think there is any reasonable scenario where you would be using a linked list but the memory is contiguous most of the time.
(There are libraries that implement sparse matrixes in a more memory efficient way. I needed them for a program in Python, but I'm not an expert in Python. I found a few ways, but they are only useful for big matrixes with very few coeficients and have other restrictions to get an improved speed. I finaly gave up and used a normal np matrixes.)
https://mazzo.li/posts/value-speculation.html#value-speculat...
In a happy case, those will be laid out sequentially in memory so you can guess the value of the pointer easily.
(That said your comment still stands, since using linked lists in the first place is much more rare). But I suppose there's probably a lot of other domains where you might have a performance critical loop where some hacky guessing might work.
uiCA is a very nice tool which tries to simulate how instructions will get scheduled, e.g. this is the trace it produces for sum3 on Haswell, showing the fusion: https://uica.uops.info/tmp/75182318511042c98d4d74bc026db179_... .
Incidentally, value speculation (or prediction) is a way to break causality in concurrent memory models.
The nodes are adjacent in sum2, and sum2 executes more instructions than sum1, and sum2 is faster than sum1.
The nodes are adjacent in sum3, and sum3 executes more instructions than sum1, and sum3 is faster than sum1.
I want to understand, in:
uint64_t sum5(Node *node) {
uint64_t value = 0;
Node *next = NULL;
for (; node; node = node->next) {
for (;;) {
value += node->value;
if (node + 1 != node->next) {
break;
}
node++;
}
}
return value;
}
How is this magic: if (node + 1 != node->next) {
break;
}
Better than the more intuitive: if (node->next == null) {
break;
}
Also.. why is `node + 1' even a valid comparison? (Please forgive my rusty C/C++, college was awhile ago)In C and C++, when you add 1 to a pointer, you actually make it point to the next object of that size in memory, e.g. if I have a pointer to a 4-byte integer at address 0x8000, incrementing it will make it point to address 0x8004.
Because arrays are contiguous chunks of memory, you can use this to iterate through arrays.
int array[50];
int *p = &array[0]; // Take the address of the first array element
int sum = 0;
for (int i = 50; i != 0; i--) {
sum += *p;
p++; // Point to the next element
}
After this loop, p will be equal to &array[50], which is one past the last element of the array, because the loop will have run 50 times, and p is incremented once per loop.What OP did is allocate an array of linked list nodes, and test to see if the next linked list node was actually just the next array element.
while (node) {
value += node->value;
next = node->next;
node++; // Line 101
if (node != next) {
node = next;
}
}
// Compiler warning
// Warning: Lines 101 102 103: always true evaluation combined
The pivot here is moving the fallback out of the tight inner loop, this also introduces a new statement block. The additional loop exit condition is dependent on values in memory and cannot be optimized out; since it's now directing flow control directly instead of always assuring the value of node becomes node->next (an indirect method of flow control). while (node) {
while (node) {
value += node->value;
if (node + 1 != node->next) break;
node++;
}
node = node->next;
}- Create the list in some weird order. You know you're going to traverse it a bunch, so you sort it.
- The list is in creation order, and creation is in allocation order. Once in a while you go back and insert or delete a small handful of nodes, hence the linked list.
- There is a natural sort order that you can make contiguous, but you support relatively rare alternative orderings and optimize for the natural order.
Then again, I pretty much agree with you. I think it's a clever trick, but I can't come up with a time when I would use it. Largely that's probably because if the memory is contiguous most of the time, then you should probably be using a vector instead. You can insert/remove by shifting stuff around to handle the rare cases that require the linked list. If performance cliffs are a problem, you can mitigate with a segmented array.
Someone mentioned sparse vectors for example, where you are guessing 0 rather than pointer++. Anytime where a value is somewhat predictable this could come in handy.
* I tend to default to a simple vector because it's denser, more CPU cache friendly, less pointer chasy.
* If I really wanted a linked list to avoid occasional expensive ops, I probably would want to be able to grow it without reallocation (much less touching up all the old pointers), so they might all be in separate allocations, and most general-purpose memory allocators won't give me the nice sequential-ness.
* And then I'd probably end up with an unrolled linked list (probably the same thing you're describing as a "segmented array"?), which would reduce the pointer chasing and also make better use of the memory bandwidth through better density, so it'd outperform this trick.
* I guess if I really wanted to be able to reorder stuff within a vector cheaply but was still comfortable with the amortized constant time for growth, I might have represent the links as indices within the vector (or in a parallel vector). It'd be useful then. Maybe something like a linked hash map too. But if those ops are common, the 100% accurate prediction of this benchmark is way too optimistic. Could sort it sometimes as you mentioned, but that seems even more niche and fussy.
There might be other cases of this trick that I'm more likely to use than this data structure, but I'm scratching my head to think of them.
Ah yes, fun and safe math optimizations.
The version in the Compiler Explorer wouldn't work on AArch64 without an -mcpu flag and I didn't know what to pass, so I copied -mcpu=cyclone from https://djolertrk.github.io/2021/11/05/optimize-AARCH64-back.... You'd have to look up the correct one for your Mac's CPU.
In al old case they like 1000x1000, like a block checkboard where one of the colors had only 0, so like 50% full, but the blocks had different sizes. It was an old project in Fortran, and we used a "for"(do) for the blocks and another inside each block.
Bottlenecks are introduced by dependencies, like if an instruction modifies one register and the next one immediately uses its value. With a pipeline, the result of the previous instruction might not be written back to the register in time for the current instruction to read the correct value from it, causing the CPU to execute the instruction incorrectly. So, you have to stall the pipeline and do nothing until that register is written to.
Memory loads cause bottlenecks for similar reasons. In fact, loading memory is usually the slowest instruction.
One way of getting around this is to execute instructions out of order, but do it in such a way that it looks like it's executing in-order. For dealing with branches, you can speculatively execute both branches, but only commit one result to the CPU state.
By testing whether the next node in memory was actually the next node in the list, the CPU could speculatively start executing the next iteration of the loop body as though it were; if it turns out not to be, then it can just not commit the results. If it is, then you've avoided a pipeline stall.
Note: I'm not a CPU expert; I just took a class in computer architecture recently. I could be mistaken about something, so feel free to correct me.
It seems unfortunate the compiler isn't able to realize "no further assignments are ever performed on node->next, it is side-effect dependency free, optimize the spec exec accordingly". Though, how often would this be the case in a real-world program of actual utility.. probably rare.