Beating the L1 cache with value speculation (2021) (mazzo.li)

248 points by nickdevx 1 year ago | 80 comments

Remnant44 1 year ago |

It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.
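A minimal sketch of the multiple-accumulator idea, assuming a plain array of 64-bit integers. The function names `sum_single` and `sum_four_acc` are made up for illustration, and the speedup is not guaranteed: for integer sums like this one a compiler may already unroll or vectorize the simple loop on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: one accumulator, so every add depends on the previous add. */
uint64_t sum_single(const uint64_t *xs, size_t n) {
  uint64_t acc = 0;
  for (size_t i = 0; i < n; i++) {
    acc += xs[i];
  }
  return acc;
}

/* Four independent accumulators: the four adds in the loop body do not
   depend on each other, so a wide core is free to overlap them. */
uint64_t sum_four_acc(const uint64_t *xs, size_t n) {
  uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    a0 += xs[i];
    a1 += xs[i + 1];
    a2 += xs[i + 2];
    a3 += xs[i + 3];
  }
  /* Combine the partial sums after the loop body. */
  uint64_t acc = (a0 + a1) + (a2 + a3);
  /* Leftover elements when n is not a multiple of 4. */
  for (; i < n; i++) {
    acc += xs[i];
  }
  return acc;
}
```

Both functions return the same result for unsigned integers, since unsigned addition wraps and is associative; only the dependency structure differs.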

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if it's wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.

xxpor 1 year ago | |

>Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?

johnbcoughlin 1 year ago | | |

Such optimizations may be forbidden by the fact that floating point addition is not associative unless you tell the compiler not to worry about that (I believe)
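A small illustration of why regrouping a floating point sum can change the answer, assuming IEEE 754 doubles and default (non-fast-math) compiler settings. The helper names `sum_left` and `sum_right` are hypothetical.

```c
/* Sum three doubles left to right: (a + b) + c. */
double sum_left(double a, double b, double c) {
  return (a + b) + c;
}

/* The same three values, regrouped: a + (b + c). */
double sum_right(double a, double b, double c) {
  return a + (b + c);
}
```

With `a = 1e20`, `b = -1e20`, `c = 1.0`, the first grouping gives `1.0` while the second gives `0.0`, because `1.0` is lost when added to `-1e20`. Splitting a sum across several accumulators regroups it in the same way, which is why a compiler generally will not do it for floats unless told that reassociation is acceptable (for example with GCC's `-ffast-math`).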

spockz 1 year ago | |

If this would be possible for an application, does it not make more sense to use SIMD instructions at that point?

gpderetta 1 year ago | | |

You want to use SIMD and multiple accumulators. In fact, not only do you want as many accumulators as there are SIMD ALUs, but since SIMD operations usually have longer latency, you also usually unroll SIMD loops for software pipelining, using even more accumulators to break loop-carried dependencies.
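A rough sketch of combining SIMD with more than one accumulator. It assumes the GCC/Clang vector extensions (`__attribute__((vector_size(...)))`), which are compiler-specific and not standard C; the type name `v4u32` and the function `sum_vec` are made up for this example, and two vector accumulators is an arbitrary choice rather than a tuned number.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit lanes in one 16-byte value. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

uint32_t sum_vec(const uint32_t *xs, size_t n) {
  /* Two independent vector accumulators, eight lanes in flight in total. */
  v4u32 acc0 = {0, 0, 0, 0};
  v4u32 acc1 = {0, 0, 0, 0};
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    v4u32 a, b;
    memcpy(&a, xs + i, sizeof a);     /* load that does not require alignment */
    memcpy(&b, xs + i + 4, sizeof b);
    acc0 += a;                        /* lane-wise add, independent of acc1 */
    acc1 += b;
  }
  /* Combine the accumulators, then reduce the four lanes to a scalar. */
  v4u32 acc = acc0 + acc1;
  uint32_t total = acc[0] + acc[1] + acc[2] + acc[3];
  /* Scalar tail for the last n % 8 elements. */
  for (; i < n; i++) {
    total += xs[i];
  }
  return total;
}
```

The sum wraps modulo 2^32 like any `uint32_t` arithmetic. How many accumulators actually pay off depends on the latency and throughput of the target core.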

pkhuong 1 year ago |

I see a lot of people asking for a real use case. If you follow the reference chain in the first aside, you'll find this blog post of mine https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-poin.... where we use value speculation to keep MOVS out of the critical path in an interrupt-atomic read sequence for hazard pointers.

Remnant44 1 year ago | |

Very nice! I was struggling to come up with a realistic real world scenario for value speculation, but this is a perfect example. Any kind of multi-thread/contention related code seems like an ideal target for making a guess-based fast-path corrected by a slow-path consistency check

gpderetta 1 year ago | |

Went for the speculation example, stayed for the interrupt-atomic memory copy. Great article.

mistercow 1 year ago |

It’s interesting to me that the final assembly-trick-free version almost no longer looks like a hack.

If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.

metadat 1 year ago |

In case your knowledge of the mechanics of `struct` vs `typedef struct` in C is rusty like mine, here are some nice refreshers:

https://stackoverflow.com/a/23660072

https://stackoverflow.com/a/1675446

kristianp 1 year ago |

Per Vognsen (referenced in this blog) is now found on Mastodon at: https://mastodon.social/@pervognsen

He's just published "Finding Simple Rewrite Rules for the JIT with Z3":

https://www.pypy.org/posts/2024/07/finding-simple-rewrite-ru...

https://news.ycombinator.com/item?id=40951900

12_throw_away 1 year ago |

This got me wondering - it's said that C is based on the lie that all computers have the same architecture as a PDP-11. (At least, I'm pretty sure I remember people saying that).

So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?

darby_nine 1 year ago | |

As far as I am aware, this saying is based on the reasoning behind C types rather than serious compiler considerations. In today's world such cpu-specific concerns are left to the compiler to figure out.

I'm sure you could contrive a language where this functionality is exposed, but I'm struggling to come up with an example where this would be seriously beneficial across multiple platforms.

I strongly suspect that integrating editors of existing languages with tooling that informs programmers on how a given chunk of code performs with parallel execution units would be far more beneficial than inventing a language dedicated to such concerns at this time.

naveen99 1 year ago | | |

I guess that’s what intel and amd were relying on, while nvidia let cuda programmers control the gpu cache explicitly.

kaba0 1 year ago | |

Well, C++ has the likely/unlikely attribute to somewhat prefer a branch over the other in the eye of the branch predictor, and C++, Rust and some other low-level languages do have native SIMD support (note: C doesn’t have an official one, just compiler-specific ones. So in this vein it is actually higher level than Rust or C++).

gpderetta 1 year ago | | |

Depends on what you mean by official. There are likely more compilers implementing GCC vector extensions than there are Rust compilers.

queuebert 1 year ago |

I appreciate the elegant blog design. Reminds me of Edward Tufte's books.

candiddevmike 1 year ago | |

It's impressive that it doesn't have a mobile view and still looks great.

ahoka 1 year ago | | |

What? It looks horrible.

AnthOlei 1 year ago | |

Ha, I think this site is styled by a single-sheet CSS called Tufte.css

notpushkin 1 year ago | | |

I don't think it is. In Tufte CSS, sidenotes are implemented using float: right [1], while here CSS Grid is used instead.

[1]: https://github.com/edwardtufte/tufte-css/blob/957e9c6dc3646a...

mwkaufma 1 year ago |

The optimization is the linear memory layout of the nodes -- value speculation is decoration.

Remnant44 1 year ago | |

The linear node layout is not the point at all.

It's serving two purposes here:

1) Providing us with a correct "guess" of what the next node is.

2) Ensuring that in all cases we're running from the L1 cache.

In real world code, you'd be correct -- getting things to run out of L1/L2 is the most important attribute. This is specifically about a micro-optimization that allows you to beat the obvious code even when running completely from cache!

pfedak 1 year ago | |

The example is poorly chosen in terms of practicality for this reason, but otherwise, no, this is a poor summary that misses something interesting.

The memory layout isn't changing in the faster versions, and there are no additional cache misses. It's easy to convince yourself that the only difference between the naive linked list and assuming linear layout is the extra pointer load - but TFA shows this is false! The execution pipeline incurs extra costs, and you can influence it.

mistercow 1 year ago | | |

I think a more practical example might have been to have a mostly contiguous list with a few discontinuous nodes inserted randomly in the middle. That’s more like a real case, and exercises the advantages of linked lists over simple arrays, but should still perform well, since there would only be a few value speculation misses.
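A sketch of that scenario, assuming all nodes live in one array so that `node + 1` is well-defined. The helper `link_mostly_contiguous` and its `gap` parameter are hypothetical: it links the array in order, then moves every `gap`-th node to the end of the list to create a few discontinuities. `sum_guess_next` guesses that the next node is the adjacent one and falls back to the real pointer otherwise. This only demonstrates that the two traversals agree; whether a compiler keeps the guess in the generated code is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Node {
  uint64_t value;
  struct Node *next;
} Node;

/* Link nodes[0..n) in array order, then detach every gap-th node and
   append it at the tail, so the list is mostly but not fully contiguous. */
Node *link_mostly_contiguous(Node *nodes, size_t n, size_t gap) {
  if (n == 0) return NULL;
  for (size_t i = 0; i < n; i++) {
    nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
  }
  Node *tail = &nodes[n - 1];
  if (gap >= 2) {
    for (size_t k = gap; k + 1 < n; k += gap) {
      nodes[k - 1].next = &nodes[k + 1]; /* skip over node k */
      tail->next = &nodes[k];            /* re-attach node k at the end */
      nodes[k].next = NULL;
      tail = &nodes[k];
    }
  }
  return &nodes[0];
}

/* Plain traversal: each step waits for the load of node->next. */
uint64_t sum_naive(const Node *node) {
  uint64_t value = 0;
  for (; node; node = node->next) value += node->value;
  return value;
}

/* Guess that the next node is adjacent; use the real pointer on a miss. */
uint64_t sum_guess_next(Node *node) {
  uint64_t value = 0;
  for (; node; node = node->next) {
    for (;;) {
      value += node->value;
      if (node + 1 != node->next) break; /* guess was wrong (or list ended) */
      node++;                            /* guess was right: no dependent load */
    }
  }
  return value;
}
```

With 16 nodes and `gap = 5`, only nodes 5 and 10 are out of place, so the guess misses a handful of times and both functions still visit every node exactly once.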

WantonQuantum 1 year ago |

This is great! Mostly when I think about branch prediction, I'm thinking about the end of a loop so this was a great read.

There have been a lot of comments about the example presented being quite artificial and I agree but it is simple enough to help the reader understand what's happening and why it's faster.

In fact, it would be fairly common for the nodes in linked lists to be sequential in RAM anyway. For example, this code shows that the next node is easy to guess: the nodes end up exactly in sequence in memory.

  #include <stdlib.h>
  #include <stdio.h>

  typedef struct Node {
    int value;
    struct Node *next;
  } Node;

  Node *head = NULL;
  Node *tail = NULL;

  int main(int argc, char **argv) {

    // Allocate some nodes
    for (int i = 0; i < 100; i++) {
      Node *new_node = malloc(sizeof(Node));
      new_node->value = rand();
      new_node->next = NULL;
      if (tail == NULL) {
        head = tail = new_node;
      } else {
        tail->next = new_node;
        tail = new_node;
      }
    }

    // Print their locations in memory (every node, including the last one)
    for (Node *current = head; current != NULL; current = current->next) {
      printf("%p\n", (void *)current);
    }
  }

kzrdude 1 year ago | |

That's a controversial part of it. I think that, strictly, if the nodes are allocated as part of one array, it is permissible to use current++ to traverse from one to the other, while it would be UB if they are in separate allocations, even if logically it should work all the same.

WantonQuantum 1 year ago | | |

Oh yes, you're right!

zogrodea 1 year ago |

Value speculation is a neat trick. I was also surprised a low-level hack like this worked in a high-level language like OCaml.

https://news.ycombinator.com/item?id=35844078

PoignardAzur 1 year ago |

It's a neat trick, but I think a linked list (with the very specific layout where nodes are allocated in order) is the only situation where this trick could possibly be useful?

And I think it only works if Spectre mitigations are disabled anyway?

What the trick does is replace sequential fetches (where each fetch address depends on the result of the previous fetch because, well, linked lists) with parallel fetches. It takes the minimum fetch-to-fetch latency from a L1 cache hit (roughly 3 cycles IIRC) to a cycle or less (most CPUs can do multiple parallel fetches per cycle).

If your data is stored in a vector or a B-tree, accesses are already parallel by default and you'll never need this trick.

bee_rider 1 year ago |

Hmm. What do we think about the alternative guess that the address of the next node’s value field is the next address after our current node’s value field memory address? This is, I guess, essentially a guess that we’re pointing at sequential elements of a big array, which sort of begs the question “why not just use an array?” But I wonder if a slightly permuted array is not so unusual, or at least might come up occasionally.

mistercow 1 year ago | |

I feel like there are a number of data structures that you might initially set up (or load in) in some preferred contiguous order, which will still remain largely contiguous after modification, so that you get a good tradeoff between cheap operations and fast traversal. You’d then have the option to do partial defragmentation at convenient times, without having to have a convoluted hybrid data structure. But it’s definitely something you’d do in specific critical cases after a lot of analysis.

BobbyTables2 1 year ago |

The example is slightly contrived (the entire linked list was preallocated), but seems like the same technique could be a useful performance optimization, such as if successive calls to malloc commonly return pointers with a particular stride.
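One way to probe that idea is to check how often successive `malloc` calls land a fixed distance apart. The sketch below is an assumption-laden experiment, not a statement about any particular allocator: the function name `count_common_stride` is made up, the stride is simply taken from the first pair of blocks, and the result will vary with the allocator, the block size, and the state of the heap.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` blocks of `size` bytes and count how many successive
   pairs are exactly as far apart as the first pair was. */
size_t count_common_stride(size_t count, size_t size) {
  if (count < 2) return 0;
  void **blocks = malloc(count * sizeof *blocks);
  if (!blocks) return 0;

  size_t got = 0;
  for (; got < count; got++) {
    blocks[got] = malloc(size);
    if (!blocks[got]) break; /* stop early if allocation fails */
  }

  size_t hits = 0;
  if (got >= 2) {
    /* Compare addresses as integers; pointers from separate allocations
       must not be subtracted directly. */
    uintptr_t stride = (uintptr_t)blocks[1] - (uintptr_t)blocks[0];
    for (size_t i = 1; i < got; i++) {
      if ((uintptr_t)blocks[i] - (uintptr_t)blocks[i - 1] == stride) hits++;
    }
  }

  for (size_t i = 0; i < got; i++) free(blocks[i]);
  free(blocks);
  return hits;
}
```

A result close to `count - 1` would suggest that a fixed-stride guess is usually right for that allocation pattern; a low number would suggest the guess misses too often to help.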

alfiedotwtf 1 year ago |

Ok… all the comments are pretty nitty gritty obscure. So unless you’re a compiler hacker or HFT assembly dev, where can someone like me learn all this stuff from (besides Intel/Arm manuals, even though the i386 manuals were nice)

gpderetta 1 year ago | |

Agner Fog's x86 microarchitecture and optimization manuals are a good start.

alfiedotwtf 1 year ago | | |

Optimisation manuals?! Wow that’s the first time I’ve ever heard of them. Thank you!!

oersted 1 year ago |

I enjoyed the read and it taught me new things, I just wish that the reference example would have some minimal practical value.

I don’t think there is any reasonable scenario where you would be using a linked list but the memory is contiguous most of the time.

flysand7 1 year ago |

I may have misread the graphs, but I didn't see the article feature the comparison between the throughput when going over a fully-contiguous linked list vs. randomized linked list?

IshKebab 1 year ago |

Neat trick. Though it seems unlikely to be very useful in practice. How often are you going to know the probable value of a pointer without knowing the actual value? I would guess it's pretty rare. Interesting anyway!

gus_massa 1 year ago | |

It's possible to have "sparse" matrices where most of the values are 0 and only a few are nonzero. So you can guess 0 and cross your fingers.

(There are libraries that implement sparse matrices in a more memory-efficient way. I needed them for a program in Python, but I'm not an expert in Python. I found a few ways, but they are only useful for big matrices with very few coefficients and have other restrictions to get an improved speed. I finally gave up and used normal np matrices.)

bee_rider 1 year ago | | |

How sparse is your matrix?

bell-cot 1 year ago | |

Bigger-picture, this method amounts to manually assisted speculative execution. And it's not about knowing the not-yet-loaded value, but about knowing what will (very likely) happen as a consequence of that value.

dmoy 1 year ago | |

Well, in the article's case, it's the linked list `next` pointers:

https://mazzo.li/posts/value-speculation.html#value-speculat...

In a happy case, those will be laid out sequentially in memory so you can guess the value of the pointer easily.

(That said, your comment still stands, since using linked lists in the first place is much rarer.) But I suppose there are probably a lot of other domains where you might have a performance-critical loop where some hacky guessing might work.

account42 1 year ago | | |

Not only are linked lists rare, they are also mainly useful exactly in situations where you cannot guarantee a (even mostly) linear allocation order.

Bootvis 1 year ago | |

It might be useful in cases where you pre-allocate a large array which you don't randomly access and whose structure doesn't change much but sometimes it does. Then you could either reallocate the array and pay a (large) one time cost or use this trick.

gpderetta 1 year ago | |

In principle a compiler via JIT or PGO could do this optimization automatically.

mnw21cam 1 year ago |

The article states that the CPU has a limit of 4 instructions per cycle, but the sum2 method issues 5 instructions per cycle. Presumably one of them (maybe the increment) is trivial enough to be executed as a fifth instruction.

rostayob 1 year ago | |

gpderetta is right -- test/cmp + jump will get fused.

uiCA is a very nice tool which tries to simulate how instructions will get scheduled, e.g. this is the trace it produces for sum3 on Haswell, showing the fusion: https://uica.uops.info/tmp/75182318511042c98d4d74bc026db179_... .

xiphias2 1 year ago | | |

It's cool, I would love to have this for ARMv8 Mac

gpderetta 1 year ago | |

Some nominally 4-wide Intel CPUs can execute 5 or 6 instructions per cycle when macrofused. For example, a cmp and a conditional jXX can be macrofused.

gpderetta 1 year ago |

Nice article!

Incidentally, value speculation (or prediction) is a way to break causality in concurrent memory models.

moonchild 1 year ago | |

depends how you define causality. if you consider the execution of one operation to cause the execution of the next operation in program order, then causality was already broken by simple reordering. if it's a read-write dependency, on the other hand, then it won't be broken (because cpus respect control dependencies); hence, you cannot, for example, replicate oota this way. what's broken is specifically read-read data dependencies. and only on weakly-ordered architectures; it won't do anything on x86

bobmcnamara 1 year ago |

The nodes are adjacent.

gergo_barany 1 year ago | |

The nodes are adjacent in sum1.

The nodes are adjacent in sum2, and sum2 executes more instructions than sum1, and sum2 is faster than sum1.

The nodes are adjacent in sum3, and sum3 executes more instructions than sum1, and sum3 is faster than sum1.

Atharshah 1 year ago |

I will change my game level

  uint64_t sum5(Node *node) {
    uint64_t value = 0;
    Node *next = NULL;
    for (; node; node = node->next) {
      for (;;) {
        value += node->value;
        if (node + 1 != node->next) {
          break;
        }
        node++;
      }
    }
    return value;
  }

  while (node) {
    value += node->value;
    next = node->next;
    node++; // Line 101
    if (node != next) {
      node = next;
    }
  }
  // Compiler warning
  // Warning: Lines 101 102 103: always true evaluation combined