Myths Programmers Believe about CPU Caches (2018)

Myths Programmers Believe about CPU Caches (2018) (software.rajivprab.com)

366 points by noego 6 years ago | 119 comments

strstr 6 years ago |

If cache coherence is relevant to you, I strongly recommend the book “A Primer on Memory Consistency and Cache Coherence”. It’s much easier to understand the details of coherency from a broader perspective, than an incremental read-a-bunch-of-blogs perspective.

I found that book very readable, and it cleared up most misconceptions I had. It also teaches a universal vocabulary for discussing coherency/consistency, which is useful for conveying the nuances of the topic.

Cache coherence is not super relevant to most programmers though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent” model. Having an understanding of data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU (unless you are on something like the DEC Alpha, which you probably aren't).
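To make the "data race free -> sequentially consistent" idea concrete, here is a minimal C++ sketch. The helper name `locked_count` is made up for illustration. Every access to the shared counter happens under a mutex, so the program has no data race and behaves as if the increments ran in some sequential interleaving.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: two threads each add `iters` to a shared counter.
// All accesses to `counter` are protected by `m`, so there is no data race.
long locked_count(int iters) {
    assert(iters >= 0);
    long counter = 0;
    std::mutex m;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            std::lock_guard<std::mutex> guard(m);  // lock acquires, unlock releases
            ++counter;
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;  // join() orders this read after both threads' writes
}
```

Calling `locked_count(10000)` returns 20000 regardless of how the threads interleave. Remove the lock and the program has a data race, at which point the language gives no guarantee at all.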

If you are writing an OS/Hypervisor/Compiler (or any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.

deepaksurti 6 years ago | |

“A Primer on Memory Consistency and Cache Coherence” is part of the "Synthesis Lectures on Computer Architecture" which are 50-100 page booklets on topics related to HW components. All the booklet PDF's are available online [1].

edit: only the booklets with a checkmark are available as free PDF downloads; the rest can be bought. Quite a few are actually available for download.

[1] https://www.morganclaypool.com/toc/cac/1/1

jblow 6 years ago | |

Disagree on that last part. If more programmers understood cache coherency, maybe their programs would not run like a giant turd.

fulafel 6 years ago | | |

Are the details really helpful for performance work? MOSI, MOESI, MERSI, MESIF etc are 99% irrelevant to having the right mental model. "Dirtying cache lines across different cores/threads is slow" is most of what you need to know. Within the same core the coherency protocols between levels of cache are not really visible at all to software even as varying performance artifacts.
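One practical consequence of "dirtying cache lines across cores is slow" is to give each thread's hot data its own cache line. A minimal sketch follows, assuming 64-byte lines (typical on x86, but check your target). The names `PaddedCounter`, `kThreads` and `sum_per_thread_counters` are made up for illustration. It shows the layout only and is not a benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines
constexpr int kThreads = 4;             // hypothetical thread count

// alignas pads each counter out to a full line, so no two counters share one.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread increments only its own slot; the slots are summed after join().
long sum_per_thread_counters(int iters) {
    assert(iters >= 0);
    PaddedCounter slots[kThreads];
    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t) {
        pool.emplace_back([&slots, t, iters] {
            for (int i = 0; i < iters; ++i) ++slots[t].value;
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

With a plain `long slots[kThreads]` the result would be the same, but all four counters would likely sit in one line and every increment could bounce that line between cores.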

Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture and making a cross cpu cache miss slightly less slow could just maybe be helped by the detailed understanding of the machine model, but by that time you're already far in the not-giant-turd territory (at least if you're optimizing the right thing).

Of course computer architecture is fascinating and fun to learn about.

DrScientist 6 years ago | | |

Surely the whole point of good system design is a set of logical abstractions, where you need to understand the logical model and not the internal details - as these are free to be evolved.

Of course performance matters, but surely having performance tests, rather than trying to second guess what the whole stack below you might be doing, is

1. more efficient 2. more accurate 3. more likely to detect changes in a timely way.

That's not to say, you shouldn't be curious and deep understanding isn't a good thing.

Just saying that understanding the abstraction you are working with inside-out (e.g. the Java Memory Model), along with its performance characteristics (from real-world testing), is more important than some passing knowledge of real-world CPU design.

This app I am using right now is in a web browser - not sure how understanding cache coherency helps in single-threaded JavaScript.

pingyong 6 years ago | | |

Eh, idk. Most programs that run like a giant turd do it because they load 15 megabytes of Javascript libraries to call two functions, or something to that effect. Computers are so fast now that you really need to be doing something unbelievably stupid for things in consumer programs to not be instantaneous.

strstr 6 years ago | | |

Most engineers don't write code with hard performance constraints. Game devs probably need to be fighting to get every frame.

For the bulk of the eng I work with the concept of StoreLoad reordering on x86 would be an academic distraction.

devnonymous 6 years ago | | |

I disagree on this. More programmers should understand the performance characteristics of the abstraction layers that they rely on. Else we have the case of jerk programmers who insist on redesigning / rewriting / refactoring to optimise for CPU caches while still running the apps via a bunch of docker containers each based off the centos image to run one simple binary that probably needs only glibc.

jacobush 6 years ago | | |

Maybe but it feels like most are stuck in environments which will do bad things to their cache coherence. It's fine if you are doing some data processing in C. If you are using .NET with a bunch of magic libraries or Javascript or whatever, sure it will help, but to actually make an impact you have to be very careful.

codetrotter 6 years ago | | |

I agree with you Jonathan but am wondering, will Jai help programmers write programs with better cache coherency even if said programmers don’t understand cache coherency well? Or is that orthogonal to the goals of Jai?

voldacar 6 years ago | |

Could you explain what the DEC Alpha did differently here?

It was before my time :)

jcranmer 6 years ago | | |

It's not what DEC Alpha did, but what it didn't do. ;-)

The issue that comes up on Alpha is this code:

  thread1() {
    x = …; // Store to *p
    release_barrier(); // guarantee global happens-before
    p = &x; // ... and now store the p value.
  }
  
  thread2() {
    r1 = p; // If this reads &x from thread1,
    r2 = *r1; // this doesn't have to read the value of x!
  }

The Alpha's approach to memory was to impose absolutely no constraints on memory unless you asked it to. And each CPU had two cache banks, which means that from the hardware perspective, you can think of it as having two threads reading from memory, each performing their own coherency logic. So you can have one cache bank read the value of p and, having processed all pending cache traffic, see both the stores, and then you can turn around and ask the other cache bank to read *p, which, being behind on the traffic, hasn't seen either store yet.

Architectures with only one cache bank don't have this problem. Other architectures with cache banks feel obligated to solve the issues by adding extra hardware to make sure that the second cache bank has processed sufficient cache coherency traffic to not be behind the first one if there's a dependency chain crossing cache banks.
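In portable C++ the same publish-a-pointer pattern can be sketched with a release store paired with an acquire load. This is a sketch under assumptions: the function name `publish_and_read` is made up, and it uses `memory_order_acquire` on the reader rather than `memory_order_consume`, since current compilers treat consume as acquire anyway.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper mirroring the pseudocode: `x` is the payload, `p` publishes it.
int publish_and_read() {
    int x = 0;
    std::atomic<int*> p{nullptr};
    int seen = -1;

    std::thread writer([&] {
        x = 42;                                  // plain store to the payload
        p.store(&x, std::memory_order_release);  // publish; the store to x is ordered before this
    });
    std::thread reader([&] {
        int* r1;
        while ((r1 = p.load(std::memory_order_acquire)) == nullptr) {
            std::this_thread::yield();           // spin until the pointer is published
        }
        seen = *r1;                              // acquire pairs with the release: sees 42
    });
    writer.join();
    reader.join();
    return seen;
}
```

`publish_and_read()` returns 42 on any conforming implementation. On an Alpha the acquire load is what supplies the barrier the hardware would otherwise skip.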

brandmeyer 6 years ago | | |

IMO, the various blogs and tutorials out there that help to make sense of the C++11 memory model make better tutorials than the Linux kernel's own shenanigans.

The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.

https://preshing.com/20140709/the-purpose-of-memory_order_co...

ridiculous_fish 6 years ago | | |

memory-barriers.txt from Linux is a lovely way to be introduced to memory barriers in general, and the Alpha memory model in particular.

https://github.com/torvalds/linux/blob/master/Documentation/...

kqr 6 years ago | |

You make it sound like CPU caches are the only caches around. I deal with higher-level caching a lot, and I'm not writing an OS. Is your book recommendation still useful for me?

dragontamer 6 years ago |

A good, introductory, high-level overview of what is going on with cache coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than on x86. Memory barriers (which also function as "compiler barriers" for the memory / register thing discussed in the article) are handled as long as you properly use locks (or other synchronization primitives like semaphores or mutexes).

It's good to know how things work "under the covers" for performance reasons at least. Especially if you ever write a lock-free data-structure (not allowed to use... well... mutexes or locks), so you need to place the barriers in the appropriate spot.
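To show where those barriers live inside a lock, here is a minimal spinlock sketch in C++. `SpinLock` and `spin_locked_count` are made-up names, and this is an illustration rather than a production lock: taking the lock is an acquire operation and dropping it is a release operation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative spinlock: the memory orders are the "barriers" a real lock hides.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // acquire: later reads/writes cannot move above a successful test_and_set
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // release: earlier reads/writes cannot move below the clear
        flag_.clear(std::memory_order_release);
    }
};

// Hypothetical helper: two threads bump a plain counter under the spinlock.
long spin_locked_count(int iters) {
    assert(iters >= 0);
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

On x86 the acquire and release orderings cost nothing beyond the atomic instruction itself; on ARM or POWER the compiler has to emit the appropriate barrier or acquire/release instructions.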

------

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release... ARM and POWER have added acquire/release model instructions, and even CUDA has acquire/release semantics being built.

As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.

sherincall 6 years ago |

One thing not mentioned here (nor in previous discussions of the article, it seems) is that DMA is typically not coherent with the CPU caches. This is kinda visible from the little diagram at the top, with the disk sitting on the other side of the memory, but it should be explicitly spelled out. If you're using a DMA device (memory<->device or memory<->memory copies), you might end up in a state where the DMA and the CPU see different values. This usually means data transfer to/from a Disk or GPU, though other peripherals might use it too.

Your options here are either to manually invalidate your caches and synchronize with the DMA (e.g. via interrupts), or to request from the OS that the given memory section be entirely uncached; or in some cases, you can get away with a write-through cache policy, if the DMA is only ever reading the memory.

AllanHoustonSt 6 years ago | |

I think DPDK does some user-level trickery to achieve per-core caching through DMA, do you happen to know how they go about it?

lettergram 6 years ago |

For those interested (in 2014!) I did a rather simple analysis of CPU caches and for loops to point out some pitfalls:

https://austingwalters.com/the-cache-and-multithreading/

Hope it helps someone, I tend to link it to my co-workers when they ask me why I PR'd re-ordering of loops & functions OR when they ask how I get speedups without changing functionality.
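A minimal sketch of the kind of loop re-ordering meant here, with a made-up `Matrix` type stored row-major in one contiguous buffer. Both functions compute the same sum; only the traversal order, and therefore the memory access pattern, differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: element (r, c) lives at data[r * cols + c].
struct Matrix {
    std::size_t rows, cols;
    std::vector<int> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 1) {}
    int at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Inner loop walks consecutive addresses: each fetched cache line is fully used.
long sum_row_order(const Matrix& m) {
    long total = 0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.at(r, c);
    return total;
}

// Inner loop jumps `cols` elements per step: on a large matrix most accesses
// touch a different cache line than the previous one.
long sum_col_order(const Matrix& m) {
    long total = 0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.at(r, c);
    return total;
}
```

Swapping the two loops changes no results, which is why this sort of PR can be a speedup "without changing functionality". How big the gain is depends on matrix size and hardware, so measure it.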

nemothekid 6 years ago |

I've never heard of the MESI protocol before so that was really interesting to read, and I liked the comparison to distributed systems.

I'm wondering if the MESI protocol could be used in a networked database manner? I feel like you need a master node to coordinate everything though (like the L2 does in the example).

dakom 6 years ago |

Something I don't understand is how to deal with cache coherency when you need the same data in a bunch of different configurations.

Take a typical game loop and assume we have a list of Transforms (e.g. world matrix, translation/rotation/scale, whatever - each Transform is a collection of floats in contiguous memory)

Different systems that run in that loop need those transforms in different orders. Rendering may want to organize it by material (to avoid shader switching), AI may want to organize it by type of state machine, Collision by geometric proximity, Culling for physics and lighting might be different, and the list goes on.

Naive answer is "just duplicate the transforms when they are updated" but that introduces its own complexity and comes at its own cost of fetching data from the cache.

I guess what I'm getting at is:

1) I would love to learn more about how this problem is tackled by robust game engines (I guess games too - but games have more specific knowledge than engines and can have unique code to handle it)

2) Does it all just come out in the wash at the end of the day? Not saying just throw it all on the heap and don't care... but, maybe say optimizing for one path, like "updating world transforms for rendering", is worth it and then whatever the cost is to jump around elsewhere doesn't really matter?

Sorry if my question is a bit vague... any insight is appreciated

yvdriess 6 years ago | |

Assume that, once determined at the start of the frame (e.g. camera position changes after user input handling), the transform matrices are not written to again. They can then be freely shared across multiple cores without causing problems with coherency. The cache lines associated with the transforms will be set to 'Shared' across all cores. Cache coherency will start to bite you in the ass in this situation if you start mutating the transforms while other threads are reading them, causing cache invalidations and pipeline flushes across all caches owning those lines.

In short, write a transform once and treat it as immutable. Do not reuse the Transform allocation for a good while for subsequent frames to ensure that its cache lines are no longer in cache. If you do need to reuse right away, you can force invalidate cache lines by addresses, so that the single-writer in the next step is the single (O)wner and no other caches need to invalidate anything.
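The Shared-then-invalidate behaviour can be sketched with a toy model of a single cache line. This is a simplified MESI state machine (no Owned state, no data, no timing), and `State`, `Line` and `kCores` are made-up names for illustration only.

```cpp
#include <array>
#include <cassert>

enum class State { Modified, Exclusive, Shared, Invalid };

constexpr int kCores = 4;  // hypothetical core count

// Toy model: the state of ONE cache line in each core's private cache.
struct Line {
    std::array<State, kCores> st;
    Line() { st.fill(State::Invalid); }

    // A read hit changes nothing; a read miss downgrades other holders to Shared.
    void read(int core) {
        assert(core >= 0 && core < kCores);
        if (st[core] != State::Invalid) return;
        bool others = false;
        for (int i = 0; i < kCores; ++i) {
            if (i != core && st[i] != State::Invalid) {
                st[i] = State::Shared;  // a Modified holder would write back here
                others = true;
            }
        }
        st[core] = others ? State::Shared : State::Exclusive;
    }

    // A write invalidates every other copy; returns how many copies were killed.
    int write(int core) {
        assert(core >= 0 && core < kCores);
        int invalidated = 0;
        for (int i = 0; i < kCores; ++i) {
            if (i != core && st[i] != State::Invalid) {
                st[i] = State::Invalid;
                ++invalidated;
            }
        }
        st[core] = State::Modified;
        return invalidated;
    }
};
```

If cores 0, 1 and 2 all read the line, every copy ends up Shared and further reads are free. A later `write(0)` returns 2: two other caches lose their copy and must re-fetch it on their next read, which is the cost of mutating a transform that others are still reading.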

dakom 6 years ago | | |

Thanks - I'll have to do a bit more learning to really understand this, e.g. how mutability relates to cache lines and what "Shared" means in that context, but this gives me some good practical direction and insight to take it further :)

dang 6 years ago |

Discussed last year: https://news.ycombinator.com/item?id=17670095

vagab0nd 6 years ago |

This might be a naive question: how did we decide as an industry that cache should be controlled by the hardware, but registers and main memory by the compiler?

nostrademons 6 years ago |

Anyone else start thinking of Rust's mutable/immutable borrow system when reading the MESI algorithm? It's not quite the same - with Rust, mutable borrows are never shared, and you can never mutate a shared read-only borrow - but the principle seems like a simplification of the full MESI protocol.

It seems like this would be generally applicable for a wide variety of distributed & concurrent applications.

xakahnx 6 years ago | |

Keeping the directory coherent is the difficult part when translating directory-based cache coherence protocols to other distributed systems problems. The directory is like an oracle that sees every transaction in order. This is hard in most network distributed systems problems where you have to worry about availability, network partitioning, or durability of this node.

yvdriess 6 years ago | | |

Indeed, the evolution will probably go in the other direction, with the on-chip network adopting algorithms from wider networks to deal with scaling problems. DRAM interfaces used to be pretty simple; now they are being trained almost like a DSL line.

tyingq 6 years ago |

AMD's Rome processors are an interesting case, with 8MB of L3 cache per core. So the 64 core processor has 512MB of L3 cache. It wasn't that long ago that 512MB was a respectable amount of DRAM in a big server. An early 90's fridge sized Sun 690MP maxed out at 1GB of DRAM and had 1MB of L2 cache, no L3.

zamadatix 6 years ago | |

Half that - 4 MB per core so the 64 core CPU has 256 MB (dual socket is where the 512 number comes from but that's 128 cores and NUMA).

It's also not fully accessible, each core can only directly access the 16 MB in its group of 4. Everything else is the same as a cross cache read.

tyingq 6 years ago | | |

Ah, yeah. Mixed up their CCD and CCX terms. The 690MP was dual socket though, so still a somewhat valid comparison.

zozbot234 6 years ago | |

> It wasn't that long ago that 512MB was a respectable amount of DRAM in a big server.

Whereas today, 512MB is a bare minimum amount of DRAM in a general-purpose desktop. Times change.

blattimwind 6 years ago | | |

I don't think you can run a modern Linux or Windows 10 desktop on 512 MB RAM. Even my slim Linux desktop (no DE) consumes about 400 MB of RAM after login. Web-browsing with less than ~2 GB of memory doesn't seem feasible.

musicale 6 years ago |

> “different cores can have different/stale values in their individual caches”.

Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

And what about virtual caches? Non-coherent cache lines?

Moreover, even in the face of cache coherency you can still have race conditions.

gpderetta 6 years ago | |

> Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.

What do you mean? Either two caches agree on the content of a cacheline or one of the cachelines is marked invalid (and the stale content is irrelevant). There are components of a core that might not respect coherency, like load and store buffers and arguably registers, but not caches (on cache-coherent systems of course).

Virtually addressed caches are an issue and that's why they have fallen out of favor.

praptak 6 years ago |

The one-line summary seems to be that one should never worry about caches themselves introducing concurrency bugs.

I mean after we account for memory operations reordering on each core, the memory address storing a single value that is visible to all cores is a correct model from the concurrency-correctness point of view, right?

Tuna-Fish 6 years ago | |

On x86, the correct model is that on any core, all reads are in order, all writes are in order, and no write will ever be moved earlier than a read on that core.

Or in other words, the only kind of visible reordering that is allowed to occur is that writes can be delayed past reads.

An example of a situation where this is significant:

    thread 1       thread 2
    mov [X], 0     mov [Y], 0
    mov [X], 1     mov [Y], 1
    mov r1, [Y]    mov r2, [X]

After this sequence of code, r1 == r2 == 0 is legal. (As is any other combination of 1 and 0.)

(edit:) And just to add, all this reordering is of course impossible to detect on just one core, as when a read request hits a recent write on the same core, it reads it out of the store queue. This can sometimes be really bad for performance, though: if you read a value that is only partially in the store queue (such as writing a 16-bit value to x, then immediately reading 32 bits from x), some CPUs will stall that read, and all that follow it, until the entire store queue is flushed. Since the store queue can easily take tens if not hundreds of cycles to clear, this can be very expensive.
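The same litmus test can be written with C++ atomics. A minimal sketch (the name `count_both_zero_seq_cst` is made up): with sequentially consistent stores and loads the language forbids r1 == r2 == 0, so the compiler has to emit a fence or locked instruction on x86 to stop the store from being delayed past the load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-thread store/load litmus test `trials` times and counts how
// often both threads read 0. With seq_cst operations this must never happen.
int count_both_zero_seq_cst(int trials) {
    assert(trials >= 0);
    int both_zero = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> X{0}, Y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            X.store(1, std::memory_order_seq_cst);
            r1 = Y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            Y.store(1, std::memory_order_seq_cst);
            r2 = X.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

`count_both_zero_seq_cst(200)` returns 0. If both the stores and the loads are weakened to `memory_order_relaxed` (or release/acquire), the 0/0 outcome becomes legal again, matching the plain `mov` version above.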

blattimwind 6 years ago |

It's worth pointing out that the L1 cache and its associated logic is the only* way a core talks to the outside world, including all I/O ever. With that in mind it is easy to understand why it is so crucial to performance.

* there might be some minor exceptions

wildmanx 6 years ago |

The biggest myth is that any of this matters to anybody but a tiny fraction of niche programmers.

The reason some "app" is slow is not because of cache coherence traffic. It's because somebody chose the wrong data structure, created some stupid architecture, or wrote incomprehensible code that the next guy extended in a horribly inefficient way because they didn't understand the original. My web browsing experience is slow because people include heaps of bloated JS crap and trackers and ad networks so I have to load 15 megabytes of nonsense and wait for JS to render stuff I don't want to see. None of this would be any better if anybody involved understood CPU caches better.

Even in the kernel or HPC applications, most code is not in the hot path. Programmers should rather focus on clean architectures and understandable code. How does it help if you hand-optimize some stuff that nobody understands, so nobody can fix a bug and nobody can extend it? That's horrible code, even if it's 5% faster than what somebody else wrote.

TL;DR: This is interesting, but likely totally irrelevant to your day job. In the list of things to do to improve your code, it comes so far down that you'll never get there.

gchokov 6 years ago |

Half a decade - woow! Sounds like the author has spent half a century there..

johnthescott 6 years ago |

the entire design of unix is realized in a moment when a motherboard is seen as a network.

1e1f 6 years ago |

Should include tl;dr your concurrency fears are real, but for registers and not caches.

simpsond 6 years ago | |

If you have two threads reading then writing values in memory, you still need synchronization/atomic changes at the software level.
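A minimal sketch of such an atomic read-then-write, using a compare-and-swap loop. `atomic_max` and `max_of_two_ranges` are made-up helpers, not standard library functions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: atomically raise `target` to at least `candidate`.
// A plain "if (target < candidate) target = candidate;" could lose updates,
// because another thread may write between the read and the write.
void atomic_max(std::atomic<int>& target, int candidate) {
    int cur = target.load(std::memory_order_relaxed);
    while (cur < candidate &&
           !target.compare_exchange_weak(cur, candidate, std::memory_order_relaxed)) {
        // On failure compare_exchange_weak reloads `cur`; the loop re-checks it.
    }
}

// Two threads feed the even and odd numbers below n into one shared maximum.
int max_of_two_ranges(int n) {
    assert(n >= 1);
    std::atomic<int> best{0};
    std::thread a([&] { for (int i = 0; i < n; i += 2) atomic_max(best, i); });
    std::thread b([&] { for (int i = 1; i < n; i += 2) atomic_max(best, i); });
    a.join();
    b.join();
    return best.load();
}
```

`max_of_two_ranges(1000)` returns 999 however the threads interleave. Cache coherence keeps every core's view of `best` consistent, but only the atomic compare-and-swap makes the read-check-write sequence indivisible.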

mrich 6 years ago |

Note that this is quite specific to x86. Other architectures like Power give much weaker guarantees, which will lead to problems if you assume the same model.

dspillett 6 years ago | |

So another myth for this list is "caches are at all similar between architectures"?

gpderetta 6 years ago | |

which part is x86 specific?

mrich 6 years ago | | |

To quote from https://fgiesen.wordpress.com/2014/07/07/cache-coherency/

"Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model."