Programming Language Memory Models (research.swtch.com)

180 points by thinxer 5 years ago | 97 comments

raphlinus 5 years ago |

A GPU followup to this article.

While on CPU sequentially consistent semantics are efficient to implement, that seems to be much less true on GPU. Thus, Vulkan completely eliminates sequential consistency and provides only acquire/release semantics[1].

It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way that introduces deadlock (see the reddit discussion linked from [2]). I was curious enough about this that I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by the API, but not for mutexes built out of primitives). I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.
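For concreteness, here is a minimal sketch of the kind of spinlock under discussion, written in C++ rather than Rust or a Vulkan shader. The `SpinLock` class and its member names are hypothetical, and the ordering choices (acquire to lock, release to unlock) are exactly the assumption being questioned above, not a settled answer.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical test-and-test-and-set spinlock; all names are illustrative.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    // Single attempt: acquire on success pairs with the release in unlock().
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // Spin on a relaxed load so waiters mostly read instead of write.
            while (locked_.load(std::memory_order_relaxed)) {
            }
        }
    }

    // Release: writes made inside the critical section become visible to
    // the next thread whose acquire exchange observes this store.
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```

Because it exposes lock/try_lock/unlock, it can be used with `std::lock_guard<SpinLock>`. Whether acquire alone is enough under the Vulkan memory model remains the open question.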

We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.

[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...

[2]: https://rigtorp.se/spinlock/

raphlinus 5 years ago | |

Also: it remains difficult to fully nail down the semantics of sequential consistency as well, especially when it's mixed with other memory semantics. Very likely next time Russ updates his article he should add a reference to Repairing Sequential Consistency in C/C++11[1].

[1]: https://plv.mpi-sws.org/scfix/full.pdf

rsc 5 years ago | |

Thanks for the GPU insights and links (and the paper link below)!

I based my claim about Rust on https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?

spinlocker 4 years ago | | |

I believe your claim is correct: https://news.ycombinator.com/item?id=27758461.

rigtorp 5 years ago | |

There's even more discussion on the lock memory ordering on Stack Overflow: https://stackoverflow.com/questions/61299704/how-c-standard-...

Taking a lock only needs to be an acquire operation and a compiler barrier for other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21, the current argument for why taking a lock only requires acquire semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting Twitter thread where we discuss this that I can't find anymore :(.

rsc 5 years ago | | |

That is an amazing thread. The fact that C++ apparently allows optimizing

    #include <stdio.h>
    
    int stop = 1;
    
    void maybeStop() {
        // Infinite loop with no I/O, volatile access, or atomic operation.
        if (stop)
            for (;;);
    }
    
    int main() {
        printf("hello, ");
        maybeStop();
        printf("world\n");
    }

into

    int main() {
        printf("hello, world\n");
    }

(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop?

Maybe there is a reason, but none of this inspires confidence.
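One partial answer, as far as I understand the C++ forward-progress rules: the compiler may assume a thread eventually terminates, calls a library I/O function, accesses a volatile object, or performs a synchronization or atomic operation. The loop above does none of these, while a lock loop performs an atomic operation on every iteration. A sketch with a hypothetical `spin_while_set` helper, where `stop` is made atomic (that change is an assumption for illustration, not part of the example above):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical variant of the loop above with an atomic flag.
// Returns how many iterations were spent waiting.
long spin_while_set(const std::atomic<int>& stop) {
    long spins = 0;
    // Every iteration performs an atomic load, so the compiler cannot use
    // the "this loop must terminate" assumption to delete the loop.
    while (stop.load(std::memory_order_acquire) != 0) {
        ++spins;
    }
    return spins;
}
```

This only addresses deleting the loop. Whether other operations may be moved across it is a separate question that this sketch does not settle.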

spinlocker 5 years ago | |

> I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.

raphlinus 5 years ago | | |

https://doc.rust-lang.org/std/sync/atomic/struct.AtomicU32.h...

It's accepted by the compiler, but if provided, it compiles to a panic.
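For comparison, the C++ counterpart takes separate success and failure orderings, and acq_rel is allowed only for the success case. A minimal sketch with a hypothetical `try_acquire` helper; this is C++, not the Rust API discussed above, so it says nothing about what rustc accepts or panics on:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: try to flip `flag` from false to true.
// Returns true if this caller made the transition.
bool try_acquire(std::atomic<bool>& flag) {
    bool expected = false;
    // compare_exchange_weak may fail spuriously, so retry as long as the
    // flag was actually observed to be false.
    while (!flag.compare_exchange_weak(expected, true,
                                       std::memory_order_acq_rel,    // on success
                                       std::memory_order_relaxed)) { // on failure
        if (expected) {
            return false;  // the flag was already set by someone else
        }
        // Spurious failure: `expected` is still false, so try again.
    }
    return true;
}
```

The failure ordering describes a plain load, which is why release and acq_rel are not valid there in C++.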

dragontamer 5 years ago | |

GPU-spinlocks are a bad idea, unless the spinlock is applied over the entire Thread-group.

Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it.

---------

From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).

--------

So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.

Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization.

Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.

raphlinus 5 years ago | | |

In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity.

In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.

wcarss 5 years ago |

The prior article in this series from ~a week ago is 'Hardware Memory Models', at https://research.swtch.com/hwmm, with some hn-discussion here: https://news.ycombinator.com/item?id=27684703

Another somewhat recently posted (but years-old) page with different but related content is 'Memory Models that Underlie Programming Languages': http://canonical.org/~kragen/memory-models/

A few previous HN discussions of that one:

https://news.ycombinator.com/item?id=17099608

https://news.ycombinator.com/item?id=27455509

https://news.ycombinator.com/item?id=13293290

electricshampo1 5 years ago |

"Java and JavaScript have avoided introducing weak (acquire/release) synchronizing atomics, which seem tailored for x86."

This is not true for Java; see

http://gee.cs.oswego.edu/dl/html/j9mm.html

https://docs.oracle.com/en/java/javase/16/docs/api/java.base...

dragontamer 5 years ago | |

It's not true in general. x86 CANNOT have weak acquire/release semantics. x86 is "too strong": you get total-store ordering by default.

If you want to test out weaker acquire/release semantics, you need to buy an ARM or POWER9 processor.

rsc 5 years ago | | |

ARMv7 or earlier, it appears. On ARMv8, with direct hw support for SC atomics, the SC atomics are the suggested implementation of acq/rel too. See the ARMv8 section of https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.

As I mentioned in the post (https://research.swtch.com/plmm#sc), Herb Sutter claimed in 2017 that POWER was going to do something to make SC atomics cheaper. If it did, then that might end up being cheaper than the old sync-based acq/rel too, same as ARM, in which case we'd end up with SC = acq/rel on both ARM and POWER. It looks like that didn't happen, but I'd be very interested to know what did, if anything.

gpderetta 5 years ago | | |

I would say that acquire/release map very well to x86 (where they are free). Technically x86 is slightly stronger as it doesn't allow IRIW, but seq cst is too expensive to implement by default.

Conversely, acq/rel range from somewhat to very expensive to implement on ARM/POWER.
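To make the cost difference concrete: on x86 an acquire load and a release store are ordinary moves, while a seq_cst store needs a locked instruction or a fence. What that buys is ruling out the store-buffering outcome sketched below, where both threads read 0. The function name is hypothetical, and this is a sketch of the litmus test, not a benchmark.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical store-buffering litmus test using seq_cst everywhere.
// Returns true if both threads read 0, which seq_cst forbids.
bool sb_both_read_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}
```

If the stores are weakened to release and the loads to acquire, the C++ model permits the both-zero outcome, and x86 store buffers can actually produce it.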

knz42 5 years ago |

A lot of the complexity comes from the lack of expressivity in languages to relate variables (or data structure fields) semantically to each other. If there were a way to tell the compiler "these variables are always accessed in tandem", the compiler could be smart about ordering and memory fences.

The idea to extend programming languages and type systems in that direction is not new: folk who've been using distributed computing for computations have to think about this already, and could teach a few things to folk who use shared memory multi-processors.

Here's an idea for ISA primitives that could help a language group variables together: bind/propagate operators on (combinations of) address ranges. https://pure.uva.nl/ws/files/1813114/109501_19.pdf

smasher164 5 years ago | |

Even with that expressivity, someone who incorrectly relates or forgets to relate two variables could experience the same issues. It's still important to address what happens when the program has data races or when it is data-race-free but the memory model permits overreaching optimizations. The language and implementation should strive to make a program approximately correct.

dragontamer 5 years ago | |

That's Java's intrinsic object lock (monitor) mechanism, i.e. `synchronized`.

All variables inside of an object (aka: any class) are assumed to be related to each other. synchronized(foobar_object){ baz(); } ensures that all uses of foobar_object inside the synchronized{} block are sequential (and therefore correct).

--------

The issue is that some people (a minority) are interested in "beating locks" and making something even more efficient.

karmakaze 5 years ago | | |

In Java, any object can be used to synchronize any data, e.g.

  synchronized(foobar_object){ foo(); }
  synchronized(foobar_object){ bar(); }
  synchronized(foobar_object){ baz(); }

These blocks keep the foo, bar, and baz methods well behaved on any data that they share, regardless of whether they are foobar methods or methods of any other class(es). It is exactly analogous to the S(a) -> S(a) synchronizing instruction from the article, which establishes a happens-before relation partitioning each thread into before/after the S(a).

The only time synchronized(explicit_object) relates to anything else is when the keyword is also used as a method modifier: `synchronized void foo()` is equivalent (with a minor performance difference) to wrapping the entire body of the foo method in `synchronized(this) { ... }`.

mahmoudimus 5 years ago |

Fascinating article. I've been doing research in this area and I wonder whether JinjaThreads was explored: it operates on Jinja (a Java-like language) and comes with a formal proof of the DRF guarantee (coincidentally using Isabelle/HOL).

You can read more about this here if you're interested: https://www.isa-afp.org/entries/JinjaThreads.html

romesmoke 5 years ago |

I'm wondering: is the fact that a CS PhD finds resources like this as amusing as they are educational/pedagogical gold telling us something about Academia, the Culture, or the Self?

AKA why can't I stumble upon such stuff more often? Thanks OP!

jqpabc123 5 years ago |

> If thread 2 copies done into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies done.

Alternative solution: Forget all the "atomic" semantics and simply avoid "optimization" of global variables. Access to any global variable should always occur directly from memory. Sure, this will be less than optimal in some cases but such is the price of using globals. Their use should be discouraged anyway.

In other words, make "atomic" the sensible and logical default with globals. Assignment is an "atomic" operation, just don't circumvent it by using a local copy as an "optimization".
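For reference, the loop from the quoted passage can be sketched in C++ as below; `done`, `payload`, and `run_once` are illustrative names. With a plain `bool done` the compiler would be free to hoist the load out of the loop. Making it a `std::atomic<bool>` forces a fresh load each iteration, and the release/acquire pair additionally orders the `payload` write, which reading "direct from memory" alone would not guarantee on weakly ordered hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> done{false};  // the flag from the quoted passage
int payload = 0;                // ordinary data published via the flag

// Runs one producer/consumer round and returns what the consumer saw.
int run_once() {
    done.store(false, std::memory_order_relaxed);
    payload = 0;
    int seen = -1;

    std::thread consumer([&seen] {
        // The atomic load is repeated on every iteration; it cannot be
        // cached in a register for the whole loop.
        while (!done.load(std::memory_order_acquire)) {
        }
        seen = payload;  // ordered after the producer's write
    });
    std::thread producer([] {
        payload = 42;
        done.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Here `run_once()` returns 42, because the consumer only leaves the loop after observing the release store that follows the write to `payload`.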

voidnullnil 5 years ago |

These "memory models" are too complex for languages intended for dilettante developers. It was a disaster in Java/C#. No more than a handful of programmers in existence know in depth how it works, as in, can understand any given trivial program in their language. At best they only know some vague stuff, like that locking prevents visibility issues. It goes far deeper than that though (which is also the fault of complex language designs like Java and C#).

The common programmer does not understand that you've just transformed their program - for which they were taught merely that multiple threads need synchronization - into a new game, which has an entire separate specification, where every shared variable obeys a set of abstruse rules revolving around the happens-before relationship. Locks, mutexes, atomic variables are all one thing. Fences are a completely different thing. At least in the way most people intuit programs to work.

Go tries to appeal to programmers as consumers (that is, when given a choice between cleaner design and pleasing the user who just wants to "get stuff done", they choose the latter), yet also adds in traditional complexities like this. Yes, there is a performance trade-off to having shared memory behave intuitively, but that's much better than bugs that 99% of your CHOSEN userbase do not know how to avoid. Also remember Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array) despite the rest of the language being memory-safe. Multiply that by the "memory model".

Edit: forgot spaces between paragraphs.

bullen 5 years ago |

In 100 years the main languages used will still be C on the client (with a C++ compiler) and Java on the server.

Go has no VM but it has a GC. WASM has a VM but no GC.

Everything has been tried and Java still kicks everything's ass to the moon on the server.

Fragmentation is bad; let's stop using bad languages and focus on the products we build instead.

"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html

"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...