I have a feeling that we will see a sharp rise of stories like this, now that ARM finds itself in more places which were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.
[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, or even invent your own. They were not system level nor high perf code programmers, so they had that luxury.
You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides and turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time, and C11 atomics aren't mentioned yet either, but their old OS-specific support is).
Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.
You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.
My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.
These outnumber x86+Cortex-A by probably a factor of 1,000.
I'm only somewhat joking. People need to understand these memory models if they intend on writing atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to change an an LLVM compiler to target aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.
Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.
That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.
(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)
One other comment: the explanation in [1] is slightly incorrect:
> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.
We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
[0] https://github.com/stong/how-to-exploit-a-double-free#0-inte...
[1] https://github.com/stong/how-to-exploit-a-double-free#2-rece...
Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.
An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].
[1] https://github.com/jamesmunns/bbqueue
[2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d70...
They have to be tweaked when execution isn't guaranteed (by using memory barriers). TFA is about an exploit based on code that hasn't added the required memory barriers.
https://www.reitzen.com/post/temporal-fuzzing-01/ https://www.reitzen.com/post/temporal-fuzzing-02/
Next step are some lock free queues, although I haven't gotten around to publishing them!
Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.
For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.
However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).
Other tooling, like Jepsen, will interact with your program at a higher level.
Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
SCNR
There may be a typo in section 3:
> It will happily retire instruction 6 before instruction 5.
If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
tl;dr: just use std::atomic.
[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...
https://news.ycombinator.com/item?id=28731534
https://mobile.twitter.com/ErrataRob/status/1331735383193903...
1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.
2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.
Lock-free doesn't mean that there is no synchronization. It is a way to synchronize memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource - all read access is valid, from any number of simultaneous write accesses at least one succeeds (which is not true for some other algorithms like network exponential backoff).
Even on x86 the most common instruction you use is LOCK cmpxchg.
Usually memory reordering is purely artifact of the way CPUs access their private L1-cache.
People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
Indeed. It's not safe under x86 either.
As a naive practitioner of modern C++, I'd love it if you could elaborate on this.
Using unique_ptr/make_unique() or shared_ptr/make_shared() automates lifetime management (obviates the need for a manual 'delete') and makes the ownership policy explicit. They also have appropriately defined copying behavior. For example:
struct Foo {
// lots of stuff here ...
};
struct A {
Foo* f = new Foo;
~A() { delete f; }
};
struct B {
std::unique_ptr<Foo> f = std::make_unique<Foo>();
// no need to define a dtor; the default dtor is fine
};
For the destructor and the default constructor, compilers will generate basically identical code for both A and B above. If you try to copy a B, the compiler won't let you because unique_ptr isn't copyable. However it won't stop you from copying an A, even though as written (using the default copy ctor) that's almost certainly a mistake and will likely result in a double free in ~A().unique_ptr forces you to think about your dependencies and when objects can / should be cleaned up.
Another "correct" use of volatile is a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that and often you can just use a lower optimization level (like the usual -O2) but sometimes you need -O3 / -Ofast or something and a strategic volatile type def to keep everything working.
A classic example is Kahan summation algorithim. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Of course -O3 might not be any faster anyway but that's another topic.
Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?
Generally optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating point as a non-associative operation because it’s not when you’re working with finite precision.
TLDR: Don’t use volatile unless you really know what you’re doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to “make things work”, it’s most likely cargo curling bad advice (not always, but probably).
See [1] for example the implementation of smp_store_release and smp_load_acquire in the linux kernel (barrier() is just a compiler barrier and {READ,WRITE}_ONCE are a cast to volatile).
Volatile only prevents reordering of volatile statements (and IO), not all load and stores.
[1] https://elixir.bootlin.com/linux/latest/source/tools/arch/x8...
> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.
From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.
[1] see the linux kernel implementation of load acquire and store release on x86 for example.
But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.
The purpose of abolishing volatile isn't so much to reinforce that it's not intended for this sort of threading nonsense (indeed on Windows the MSVC guarantees mean it almost is intended for this sort of nonsense) but to make it explicit that "volatile variables" were never really a thing anyway by abolishing the volatile qualifier on variables.
The thing your hardware can actually do is almost exactly: https://doc.rust-lang.org/core/ptr/fn.read_volatile.html and https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
And sure enough that's equivalent to what is proposed for C++ although not in just this one paper.
With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing. It means nothing, it's gibberish, but you can do it and people do. They presumably thought it meant something and since it doesn't they were wrong. So, deprecate this and maybe they'll go read up on the subject.
You can also in C++ declare things that clearly aren't in the least bit volatile, as volatile anyway. C++ has volatile member variables, volatile member functions, volatile parameters... Any code that seems to rely on this probably doesn't do what the people who wrote it thought it does, run away.
It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow not understand that. Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'. Does that also not make any sense to you? If it does, can you explain why you believe they are different?
Volatile member functions and parameters make no sense, but volatile member variables most certainly do. And there is considerable pushback in the C++ community because this is a significant loss of compatibility with various C-headers used frequently in embedded applications. I wouldn't be surprised if the deprecated features will be reinstated in the language in the end.
I do sort of miss having a basic volatile (although I can write my own type somewhat effectively) just for benchmarking's sake sometimes.
> You can also select multi-core settings, as shown here... These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core.
https://news.ycombinator.com/item?id=28732273
zamadatix's interprets this as Microsoft saying that by default, Windows on ARM runs x86 apps without x86 TSO, and turns on extra memory barriers using per-app compatibility settings. But if an app needs TSO but isn't in Windows's database, it will crash or silently corrupt data.
Because on x86 it is, no special barriers or instructions necessary.
mov [shared_data], 1
mov [release_flag], 1
That said, heuristics are used to speed it up. I would recommend not sharing values in the stack between threads for synchronisation for example.
(Which is better? I don’t know.)
Nearly everything in a modern processor is a source of reordering, from branch prediction to basically everything in the OoO backend. Any time you leave the core, there's reordering happening in the network. And yes, that includes caches, which involve a heavy amount of inter-core communication. When you issue two successive loads to different cache lines, which one is going to return first?
The OoO backend itself manages hazards and ensures that ld/st instructions are retired in the correct order to maintain the processor's memory consistency model. Software can build on top of that, e.g. with fences, to impose stricter consistency models.
edit: to clarify, I claim that the coherency layer (i.e the L1 cache and beyond), does not introduce any reordering issues, at least for common cpus using variants of MESI.
But if you want to make a lock by all means make a lock, just don't go and reinvent the chip architecture...
People who take on task of writing such library, and develop it to the point it's a language-wide standard, usually know what they're doing (or learn on the job :)
Popularity of such library helps test it thoroughly on many platforms in various conditions, so there's a high chance the bugs will be spotted and fixed.
It's more like "if you don't need to, don't invent your own [x]." People who like to invent [x] are usually smart enough to understand why that warning is there to begin with, and don't tend to argue with it.
It's "be competent before you invent it, because it's hard". And if you aren't, then let someone who is do the inventing.
so yes - invent your own synchronization primitives. please.
just dont believe they are correct without being serious about trying to prove they are. and dont hold up your whole project for self-enrichment.
but try to layer as much in as you can.
developers these days are so productive, until they fall down and cant get up. and then they are completely useless.
But that's just bad for running general purpose software that can require more memory than available locally since it means you have to do memory cache management in software which is going to be much slower than letting the hardware do it.
(on ARMv8.0 where you don’t have those, barriers are used more)
TSO pessimization is the only way to make the thing work at a translation time cost that isn’t too high.
If you just want normal variables, write a normal variable, J F. Bastien's change doesn't have any consequences for your normal variables.
> Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'
No. That's the post-increment operator so i++ has the same value as i did before the increment, whereas i = i + 1 has the value of i after the increment.
And you might say "Oh, but I never use those values". Good. J.F. Bastien's change forbids that for volatiles too, we're on the same page.
What you apparently want is to do two volatile operations, one read and one write, sandwiching an arbitrary integer operation. You can do that explicitly of course, Bastien's change doesn't alter that - but when you write it out, doubt appears on the horizon. If these are two operations, can't I race with somebody else such that they modify this thing before I write my modified version back and then I overwrite their changes?
And that doubt should be there. But what we know from asking people is that the compound operators falsely reassure them. That's why they were deprecated and why it's sad that there's enthusiasm to bring them back.
[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p232...
But the face saving is on the part of embedded C++ proponents who'd rather never fix the C++ language than risk that the mundane C code cheap vendors actually ship can't be counted as "C++ support" because it's no longer valid C++.
That's the core of the argument in P2327. Not that this isn't a bad idea (it's obviously a bad idea) or that we shouldn't do better (alternatives to C++ already do better) but that this is the status quo and C++ can't expect to improve upon that. A sad state of affairs.
P2327's dismissal of the problem caused by UART1−>UCSR0B |= (1<<UCSZ01 ); comes down to the usual Real Programmer stance, surely every single embedded programmer will arrange "by construction" that the problem can't happen. No actual examples were examined to see if embedded developers reliably do that, which seems unlikely - we're just invited to imagine that they're all inerrant.
Some people argue for a similar mechanism, something like 'language "C++20" { .. }', that would allow a program to opt in to changes that would otherwise be breaking changes; mostly new keywords. However, changing actual language rules in such a block would be tricky, to say the least.
The differences between arm and x86 are known for 15y+, there is nothing new about it. Also concurrency support is one of the major benefits of languages with proper memory model - java started it with JMM[0]
[0]: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedL...
The JVM absolutely did some great work walking this path, both in defining a memory model in the first place, and supporting that model on weak and strong hardware memory models, but the JMM was specifically designed to be able to run cleanly on WMO platforms to begin with (early Sparc), so they don't face a lot of the same problems discussed here.
Memory models and JVM are not really relevant when discussing running binaries for a different architecture.
If there is a JIT I'd expect to be able to add a read barriers, on memory location allocated by another thread - incl. allocating bits in the pointers and masking them off on each dereference. If any block appears to be shared - the code that allocated it would need to be recompiled with memory store-store barriers; the reading part would need load-load and so on. There are quite a few ways to deal with the case, aside the obvious - make the hardware compatible.
If in end it's not an easy feat to make up for the stronger memory model correctness, yet correctness should be a prime goal of an 'emulator'
MSVC configures the FPU to use 64 bit precision which means that double words fine, but it has no 80 bit long double and float still suffer from excess precision.
SSE avoid all these problems.
The relevant code is:
kahan_y=g_sample_z - kahan_c;
kahan_t=g_sample_z_sum + kahan_y;
kahan_c=(kahan_t - g_sample_z_sum) - kahan_y;
g_sample_z_sum=kahan_t;
(this is in an inner loop where a new g_sample_z is calculated and then added to a running g_sample_z_sum with this snippet)Changing the three kahan_* variables to volatile makes this work (slowly) with -Ofast.
#include <stdio.h>
int main(int argc, char **argv) {
int i;
double sample, sum;
double kahan_y, kahan_t, kahan_c;
// initial values
sum=0.0;
sample=1.0; // start with "large" value
for (i=0; i <= 1000000000; i++) { // add 1 large value plus 1 billion small values
// Kahan summation algorithm
kahan_y=sample - kahan_c;
kahan_t=sum + kahan_y;
kahan_c=(kahan_t - sum) - kahan_y;
sum=kahan_t;
// pre-load next small value
sample=1.0E-20;
}
printf("sum: %.15f\n", sum);
}Something like:
[[gnu::optimize("no-associative-math")]]
double kahanSummation() {
...
}
That way the compiler applies all the optimizations it can but only turns off associative math. This should work on Clang & GCC & be net faster in all cases.This is what I mean by "If you're sprinkling volatile around, you probably aren't doing what you want" and are just cargo culting bad advice.
[1] https://stackoverflow.com/questions/26266820/in-clang-how-do... [2] https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attrib...
- For low level hardware grovelling, you need a sufficient understanding of the workings of modern chip internals.
- For crypto, another "don't invent your own" thing, probably a specialised degree in cryptography and a lot of practical experience.
The bar is simply higher than "I have access to a programming language and think my idea would work".
but I have no formal education. and when I started there were no gatekeepers. I mean there were, the MIT grads got to work on much cooler stuff than everyone else.
since then I've collaborated on chip designs and written a metric shitton of low level software - including microcode and alot of parallel synchronization primitives. and when I work with new grads (incl phds), I don't expect them to know any of that really - and we teach them. some of them get it and some of them don't.
not only do I find your emphasis on the 'proper training' misguided - I think its counter productive. unless you're working with a very rare professor, they aren't going to really have a very adequate notion of what new designs are like at all. most of them have had only the most casual experience with industry. your utility to me as a future systems programmer has alot more to do with your general level of talent and your interest in the topic than whether or not someone made you do dining philosophers.