A bug that doesn’t exist on x86: Exploiting an ARM-only race condition (github.com)

291 points by stong1 4 years ago | 134 comments

anyfoo 4 years ago |

Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was with a naive implementation of Peterson's algorithm.[1]

I have a feeling that we will see a sharp rise of stories like this, now that ARM finds itself in more places which were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.

[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, or even invent your own. They were not system level nor high perf code programmers, so they had that luxury.

gpderetta 4 years ago | |

But Peterson's algorithm requires explicit memory barriers even on x86, it doesn't seem the best example to show the difference.

anyfoo 4 years ago | | |

Here are my slides from back then: https://reinference.net/mp-talk.pdf

You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides and turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time, and C11 atomics aren't mentioned yet either, but their old OS-specific support is).
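A minimal C++ sketch of the idea discussed above, for readers who want something concrete. It is not the code from the linked slides (those are not reproduced here); the names `PetersonLock` and `run_peterson_demo` are made up for illustration. It implements Peterson's algorithm for two threads with `std::atomic` and the default sequentially consistent ordering, which leaves it to the compiler to emit whatever barriers each target needs, instead of placing an MFENCE or an ARM barrier by hand.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's algorithm for exactly two threads (ids 0 and 1).
// Every atomic operation uses the default std::memory_order_seq_cst.
class PetersonLock {
    std::atomic<bool> interested_[2] = {{false}, {false}};
    std::atomic<int> turn_{0};

public:
    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        interested_[self].store(true);  // announce intent to enter
        turn_.store(other);             // give priority to the other thread
        // Wait while the other thread wants in and it is its turn.
        while (interested_[other].load() && turn_.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int self) { interested_[self].store(false); }
};

// Two threads each increment a plain (non-atomic) counter `iterations`
// times while holding the lock; returns the final counter value.
long run_peterson_demo(long iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int self) {
        for (long i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

Calling `run_peterson_demo(50000)` should return `100000`. Weakening the stores and loads to relaxed ordering, or replacing the atomics with plain or `volatile` variables, removes the guarantee this sketch relies on.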

mwcampbell 4 years ago | |

> Don't try to avoid synchronization primitives, or even invent your own.

Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.

cesarb 4 years ago | | |

Unless you're the maintainer of the parking_lot crate, you're not "inventing your own". And since parking_lot is AFAIK the second most popular implementation of mutexes and RW locks in Rust (the most popular one being obviously the one in the Rust standard library, which wraps the OS-provided lock implementations), you can assume it's well tested.

stjohnswarts 4 years ago | |

I doubt that. ARM processors already far outnumber x86 ones if we count processors “in operation” rather than historically. These stories will become more common, but we certainly won't see a “sharp increase”.

codeflo 4 years ago | | |

This sort of bug only happens when running a multithreaded program (with shared memory) on a multicore processor.

You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.

My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.

retrac 4 years ago | | |

Of course. Any such flaws in the Linux kernel or any library used by Android should have been found by now, for example. But the number of ARM processors running developer/server/desktop stacks has been tiny until recently. In my experience, quite a lot of Linux desktop software fails to even build on non-x86_64 machines.

SavantIdiot 4 years ago | | |

The dominant Arm cores in the world are the Cortex-M (or Cortex-R), which are single-core. 99% of the time they sit on a die with far less than 512K of SRAM and run an RTOS or bare metal.

These outnumber x86+Cortex-A by probably a factor of 1,000.

beebmam 4 years ago |

Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.

I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to change an LLVM compiler to target aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.

vitus 4 years ago |

I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.

My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.

Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.

That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.

(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)

One other comment: the explanation in [1] is slightly incorrect:

> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.

We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
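A small C++ sketch of the redundant wrap described above. The article's queue type is not shown in this thread, so `Result`, `pop_result`, and `collect` are hypothetical stand-ins; the only point is that constructing a `unique_ptr` from a `unique_ptr` prvalue compiles even though the copy constructor is deleted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>
#include <vector>

struct Result {
    int value;
};

// unique_ptr cannot be copied, so the wrap below never copies anything.
static_assert(!std::is_copy_constructible<std::unique_ptr<Result>>::value,
              "unique_ptr has a deleted copy constructor");

// Hypothetical stand-in for popping from the results queue:
// it returns a unique_ptr by value, i.e. a prvalue at the call site.
std::unique_ptr<Result> pop_result(int v) {
    return std::make_unique<Result>(Result{v});
}

std::vector<std::unique_ptr<Result>> collect(int n) {
    assert(n >= 0);
    std::vector<std::unique_ptr<Result>> results;
    results.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        // Redundant wrap: the argument is already a unique_ptr prvalue.
        // In C++17 the temporary is initialized directly from it
        // (guaranteed elision); before C++17 the move constructor is used.
        results.push_back(std::unique_ptr<Result>(pop_result(i)));
    }
    return results;
}
```

Writing `results.push_back(pop_result(i));` would do the same thing without the extra wrapper, which matches the point that this is odd to read but not a correctness problem.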

[0] https://github.com/stong/how-to-exploit-a-double-free#0-inte...

[1] https://github.com/stong/how-to-exploit-a-double-free#2-rece...

PaulDavisThe1st 4 years ago |

Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.

Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.

Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.

You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
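To make the "correct memory barriers" concrete, here is a hedged C++ sketch of a single-producer, single-consumer ring buffer in which `head` and `tail` are `std::atomic` indices with acquire/release ordering. It is not the article's queue (that code is not reproduced in this thread); `SpscQueue` and `run_spsc_demo` are illustrative names, and the layout is an assumption.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Fixed-capacity SPSC ring buffer. One slot is kept empty to tell
// "full" from "empty", so it holds at most N - 1 items.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2, "need at least two slots");
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // next slot to read; written only by the consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write; written only by the producer

public:
    // Producer side only.
    bool try_push(const T& v) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[tail] = v;
        // Release: the write to buf_[tail] becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side only.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[head];
        // Release: the read of buf_[head] completes before the slot is handed back.
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }
};

// Returns true if the consumer received 0, 1, ..., count - 1 in order.
bool run_spsc_demo(int count) {
    assert(count >= 0);
    SpscQueue<int, 64> q;
    bool in_order = true;
    std::thread producer([&] {
        for (int i = 0; i < count; ++i) {
            while (!q.try_push(i)) std::this_thread::yield();
        }
    });
    std::thread consumer([&] {
        for (int expected = 0; expected < count; ++expected) {
            int v = -1;
            while (!q.try_pop(v)) std::this_thread::yield();
            if (v != expected) in_order = false;
        }
    });
    producer.join();
    consumer.join();
    return in_order;
}
```

The same structure with `volatile` indices instead of atomics gives the compiler and the CPU no such ordering constraints, which is the objection raised above.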

pcwalton 4 years ago |

Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.

nyanpasu64 4 years ago | |

IMO lock-free MP or MC algorithms are harder to get right than SPSC structures (atomics for shared memory, queues for messaging, triple buffers for tear-free shared memory). But even SPSC algorithms can be tricky; I've found the same (theoretical) ordering error in three separate Rust implementations of triple buffering (one of them mine), written by people who've already learned the ordering rules (which I caught with Loom). And initially learning to reason about memory ordering is a major upfront challenge too.

ohazi 4 years ago | | |

I particularly like lock-free (wait-free?) SPSC queues because they're (relatively) easy to get right, and are extremely useful for buffering in embedded systems. I end up with something like this on almost every project:

One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.

An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].

[1] https://github.com/jamesmunns/bbqueue

[2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d70...

reitzensteinm 4 years ago | | |

I'd be interested in knowing the details of the error!

platinumrad 4 years ago | |

There's no new invention in here. Just an (intentional) misuse of "volatile".

xxs 4 years ago | |

There are tons of lock-free algorithms, both node-based and array-backed. Lock-free is notoriously easier on garbage-collected set-ups, of course.

PaulDavisThe1st 4 years ago | |

This isn't a new lock-free algorithm. Single-reader, single-writer FIFOs are one of the oldest approaches around.

They have to be tweaked when execution isn't guaranteed (by using memory barriers). TFA is about an exploit based on code that hasn't added the required memory barriers.

reitzensteinm 4 years ago |

For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:

https://www.reitzen.com/post/temporal-fuzzing-01/ https://www.reitzen.com/post/temporal-fuzzing-02/

The next step is some lock-free queues, although I haven't gotten around to publishing them!

Azsy 4 years ago |

Have I told you about our lord and savior Rust?

Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.

nyanpasu64 4 years ago | |

Rust doesn't catch memory ordering errors, which can result in behavioral bugs in safe Rust and data races and memory unsafety in unsafe Rust. But Loom is an excellent tool for catching ordering errors, though its UnsafeCell API differs from std's (and worse yet, some people report Loom returns false positives/negatives in some cases: https://github.com/tokio-rs/loom/issues/180, possibly https://github.com/tokio-rs/loom/issues/166).

tialaramex 4 years ago | | |

> which can result in behavioral bugs in safe Rust

For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.

However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
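The acquire-release point can be sketched in C++, whose `std::memory_order` values correspond to Rust's `Ordering` variants. This is an illustration only; `run_message_passing_demo` is a made-up name. One thread publishes plain data behind a flag with a release store, and the other reads it after an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publishes a plain int through an atomic flag and returns what the
// consumer thread observed.
int run_message_passing_demo() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release: the write to payload is ordered before the flag store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Acquire: once true is observed, the producer's earlier writes
        // are visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });
    producer.join();
    consumer.join();
    assert(ready.load(std::memory_order_relaxed));
    return seen;
}
```

With both orderings changed to `std::memory_order_relaxed`, the read of `payload` is no longer ordered after the write. In C++ that is a data race and therefore undefined behavior, even if a given x86-64 build happens to behave as intended.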

CodesInChaos 4 years ago | | |

It doesn't catch all of them. But data-races on plain memory access are impossible in safe rust.

And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).

Fiahil 4 years ago | | |

I think it's fixable; the main reactor is what matters. You can add or remove as many synchronisation primitives as you like.

Other tooling, like Jepsen, will interact with your program at a higher level.

0xfaded 4 years ago |

My first gen threadripper occasionally deadlocks in futex code within libgomp (gnu implementation of omp). Eventually I gave up and concluded it was either a hardware bug or a bug that incorrectly relies on atomic behaviour of intel CPUs. I eventually switched to using clang with its own omp implementation and the problem magically disappeared.

silisili 4 years ago |

> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.

Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.

krylon 4 years ago | |

Heraclitus, mumbling into his beard: "Told you so!"

SCNR

vlovich123 4 years ago | |

A hundred times is not that much except for really cold code paths. It’s probably in the billions if not more and I have to imagine that software level effects typically swamp HW-level effects here. That’s why you see software typically having a performance deviation no greater than ~5-10% unless you’re running microbenchmarks.

agalunar 4 years ago |

Great write-up!

There may be a typo in section 3:

> It will happily retire instruction 6 before instruction 5.

If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").

colejohnson66 4 years ago | |

You are correct. The retire unit ensures that all micro-ops are retired in order.

stong1 4 years ago | |

Nice catch. I fixed it. I should have said "execute" rather than "retire".

gpderetta 4 years ago |

The best part is that the original code is not safe even on x86, as the compiler can still reorder non-volatile accesses to the backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.

tl;dr: just use std::atomic.

[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity

secondcoming 4 years ago |

There is a proposal (possibly accepted) to deprecate 'volatile' in C++.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...

half-kh-hacker 4 years ago |

this slaps. I always see perfect blue a few places above us!

nyanpasu64 4 years ago | |

Context for downvoters: "perfect blue" is the CTF group writing this article, and "a few places" means CTF team rankings in competitions.

cookiewill 4 years ago |

Is it normal for the .got.plt section to be writable rather than read-only?

amelius 4 years ago |

Does the race condition exist when emulating x86 on Apple M1?

mmwelt 4 years ago | |

Apple added hardware support for x86 memory semantics.

https://news.ycombinator.com/item?id=28731534

https://mobile.twitter.com/ErrataRob/status/1331735383193903...

saagarjha 4 years ago | |

No. Rosetta emulates TSO correctly.

addaon 4 years ago | | |

To draw together the two answers here to the original question:

1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.

2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.

sydthrowaway 4 years ago |

Any good references on low level details on ARMv8+?

im3w1l 4 years ago |

And arm-windows will (does already?) run x86 binaries with weaker memory ordering than they were written for. So this could be a real thing soon.

drcongo 4 years ago |

Nice try Intel.

    #include <memory>

    struct Foo {
        // lots of stuff here ...
    };

    struct A {
        Foo* f = new Foo;
        ~A() { delete f; }
    };

    struct B {
        std::unique_ptr<Foo> f = std::make_unique<Foo>();
        // no need to define a dtor; the default dtor is fine
    };

    #include <stdio.h>

    int main(int argc, char **argv) {
        int i;
        double sample, sum;
        double kahan_y, kahan_t, kahan_c;

        // initial values
        sum = 0.0;
        kahan_c = 0.0;  // compensation term must start at zero
        sample = 1.0;   // start with "large" value

        for (i = 0; i <= 1000000000; i++) {  // add 1 large value plus 1 billion small values
            // Kahan summation algorithm
            kahan_y = sample - kahan_c;
            kahan_t = sum + kahan_y;
            kahan_c = (kahan_t - sum) - kahan_y;
            sum = kahan_t;

            // pre-load next small value
            sample = 1.0E-20;
        }
        printf("sum: %.15f\n", sum);
        return 0;
    }