Go’s race detector has a mutex blind spot

Go’s race detector has a mutex blind spot (doublefree.dev)

79 points by GarethX 342 days ago | 57 comments

thwarted 338 days ago |

The race detector has always only worked at run time, and is documented to detect concurrent memory accesses. This means the memory has to actually be accessed in order for it to see the race condition. It does not do static analysis.

https://go.dev/blog/race-detector

> Because of its design, the race detector can detect race conditions only when they are actually triggered by running code, which means it’s important to run race-enabled binaries under realistic workloads.

This isn't a mutex blind spot. It is a side effect of goroutine/thread scheduling, which will obviously be based on workload and other factors. There are a bunch of other cases, that are not mutex related, that it won't see unless execution actually triggers them.
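A minimal sketch of one such non-mutex case, assuming a hypothetical `process` function (not from the article): the unsynchronized write sits on a branch that only negative inputs reach, so a `-race` run that never feeds it a negative number reports nothing.

```go
package main

import (
	"fmt"
	"sync"
)

// process is a hypothetical example. It sums the inputs under a mutex,
// except on a rarely taken branch that writes without holding the lock.
func process(inputs []int) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total int
	)
	for _, n := range inputs {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if n < 0 {
				// BUG: unsynchronized write. The race detector can only
				// report it when some run actually reaches this line
				// concurrently with another access to total.
				total--
				return
			}
			mu.Lock()
			total += n
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return total
}

func main() {
	// Only non-negative inputs: the racy branch is never executed,
	// so there is nothing for the detector to observe.
	fmt.Println(process([]int{1, 2, 3})) // prints 6
}
```

The bug is still in the binary; the workload just never triggered it.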

Jyaif 340 days ago |

I always run my Go code with `-race`, but I feel more comfortable writing C++ multithreaded code than Go thanks to Clang's thread safety annotations (`__attribute__((guarded_by(guard)))` and others in the family).

The annotations also help me discover patterns: when most of the functions of a class have the same annotations, maybe it means that all the functions of the class should have the same annotations.

I really wish an equivalent to those annotations came to Go.

TheDong 340 days ago |

You're using Go's race detector wrong if you expect it to actually catch all races. It doesn't, it can't, it's a best effort thing.

The right way to use the go race detector is:

1. Only turn it on in testing. It's too slow to run in prod to be worth it, so only in testing. If your testing does not cover a use-case, tough luck, you won't catch the race until it breaks prod.

2. Have a nightly job that runs unit and integ tests, built with -race, and without caching, and if any races show up there, save the trace and hunt for them. It only works probabilistically for almost all significant real-world code, so you have to keep running it periodically.

3. Accept that you'll have, for any decently sized go project, a chunk of mysterious data-races. The upstream go project has em, most of google's go code has em, you will too. Run your code under a process manager to restart it when it crashes. If your code runs on users' devices, gaslight your users into thinking their ram or processor might be faulty so you don't have to debug races.

4. Rewrite your code in rust, and get something better than the go race detector every time you compile.

The most important of those is 3. If you don't do anything else, do 3 (i.e. run your go code under systemd or k8s with 'restart=always').

klabb3 340 days ago | |

> Rewrite your code in rust, and get something better than the go race detector every time you compile.

Congrats, rustc forced you to wrap all your types in Arc<Mutex<_>>, and you no longer have data races. As a gift, you will get logical race conditions instead, that are even more difficult to detect, while being equally difficult to reproduce reliably in unit tests and patch.

Don’t get me wrong, Rust has done a ton for safety and pushed other languages to do better. I love probably 50% of Rust. But Rust doesn’t protect against logical races, livelocks, deadlocks, and so on.

To write concurrent programs that meet the same standards of testability, composability, expressiveness, etc. that we expect of sequential programs is really, really difficult. Either we need new languages, frameworks, or (best case) design and architectural patterns that are easy to apply. As far as I’m concerned, large-scale general-purpose concurrent software development is an unsolved problem.

TheDong 340 days ago | | |

As a sibling said, Go has all the same deadlocks, livelocks, etc you point out that rust doesn't cover, in addition to also having data-races that rust would prevent.

But, also, Go has way worse semantics around various things, like mutexes, making it much more likely deadlocks happen. Like in go, you see all sorts of "mu.Lock(); f(); mu.Unlock()" type code, where if it's called inside an `http.Handler` and 'f' panics, net/http recovers the panic but the mutex stays locked, so every later caller of that code blocks forever. In go, panics are the expected way for an http middleware to abort a request ("panic(http.ErrAbortHandler)"). In rust, panics are expected to actually be fatal.
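A small sketch of that failure mode, with hypothetical helper names (`withoutDefer`, `withDefer`, `stillLocked`). The recovering closure stands in for the per-request recover that net/http performs; `TryLock` needs Go 1.18 or later and is used here only to inspect the mutex afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// withoutDefer mirrors "mu.Lock(); f(); mu.Unlock()": if f panics,
// Unlock is skipped and the mutex stays locked.
func withoutDefer(mu *sync.Mutex, f func()) {
	mu.Lock()
	f()
	mu.Unlock()
}

// withDefer releases the mutex even when f panics.
func withDefer(mu *sync.Mutex, f func()) {
	mu.Lock()
	defer mu.Unlock()
	f()
}

// stillLocked runs call with a panicking f, swallows the panic the way a
// recovering caller would, and reports whether the mutex was left locked.
func stillLocked(call func(*sync.Mutex, func())) bool {
	var mu sync.Mutex
	func() {
		defer func() { _ = recover() }()
		call(&mu, func() { panic("boom") })
	}()
	if mu.TryLock() {
		mu.Unlock()
		return false
	}
	return true
}

func main() {
	fmt.Println(stillLocked(withoutDefer)) // true: the next Lock would block forever
	fmt.Println(stillLocked(withDefer))    // false: defer ran during the panic
}
```

`defer mu.Unlock()` right after `Lock` fixes this particular case, but nothing in the language forces you to write it.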

Rust's mutexes also gate "ownership" of the inner object, which make a lot of trivial deadlocks compiler errors, while go makes it absolutely trivial to forget a "mu.Unlock" in a specific codepath and call 'Lock' twice in a case rust's ownership rules would have caught.

In practice, for similarly sized codebases and similarly experienced engineers, I see only a tiny fraction of deadlocks in concurrent rust code when compared to concurrent go code, so regardless of whether it's an "unsolved problem", it's clear that in reality, there's something that's at least sorta working.

CodeBrad 340 days ago | | |

I may be biased, as I definitely love more than 50% of Rust, but Go also does not protect against logical races, deadlocks, etc.

I have heard positive things about the loom crate[1] for detecting races in general, but I have not used it much myself.

But in general I agree, writing correct (and readable) concurrent and/or parallel programs is hard. No language has "solved" the problem completely.

[1]: https://crates.io/crates/loom

pkolaczk 340 days ago | | |

I wrote plenty of concurrent Rust code and the number of times I had to use Arc<Mutex> is extremely low (maybe a few times per thousand lines).

As for your statement that concurrency is generally hard - yes it is. But it is even harder with data races.

catigula 340 days ago | | |

If it's solved the solution has been discarded at some point by other developers for being too cumbersome, too much effort, and therefore in violation of some sacred principle of their job needing to be effortless.

ViewTrick1002 340 days ago | | |

A well formed Go program would have the same logical race conditions to manage as well.

The Arc is only needed when you truly need to mutably share data.

Rust, like Go, has a full suite of channels and other patterns for sharing data.

empath75 340 days ago | | |

> Congrats, rustc forced you to wrap all your types in Arc<Mutex<_>>, and you no longer have data races.

Or you can just avoid shared mutable state, or use channels, or many of the other patterns for avoiding data races in Rust. The fun thing is that you can be sure that no matter what you do, as long as it's not unsafe, it will not cause a data race.

mr90210 340 days ago | | |

> Congrats, rustc forced you to wrap all your types in Arc<Mutex<_>>

Also, don’t people know that a Mutex implies lower throughput depending on how long said Mutex is held?

Lock-free data structures/algorithms are an attempt to address the drawbacks of Mutexes.

https://en.wikipedia.org/wiki/Lock_(computer_science)#Disadv...

ViewTrick1002 340 days ago | |

The "Data Race Patterns in Go" article from Uber is always a scary read.

https://www.uber.com/blog/data-race-patterns-in-go/

stouset 338 days ago | | |

This is more fuel for my thesis that every single feature of golang was considered in isolation, and zero thought was put into how any of them would work together.

I’m not sure how else you can explain perfectly idiomatic code (a loop, or a reused err variable, or a closure) causing a program to fall on its face simply by dropping in go’s namesake feature, whose entire purpose was supposed to be that you could simply drop it in.

To actually use `go` you have to do minor contortions like always remembering to copy your loop variables, make new error variables, and also not accidentally capture any external variables in a closure.
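Roughly what those contortions look like, as a sketch with a hypothetical `parseAll` helper: the loop variables are passed as arguments, `err` is declared inside the goroutine with `:=`, and each goroutine writes only to its own slice element. Since Go 1.22 the `for` loop creates a fresh variable per iteration (for modules that declare go 1.22 or later), so the explicit copy mostly matters on older toolchains.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// parseAll is a hypothetical example that parses each input in its own
// goroutine without sharing loop or error variables between them.
func parseAll(inputs []string) ([]int, []error) {
	vals := make([]int, len(inputs))
	errs := make([]error, len(inputs))

	var wg sync.WaitGroup
	for i, s := range inputs {
		wg.Add(1)
		// i and s are passed as arguments, so each goroutine works on
		// its own copies instead of capturing the loop variables.
		go func(i int, s string) {
			defer wg.Done()
			// := declares a goroutine-local err; reusing an outer err
			// here would be a concurrent write to a shared variable.
			v, err := strconv.Atoi(s)
			// Each goroutine writes to a distinct element, so these
			// writes do not conflict with one another.
			vals[i], errs[i] = v, err
		}(i, s)
	}
	wg.Wait()
	return vals, errs
}

func main() {
	vals, errs := parseAll([]string{"1", "x", "3"})
	fmt.Println(vals, errs[1] != nil) // [1 0 3] true
}
```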

Go doesn’t actually help you with any of this, of course. You just have to remember to do it right every single time. None of these things are hard (usually), but the fact that you have to do them at all speaks volumes about the amount of forethought that went into it.

And of course doing all of those steps doesn’t save you if one of the things you tried to copy secretly contains a pointer inside of it, like absolutely everything in golang does. You didn’t know, and it wasn’t even a public member so it wasn’t in the docs. But there was a pointer somewhere deep inside the thing you copied so now you’ve got unguarded concurrent mutation of shared memory.
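A sketch of the hidden-pointer trap, using a made-up `stats` type: assigning the struct copies the map header, not the entries, so the "copy" and the original still share memory. The `clone` method is one hypothetical way to get a real copy.

```go
package main

import "fmt"

// stats looks like a plain value type, but the map field is a reference:
// copying the struct copies the map header, not the entries.
type stats struct {
	name   string
	counts map[string]int
}

// clone returns a copy that shares no map storage with the receiver.
func (s stats) clone() stats {
	c := stats{name: s.name, counts: make(map[string]int, len(s.counts))}
	for k, v := range s.counts {
		c.counts[k] = v
	}
	return c
}

func main() {
	orig := stats{name: "a", counts: map[string]int{"hits": 1}}

	shallow := orig          // plain struct copy
	shallow.name = "b"       // independent: strings are copied by value
	shallow.counts["hits"]++ // shared: mutates orig's map too
	fmt.Println(orig.name, orig.counts["hits"]) // a 2

	deep := orig.clone()
	deep.counts["hits"]++ // only touches the clone
	fmt.Println(orig.counts["hits"], deep.counts["hits"]) // 2 3
}
```

Hand `shallow` to another goroutine and both sides are now writing to the same map without synchronization.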

aleksi 340 days ago | |

> It's too slow to run in prod to be worth it

I disagree there. It is reasonable to run a few service instances with a race detector. I have a few services where _all_ instances are running with it just fine.

franticgecko3 340 days ago | |

> Have a nightly job that runs unit and integ tests

Not enough IMHO.

We run all tests on developer machines and CI with -race. Always.

It's probabilistic, so every developer 'make test' and every 'git push' is coverage.

onionisafruit 340 days ago | |

I configure ci to run tests with -race and that works out pretty well. I value short ci runs, so testing with -race is a sacrifice for me even if it only adds ~10 seconds typically. I like your idea of a regular job that runs without caching, but your best tip is gaslighting users. Maybe I should start prefixing error messages with “look what you made me do”.

Xeoncross 340 days ago |

I'm so glad to be out of the dark ages of parallelism. Complaining about Go's race detector or exactly which types of logical races Rust can't prevent is such a breath of fresh air compared to all those other single-core languages we're paid to write with that had threading, async, or concurrency bolted-on as an afterthought.

I can only hope Go and Rust continue to improve until the next language generation comes along to surpass them. I honestly can't wait, things improved so much already.

reactordev 340 days ago |

    if id == 1 {
        counter++;
    }

Found your problem. /s

In all honesty, if you “do work” using channels then all your goroutines are “thread safe” as the channel keeps things in order. Also, mutex is working as intended. As you see in your post, -race sees this, it’s good. Now have one goroutine read from a chan, get rid of the mutex, all other goroutines write to the chan, perfection.
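Something like this, as a rough sketch (the function and channel names are made up, and the counter stands in for whatever the post's goroutines were updating): one goroutine owns the counter and is the only one that touches it, everyone else just sends.

```go
package main

import (
	"fmt"
	"sync"
)

// countViaChannel has workers send increments on a channel. A single
// goroutine owns counter, so no mutex is needed.
func countViaChannel(workers, perWorker int) int {
	incs := make(chan int)
	done := make(chan int)

	// Owner goroutine: the only code that reads or writes counter.
	go func() {
		counter := 0
		for n := range incs {
			counter += n
		}
		done <- counter
	}()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				incs <- 1
			}
		}()
	}

	wg.Wait()    // all workers have finished sending
	close(incs)  // lets the owner's range loop end
	return <-done
}

func main() {
	fmt.Println(countViaChannel(4, 1000)) // 4000
}
```

This only covers state that lives entirely inside the owner goroutine; anything the workers share outside the channel still needs its own synchronization.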