Programming Language Memory Models (research.swtch.com)
While on CPU sequentially consistent semantics are efficient to implement, that seems to be much less true on GPU. Thus, Vulkan completely eliminates sequential consistency and provides only acquire/release semantics[1].
It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way that introduces deadlock (see the reddit discussion linked from [2]). I was curious enough about this that I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by the API, but not for mutexes built out of primitives). I'll also note that AcqRel semantics are not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire is not adequate to lock the spinlock, it would likely need to go to SeqCst.
Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.
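For concreteness, the kind of spinlock being discussed can be sketched in C++ roughly as follows. This is only an illustration, not the code from the linked discussion: the names SpinLock and count_with_spinlock are made up here, and the sketch says nothing about whether the same pattern is deadlock-free under the Vulkan memory model.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spinlock: acquire to take the lock, release to drop it.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        // Acquire ordering on success; a failed attempt only needs relaxed.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;  // the failed exchange overwrote `expected`
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};

// Illustrative driver: several threads bump a plain int under the lock.
int count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    int counter = 0;  // not atomic; protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling count_with_spinlock(4, 10000) should return 40000 if the lock provides mutual exclusion. The open question in this thread is about reordering across two different locks, which this single-lock sketch does not exercise.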
We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.
[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...
I based my claim about Rust from https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?
Taking a lock only needs to be an acquire operation and a compiler barrier for other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21 the current argument for why taking a lock only requires acq semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting twitter thread where we discuss this I can't find anymore :(.
#include <stdio.h>

int stop = 1;

void maybeStop() {
    if (stop)
        for (;;);
}

int main() {
    printf("hello, ");
    maybeStop();
    printf("world\n");
}

into

int main() {
    printf("hello, world\n");
}

(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop? Maybe there is a reason, but none of this inspires confidence.
Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.
It's accepted by the compiler, but if provided, it compiles to a panic.
Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it.
---------
From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).
--------
So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.
Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization.
Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.
In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.
Another somewhat recently posted (but years-old) page with different but related content is 'Memory Models that Underlie Programming Languages': http://canonical.org/~kragen/memory-models/
a few previous hn discussions of that one:
https://news.ycombinator.com/item?id=17099608
This is not true for Java; see
http://gee.cs.oswego.edu/dl/html/j9mm.html
https://docs.oracle.com/en/java/javase/16/docs/api/java.base...
If you want to test out weaker acquire/release semantics, you need to buy an ARM or POWER9 processor.
As I mentioned in the post (https://research.swtch.com/plmm#sc), Herb Sutter claimed in 2017 that POWER was going to do something to make SC atomics cheaper. If it did, then that might end up being cheaper than the old sync-based acq/rel too, same as ARM, in which case we'd end up with SC = acq/rel on both ARM and POWER. It looks like that didn't happen, but I'd be very interested to know what did, if anything.
Conversely acq/rel are from somewhat to very expensive to implement on ARM/POWER.
The idea to extend programming languages and type systems in that direction is not new: folk who've been using distributed computing for computations have to think about this already, and could teach a few things to folk who use shared memory multi-processors.
Here's an idea for ISA primitives that could help a language group variables together: bind/propagate operators on (combinations of) address ranges. https://pure.uva.nl/ws/files/1813114/109501_19.pdf
All variables inside of an object (aka: any class) are assumed to be related to each other. synchronized(foobar_object){ baz(); } ensures that all uses of foobar_object inside the synchronization{} area are sequential (and therefore correct).
--------
The issue is that some people (a minority) are interested in "beating locks" and making something even more efficient.
synchronized(foobar_object){ foo(); }
synchronized(foobar_object){ bar(); }
synchronized(foobar_object){ baz(); }
Will have foo, bar, baz methods well behaved in any data that they share, regardless of whether they are foobar methods or methods of any other class(es). It is exactly analogous to the S(a) -> S(a) synchronizing instruction from the article that establishes a happens-before, partitioning each thread into before/after the S(a).
The only time synchronized(explicit_object) relates to anything else is when the keyword is also used on a method, where `synchronized void foo()` is equivalent (with a minor performance difference) to `synchronized(this) { ... }` wrapping the entire body of the foo method.
You can read more about this here if you're interested: https://www.isa-afp.org/entries/JinjaThreads.html
AKA why can't I stumble upon such stuff more often. Thanks OP!
Alternative solution: Forget all the "atomic" semantics and simply avoid "optimization" of global variables. Access to any global variable should always occur direct from memory. Sure, this will be less than optimal in some cases but such is the price of using globals. Their use should be discouraged anyway.
In other words, make "atomic" the sensible and logical default with globals. Assignment is an "atomic" operation, just don't circumvent it by using a local copy as an "optimization".
The common programmer does not understand that you've just transformed their program - for which they were taught merely that multiple threads need synchronization - into a new game, which has an entire separate specification, where every shared variable obeys a set of abstruse rules revolving around the happens-before relationship. Locks, mutexes, atomic variables are all one thing. Fences are a completely different thing. At least in the way most people intuit programs to work.
Go tries to appeal to programmers as consumers (that is, when given a choice between cleaner design and pleasing the user who just wants to "get stuff done", they choose the latter), but yet also adds in traditional complexities like this. Yes, there is performance trade off to having shared memory behave intuitively, but that's much better than bugs that 99% of your CHOSEN userbase do not know how to avoid. Also remember Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array) despite the rest of the language being memory-safe. Multiply that by the "memory model".
Edit: forgot spaces between paragraphs.
Go has no VM but it has a GC. WASM has a VM but no GC.
Everything has been tried and Java still kicks everything's ass to the moon on the server.
Fragmentation is bad, let's stop using bad languages and focus on the products we build instead.
"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html
"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.
Indeed.
Just a reminder to everyone: your pthread_mutex_lock() and pthread_mutex_unlock() functions already contain the appropriate compiler / cache memory barriers in the correct locations.
This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.
This is the stuff of cutting edge research right now: it's a niche subject. Your typical programmer _SHOULD_ just stick a typical pthread_mutex_t onto an otherwise single-threaded data-structure and call it a day. Locks work. They're not "the best", but "the best" is constantly being researched / developed right now. I'm pretty sure that any new lockfree data-structure with decent performance is pretty much instant Ph.D. thesis material.
-----------
Anyway, the reason why "single-threaded data-structure behind a mutex" works is because your data-structure still keeps all of its performance benefits (from sticking to L1 cache, or letting the compiler "manually cache" data to registers when appropriate), and then you only lose performance when associated with the lock() or unlock() calls (which will innately have memory barriers to publish the results)
That's 2 memory barriers (one barrier for lock() and one barrier for unlock()). The thing about lock-free algorithms is that they __might__ get you down to __1__ memory barrier per operation if you're a really, really good programmer. But it's not exactly easy. (Or: they might still have 2 memory barriers but the lockfree aspect of "always forward progress" and/or deadlock free might be easier to prove)
Writing a low-performance but otherwise correct lock free algorithm isn't actually that hard. Writing a lock free algorithm that beats your typical mutex + data-structure however, is devilishly hard.
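As a rough sketch of the "single-threaded data-structure behind a mutex" approach in C++ (using std::mutex rather than pthread_mutex_t; the names LockedLog and fill_log are invented for illustration):

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: an ordinary std::vector guarded by one mutex.
class LockedLog {
    std::mutex m;
    std::vector<int> items;  // plain single-threaded container
public:
    void append(int value) {
        std::lock_guard<std::mutex> guard(m);  // locks here, unlocks at scope exit
        items.push_back(value);
    }
    std::size_t size() {
        std::lock_guard<std::mutex> guard(m);
        return items.size();
    }
};

// Illustrative driver: each thread appends `per_thread` values.
std::size_t fill_log(int threads, int per_thread) {
    LockedLog log;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&log, per_thread, t] {
            for (int i = 0; i < per_thread; ++i) {
                log.append(t * per_thread + i);
            }
        });
    }
    for (auto& th : pool) th.join();
    return log.size();
}
```

The container code itself stays unaware of threads; all the ordering guarantees come from the lock and unlock around each operation, which is the point being made above.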
---------
"Volatile" is close but not good enough semantically to describe what we want. That's why these new Atomic-variables are being declared with seqcst semantics (be it in Java, C++, C, or whatever you program in).
That's the thing: we need a new class of variables that wasn't known 20 years ago. Variables that follow the sequential-consistency requirements, for this use case.
---------
Note: on ARM systems, even if the compiler doesn't mess up, the L1 cache could mess you up. ARM has multiple load/store semantics available. If you have relaxed (default) semantics on a load, it may be on a "stale" value from DDR4.
That is to say: your loop may load a value into L1 cache, then your core will read the variable over and over from L1 cache (not realizing that L3 cache has been updated to a new value). Not only does your compiler need to know to "not store the value in a register", the memory-subsystem also needs to know to read the data from L3 cache over-and-over again (never using L1 cache).
Rearranging loads/stores on x86 is not allowed in this manner. But ARM is more "Relaxed" than x86. If reading the "Freshest" value is important, you need to have the appropriate memory barriers on ARM (or PowerPC).
Since as you say, they are very similar, wouldn't it be reasonable to assume for access purposes that they are effectively global?
What if your function takes a pointer that might be pointing to a global variable? Does that mean that all accesses through a pointer are now exempt from optimization unless the compiler can prove that the pointer will never point to a global variable?
Pointers can be used to circumvent most safety measures. If you obscure the access, you should assume responsibility for the result.
It would be nice if sometime we stopped pretending that beginners are too slow to know/understand things and instead faced the fact that their instructors and mentors are bad at teaching.
Also, maybe you are different, but I can only keep so much in my head at a time. If I can keep something simple or abstract it away so I can focus on other details, that doesn't make me a dilettante. It makes me more effective at what I'm actually trying to do.
Source?
The "Exploiting the slice type" section.
From my perspective, Go in the context of serverless programming seems to currently be the best choice for server-side programming.
In the next 20 years I expect Go will be supplanted by a language which is a lot like go (automatic memory management, simple, easy to learn & write and performant enough) but with the addition of algebraic data types, named parameters, and a slightly higher level of abstraction.
I'd love for this to be Crystal: https://crystal-lang.org/
> I haven't seen server work being done in Java in ages.
In the meantime, I've been doing a large amount of Java backend server work for the past 10 years.
What have you built with go that is interesting?
To each their own.
x86 cannot specify a load/store any more relaxed than total-store ordering (which is even "stronger" than acquire/release)
ARM / POWER9 were originally "consume/release". But upon C++11, the agreement was that consume/release was too complicated, and acquire/release model was created instead.
Java was the granddaddy of modern memory models but focused on Seq-Cst (the strongest model: the one that makes "sense" to most programmers). C++ inherited Java's seq-cst, but recognized that low-level programmers wanted something faster: both "fully relaxed" and acq/rel as the two faster ways to load/store.
Also keep in mind that C++11 specifies std::mutex::lock() to have acquire semantics and unlock() to have release semantics on the lock object. In order for std::mutex to actually work, reordering m1.unlock(); m2.lock(); into m2.lock(); m1.unlock(); must be disallowed. But since m1 and m2 are separate objects, m1.unlock() has no happens-before relationship with m2.lock(). This seems to be a problem in the C++11 memory model. The argument I have heard from some WG21 people is that there is no problem, since transforming a well-formed terminating program into a non-terminating program is not allowed. I can't find the wording in the C++ standard that asserts this. But oh well, it works right now on gcc/llvm/msvc.
C is maybe the only good programming language invented so far. Java was a failed attempt at improving C. I think we're rapidly converging on the second good programming language, and it's not going to have null pointer exceptions.
I highly advise reading "Java Concurrency in Practice".
Note that future Java primitive classes don't have monitors.
public void myFunction(FooObject o){
o.doSomething();
}
How does the compiler know if "FooObject o" is a singleton or not? That's the thing about the "Singleton" pattern, you have an effective "global-ish" variable, but all of your code is written with normal pass-the-object style.
EDIT: If you're not aware, the way this works is that you call myFunction(getTheSingleton());, where "getTheSingleton()" fetches the Singleton object. myFunction() has no way of "knowing" it's actually interacting with the singleton. This is a useful pattern, because you can create unit-tests over the "global" variable by simply swapping the Singleton for a mock object (or maybe an object with preset state for better unit testing). Among other benefits (but also similar downsides to using a global variable: difficult to reason about because you have this "shared state" being used all over the place)
In Java, there's no real difference between a singleton and any other object. A singleton is an object that just happens to have a single instance. Practically speaking, they're typically used as a clever design pattern to "work around" Java's lack of language-level support for global variables, so there's that. But I think that that fact might not be relevant to the issue at hand?
The more basic issue is, if you have two different threads concurrently executing `myFunction`, what happens when they're both operating on the same instance of `FooObject`?
No, aside from the fact that the root commenter clearly understands the issue with global variables, but not necessarily singletons.
I'm trying to use the singleton concept as a "teaching bridge" moment, as the Singleton is clearly "like a global variable" in terms of the data-race, but generalizes to any object in your code.
The commenter I'm replying to seems to think that global-variables are the only kind of variable where this problem occurs. He's wrong. All objects and all variables have this problem.
Also it's dead simple to write parsers and developer tools which can match open and close braces. Handling `end` with an arbitrary opening token (maybe it's `if <...>`, `while <...>` what have you) is objectively more work for your CPU to work with.
Subjectively, it looks dumb to have code which looks like this:
end
end
end
end
end
end
I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...
I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...
I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.
----------
And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.
It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.
> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives
The CUB-library also states:
https://nvlabs.github.io/cub/structcub_1_1_device_scan.html
>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]
Where [1] is a footnote pointing at the exact paper you just linked.
-----------
> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.
That certainly sounds spinlock-ish. At least that gives me what to look for in the code.
Actually, most practitioner code has bugs from their implicit assumptions that shared variable writes are visible or ordered the way they think they are.
To solve that problem, the practitioner only needs to know that "mutex.lock()" and "mutex.unlock()" orders reads/writes in a clearly defined manner. If the practitioner is wondering about the difference between load-acquire and load-relaxed, they've probably gone too deep.
This is true, but they do not know that. If you do not give some kind of substantiation, they will shrug it off and go back to "nah this thing doesn't need a mutex", like with a polling variable (contrived example).
let x = atomic::AtomicU32::new(0);
x.compare_exchange_weak(
    0,
    1,
    atomic::Ordering::AcqRel,
    atomic::Ordering::Relaxed).unwrap();
println!("{}", x.load(atomic::Ordering::Relaxed));
and?
> so whatever it accomplishes is irrelevant.
I have no idea what point you are making. _Of course_ there has to be a bug in the code for there to be a buffer overflow vuln. Or are you objecting that they put contrived code to make the race work better (this is the concept of a PoC)? None of the patterns in that code are unlikely in practice.
The original claim was that "Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption." But that's not the whole picture, you have to violate the memory model, too. And that's not interesting, because if you violate the memory model, literally any consequence is fair game.
Maybe your point is (a) it's easy to violate the memory model, and/or (b) bugs that violate the memory model have surprising consequences? I don't agree with (a); the situation can always be improved, but it's easy to spot and fix data races, and Go provides plenty of tooling for that purpose. And I guess I agree with (b) in the basic sense, but that's just a truism, for the reasons stated above.
So here is my argument, maybe those developers should bother to actually learn about what they are trying to do in first place.
But don't dismiss Go so easily; it hits an interesting sweet spot that may not go away any time soon. It's a simple language with a simple spec, so simple that people are complaining it's too simple a language, yet also simple to use thanks to the GC. But also compiled and fast enough.
But most important of all, it's memory safe and not plagued by undefined behaviour.
Soon (already?) security will mean real money and life or death situation for companies; keeping that much code in a language where nobody can promise a memory corruption will not be introduced in the next commit, is eventually not going to be considered acceptable anymore.
Yes, Go is sponsored by a mega corp, and some people cringe at that, but realistically it's less of a walled garden than C#, Swift or stuff like that.
Rust is likely going to fill the niche currently occupied by C++ but it's quite hard to learn and use.
So yes, it's quite possible that we'll all flock to something new and shiny in 10 years time and forget Go before 20 years have passed. But, whatever replaces Go needs to fill its niche, which if you think about it doesn't have that much free design space left; yes you can improve a few things here and there, but then you have to fight with the massive code base and libraries, that stays relevant due to the absolutely fantastic backward compatibility promises. I've seen C code rot due to compilers getting "better" over time (yes, sure, the C code in question was obviously "wrong", but nobody noticed, because writing correct C code is an exercise in divination)
It is now where C++ was in the early-1990's.
[1] https://docs.oracle.com/javase/specs/jls/se8/html/jls-8.html...
Any consequence is _not_ fair game. "Memory Models" only involve stuff like tossing out sequential consistency [1]. They never say or imply something like "if you have a data race, anything can happen [including executing code on the stack]". Go slices exposing implementation details in a way that makes the language memory-unsafe is a completely different issue. If Go was sequentially consistent (so it had no "Memory Model" to violate), it would still not make the language memory-safe, because it would still write the array pointer and be pre-empted before writing the length.
> And that's not interesting
It matters because all programs have bugs (apparently), and so we'd like them to fail in a less harmful way than executing shellcode submitted by a client.
> it's easy to spot and fix data races, and Go provides plenty of tooling for that purpose.
Never used the data race detector but it probably can only identify low hanging fruit, and is not a substitute for the developer education problem.
Okay I think I see your confusion: You can actually avoid slices causing buffer overflows because the language requires you to have a happens-before relationship for all data shared across threads in the first place. That is, even if you share a boolean across threads, you would be sure to establish a happens-before relationship if you are in the know. However, this does not rebuke my original argument, which assumes that most devs are not in the know. They do not know about slices being unsafe, nor do they know about happens-before. So they are not educated to prevent this mistake. Also, avoiding data races is hard.
They absolutely do.
https://software.intel.com/content/www/us/en/develop/blogs/b...
Violating the memory model gets you undefined behavior.
> However, this does not rebuke my original argument, which assumes that most devs are not in the know. They do not know about slices being unsafe, nor do they know about happens-before
I just don't agree. Go programmers know that nothing is safe for concurrent access unless explicitly noted otherwise. They don't have any confusion about slices requiring synchronization.
Concurrent programming isn't trivial but neither is it impossible. And data races are critical bugs that can be subtle, but are straightforward to identify, and straightforward to fix.
Dilettante programmers certainly do not know the following:
1. Go slices, strings, and interface values are unsafely non-atomic. It's documented on some obscure page (even the spec does not document it AFAIK, which is also broken).
2. What a data race is
Even if they know #1, they will still write code like: modifying a slice within a structure and setting a thread-shared pointer to point to that structure.
Again, most programmers are taught "things need locks, for reasons". At best, they will pointlessly lock things, then another programmer will come "debunk" them and remove the lock because "the thing being locked is atomic". Note how none of this involves any thought of the memory model. That's because they do not know it exists.
As for people who know #2, yes that is enough to avoid memory corruption without needing to know #1, however, they are not sufficiently informed how much data races matter (as executing shellcode is not an expected outcome of writes to your data being non observable).
It's the same as the problem of knowing what data is used in what thread, which is hard and unsolvable by automation, so I doubt it's an easy problem.
I think I know what you mean, but that's a very dangerous way to word it when speaking in public. It would be more correct to say that "all you need to guarantee reads are protected by memory barriers is volatile."
The distinction matters because, to someone who doesn't already know all about volatile, the way you worded it might lead them to believe that `x++;` is an atomic statement if x is volatile, which is not true. That's a specific example of where things like atomic types are necessary.
(For the curious: https://www.baeldung.com/java-atomic-variables)
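To make the `x++` point concrete: in Java the usual fix is an atomic type such as AtomicInteger. The sketch below shows the same idea in C++ with std::atomic (note that C++ `volatile` is a different thing from Java's and gives no atomicity or ordering at all). The function name atomic_increments is made up for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread performs `iterations` increments on one shared counter.
int atomic_increments(int threads, int iterations) {
    std::atomic<int> x{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&x, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // One indivisible read-modify-write. A plain `int` with x++
                // would be a load, an add, and a store, and concurrent
                // updates could be lost (and would be a data race in C++).
                x.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();  // join makes the final value visible
    return x.load();
}
```

Relaxed ordering is enough here only because nothing else is published through the counter; if other data depended on its value, stronger orderings would be needed.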
I think maybe what you're missing about what I'm saying is that I'm trying to mainly talk for the benefit of people who don't have a solid understanding of how to do safe and performant multithreading. Which is the vast majority of programmers. For that sort of audience, I tend to agree with dragontamer that "just use a mutex" is probably the safest advice to start out. Producing results faster doesn't count for much if you're producing wrong results faster.
In C++, you'd have to use OS-specific + compiler-specific routines like InterlockedIncrement64 to get guarantees about when or how it was safe to read/write variables.
Not anymore of course: C++11 provides us with atomic-load and atomic-store routines with the proper acquire / release barriers (and seq-cst default access very similar to Java's volatile semantics).
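A minimal sketch of what that looks like for a "polling variable" in C++11, assuming one producer and one consumer (the function name publish_and_read and the value 42 are made up for illustration):

```cpp
#include <atomic>
#include <thread>

int publish_and_read() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // the polling variable
    int seen = -1;

    std::thread consumer([&] {
        // Acquire load: once it observes true, the earlier write to
        // payload is guaranteed to be visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });
    std::thread producer([&] {
        payload = 42;
        // Release store: publishes everything written before it.
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Leaving out the explicit memory_order arguments gives the seq-cst default mentioned above, which is stronger than this pattern strictly needs.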
-----------
Anyway, put yourself into the mindset of a 2009-era C++ programmer for a sec. InterlockedIncrement works on Windows but not in Linux. You got atomics on GCC, but they don't work the same as Visual Studio atomics.
Answer: Mutex lock and mutex-unlock. And then condition variables for polling. Yeah, it's slower than InterlockedIncrement / seq-cst atomic variables with proper memory barriers, but it works and is reasonably portable. (Well, CriticalSections on Windows because I never found a good pthreads library for Windows)
------
It's still relevant because you still see these thread issues come up in old C++ code.
Java has had volatile variables since the year 2000, I don't see how it's cheating that Java provided a standardized way of accessing a synchronized value before C and C++ did. Can you elaborate on your point that it's cheating?
In C and C++, for 10 years now, there is a standard library providing atomic data types and atomic instructions. Prior to the standardization one used platform specific atomic facilities. boost has provided cross-platform atomic operations that work on virtually every platform since 2002. Prior to 2002 there were no multicore x86 processors. There would have been mainframe computers that were multicore, is it your argument that code written for those mainframes are of relevant use today by fairly typical C and C++ developers?
At any rate, at no point did any of Java, C, or C++ require the use of a mutex in order to properly synchronize access to a "polling variable". Atomic operations were widely available to all three languages in various ways and would have been the preferred method.
This is definitionally correct (shrug)
> Dilettante programmers certainly do not know [that] Go slices, strings, and interface values are unsafely non-atomic.
Yes. They do. As soon as a Go programmer learns that there is such a thing as concurrency and "thread safety" they learn that nothing in Go is "thread safe" by default.
> Go is not memory-safe.
"Memory-safe" is not a precisely defined concept. Go is memory safe by some definitions, not by others.