A safe, non-owning C++ pointer class(techblog.rosemanlabs.com) |
A safe, non-owning C++ pointer class(techblog.rosemanlabs.com) |
Why intentionally design a worse alternative to std::weak_ptr which has been around since C++11??
One reason why this design can still be beneficial when using the standard std::shared_ptr in its implementation, is when you do not want to manage the pointee object by a std::shared_ptr (which is a requirement if you want to use std::weak_ptr). E.g., if you want to ensure that multiple objects of that type are laid out next to each other in memory, instead of scattered around the heap.
Another goal of the post is to show this idea, namely to use a shared_ptr<T*> (instead of shared_ptr<T>), which is kind of non-standard, but can be (as I hope I convinced you) sometimes useful.
Why would there be contention in a single threaded program?
Moving goalpost. But just to follow that thought: Decoupling alloc+init via e.g. placement-new to do this introduces a host of complications not considered in your solution.
If that layout _is_ a requirement, and you don't want a totally nonstandard foundation lib with nonstandard types promiscuously necessitating more nonstandard types, you want a std::vector+index handle.
There are alternative ways to utilize a machine with multiple cores, e.g. by running one thread per CPU core, and not sharing state between those threads; in each such thread you then have single-thread "semantics".
We were happy to use unique_ptr, however.
[0] https://chromium.googlesource.com/chromium/src/+/HEAD/base/m...
Doesn't get much glibber than that!
There is now atomic_shared_ptr which is thread safe.
In this particular case, it was subtly, Rust is preferred because it doesn't allow unsafe memory operations such as the one demonstrated. Really, all it demonstrates is that you can create really bad C++.
> Note that the control block used by std::weak_ptr and std::shared_ptr is thread-safe: different non-atomic std::weak_ptr objects can be accessed using mutable operations, such as operator= or reset, simultaneously by multiple threads, even when these instances are copies or otherwise share the same control block internally. The type T may be an incomplete type.
There’s no variant of shared_ptr / weak_ptr that is non atomic in the standard library AFAIK.
> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).
A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on that platform. But that is orthogonal to cross-core contention.
Note that the difference is not so marginal, and the difference is not just in hardware instructions as the non-atomic operations generally allow for more optimizations by the compiler.
It's comparable to like, two integer multiplies, or a single integer division. Yes, there is some effect on program order.
[0]: https://snf.github.io/2019/02/13/shared-ptr-optimization/
/* If this file is compiled with threads support, it must
#define __GTHREADS 1
to indicate that threads support is present. Also it has define
function
int __gthread_active_p ()
that returns 1 if thread system is active, 0 if not.
I think the mechanism eMSF was describing (and the mechanism in the blogpost I linked) corresponds to __gthread_active_p().I think the distinction between the two should be visible in some cases - for example, what happens for shared libraries that use std::shared_ptr and don't link libpthread, but are later used with a binary that does link libpthread?
Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?
Strange optimization IMHO so I wonder what was the motivation behind it. The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].
[0] https://codebrowser.dev/llvm/include/c++/11/bits/shared_ptr_...
[1] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....
[2] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....
[3] https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...
[4] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....
[5] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....
The line you linked is for some FreeBSD/Solaris versions which appear to have some quirks with the way pthreads functions are exposed in their libc. I think the "normal" implementation of __gthread_active_p is on line 248 [0], and that is a pretty straightforwards check against a weak symbol.
> Strange optimization IMHO so I wonder what was the motivation behind it.
I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.
> The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].
Not entirely sure what you're getting at here? The former is used for single-threaded programs so there's ostensibly no need for atomics, whereas the latter is used for non-single-threaded programs.
[0]: https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...
> I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.
Obviously yes. What I am wondering is what benefit does it bring in practice. Single-threaded program with shared-ptr's using atomics vs shared-ptr's using WORDs seem like a non-problem to me - e.g. I doubt it has a measurable performance impact. Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.
I mean, the blog post basically starts with an example where the performance impact is noticeable:
> I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one.
And:
> I just referenced pthread_create in the program and the reference count became atomic again.
> Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.
So in principle an insert-heavy workload for that data structure could see a noticeable performance impact.
> Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.
Not entirely sure I'd agree? My impression is that while uncontended atomics are not too expensive they aren't exactly free compared to the corresponding non-atomic instruction. For example, Agner Fog's instruction tables [0] states:
> Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
And there's this blog post [1], which compares the performance of various concurrency mechanisms/implementations including uncontended atomics and "plain" code and shows that uncontended atomics are still slower than non-atomic operations (~3.5x if I'm reading the raw data table correctly).
So if the atomic instruction is in a hot loop then I think it's quite plausible that it'll be noticeable.
[0]: https://www.agner.org/optimize/instruction_tables.pdf
[1]: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...