Reference count, don't garbage collect(kevinlawler.com) |
Reference count, don't garbage collect(kevinlawler.com) |
With GC, you can do nothing at all. In a system with lots of garbage, you can do a GC by copying everything accessible from the GC root, then de-allocating all the garbage in a single free.
The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.
So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).
GC is way faster if there is little collection.
In memory or cache intensive applications, garbage collection as a whole can be significantly slower.
The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.
Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beats malloc/free in object churn workloads, it’s likely that the total overhead of counting refs, calling free(), and using a malloc() that isn’t a GC malloc is higher than 2x, I.e. more than 50% of time spent in memory management operations (inc, dec, malloc, free).
It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.
That also implies that if you ask the GC to run with lower memory overhead, it’ll use more than 30% of your execution time. It’s true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you gonna have a slow GC. But that’s not how most GC users run their GCs.
"Increments and decrements happen once and at a predictable time. The GC is running all the time and traversing the universe of GC objects. Probably with bad locality, polluting the cache, etc."
This is only the case with a mark-sweep collector, usually most of your allocations die young in the nursery. With reference counting you pay the counting cost for everything.
"In object-oriented languages, where you can literally have a pointer to something, you simply mark a reference as a weak reference if it might create a cycle."
As someone who has tried to identify memory leaks in production where someone has forgotten to "simply" mark a reference in some deep object graph as weak, this is naive.
"With an 8-byte counter you will never overflow. So...you know...just expand up to 8-bytes as needed? Usually you can get by with a few bits."
So now my "about as minimal as you can get short of nothing at all" check as an unpredictable branch in it?
"If you must overflow, e.g., you cannot afford an 8-byte counter and you need to overflow a 4-byte counter with billions of references, if you can copy it, you create a shallow copy."
I don't even know where to begin with this.
"If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?"
This probably has more to do with finalising resources and deterministic destruction than anything else.
--
Anyone who is interested in actually studying this area would probably find https://courses.cs.washington.edu/courses/cse590p/05au/p50-b... interesting. Also https://gchandbook.org/
Apple has a nice talk on ARC [1] but it got me thinking: if I have to think about reference counting this much I might just as well manage memory all by myself.
The true joy of Garbage Collection is that you can just create objects left and right and let the computer figure out when to clean them up. It's a much more natural way of doing things and lets computers do what they're best at: taking tedious tasks out of the hands of humans.
[1]: https://developer.apple.com/videos/play/wwdc2021/10216/
It might seem that it is simply about pushing your synchronizations problems onto the GC, but the synchronization issue that GC solves internally is different and usually more coarse-grained, so in the end you have significantly smaller synchronization overhead.
Btw, GC is also often RC in disguise. What I mean is that generational garbage collectors are basically a hybrid of tracing GC and RC. See https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... for the details.
Java's GC is concurrent and runs at safe points and stops the world so it avoids this problem.
Racket probably has a state-of-the-art garbage collector. (I don't actually know, but that's where I would start looking.) Clojure obviously has the same garbage collector as any other JVM language.
In one extreme you can build Racket using the Senora GC that is conservative and not moving, that is used only for bootstraping.
On the other extreme, both of the normal versions of Racket have custom moving incremental GC. The docs with some high level explanations are in https://docs.racket-lang.org/reference/garbagecollection.htm...
The implementation details of the main "CS" version are in https://github.com/racket/racket/blob/master/racket/src/Chez... It's a replacement of the default GC of Chez Scheme that has better support for some features that are used in Racket, but I never have looked so deeply in the details.
Yikes
GC is only required if you as a programmer (or programming language) do not provide sufficient information to the compiler or runtime to understand the object graph.
You can find various algorithms in journals or whatnot written with the assumption that there's GC. Algorithms designed with this assumption may not have clear ownership for objects, and those objects my have cyclic references.
It's easy to say, "objects should have clear ownership relationships" but that kind of maxim, like most maxims, doesn't really survive if you try to apply it 100% of the time. Ownership is a tool that is very often useful for managing object lifetimes--it's not always the tool that you want.
Or do you want to manually assign memory addresses to your objects?
If you wanted performance these days, you wouldn't want to go for that architecture. It's a historically accident that they can't really free themselves from because of backwards compatibility.
Apparently it actually led to memory usage improvements in industrial projects like Redis:
The rant at the end can be boiled down to "I use confirmation bias [1] to make my engineering decisions". The OP has already decided that "GC" is slow, so I'm sure every time a runtime with it misbehaves it's "Well, that darn GC, I knew it was bad!" and every time RC misbehaves it's likely "Oh, well you should have nulled out your link here to break the cycle dummy!"
I really don't like such absolutist thinking in software dev. All of software dev is about making tradeoffs. RC and GC aren't superior or inferior to each other, they are just different and either (or both) could be valid depending on the circumstance.
Yes, this is a good point. It makes overly general claims.
E.g. a GC proponent could claim "well, tracing collectors do no work for dead objects, so they have no overhead!" Which is a good point, but not the whole story. Tracing collectors may need to repeatedly traverse live objects. Sure. But then generational collectors only traverse modified live objects that point to new objects. True. And concurrent collectors can trace using spare CPU resources, incremental collectors can break marking work up into small pauses, on and on. There are zillions of engineering tradeoffs and the GC Handbook covers most of them really well.
But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index.
If you use a language with GC, you're probably just contributing to global warming.
(Basically, your lifetimes have to be the same as your scopes, which are in a simple tree structure only.)
I don't think python ever did pure mark-and-sweep ( cpython, at least, I'm sure jython and other alternate implementations have ).
My understanding was that they did pure reference counting, and kludged on a sweep GC to do cycle breaking eventually, as manually breaking cycles in early versions of python was a pain point. A quick lookup seems to indicate python1 was pure reference counting, and they added the cycle breaking when they released python2.
Yes, if you forget to mark a ref-counted pointer as weak, then you may get a memory leak. Memory leaks can be disastrous, but they can be benign, and they're always better than use-after-free.
In C++ it is easy to create a raw pointer (with & or *). It's unsafe. Soon enough, you have some lambda inside another function, but the lambda is executed after the enclosing function returns, and you've captured a variable with &. Oops. You thought that the lambda would get executed during the enclosing function's execution, but you misread the API you were using.
IMO the big productivity gain is being able to write code where I don't have to think too hard about whether the code is memory-safe. Modern C++ code makes this easier, but languages with ARC (like Objective-C or Swift) make this even easier. Code is mostly safe by default, and you can visually inspect code to look for unsafe behavior, more easily than you can with C++.
There are also hybrid ref-counted + tracing GC options. CPython uses this approach.
For example they make certain kinds of caches easier.
Apart from performance (latency vs throughput) considerations, the only difference between RC and GC for your algorithms and datastructure design is whether you allow circles in your datastructures.
Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.
Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...
One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.
However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.
Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time doing book keeping for resources rather than running your code counts as GC time.
One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.
Another is that is suits the job much better, because not every AMM technique works by producing garbage then collecting it, you know.
And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.
GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.
RC determines an object has become garbage when its reference count goes to zero, and then collects it.
That difference also is one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, and RC G. Even if GC spends more time per object visited than RC, it still can come out faster if L ≪ G.
That also means that GC gets faster if you give it more memory so that it runs with a smaller L/G ratio (the large difference in speed in modern memory cache hierarchies makes that not quite true, but I think it still holds, ballpark)
It overlooks the difference in how likely this can occur (without large enough object graphs freed from the top it may never be an issue), when this occurs (any time vs on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).
RC with borrow checking can avoid a lot of refcount increments.
Tracking GC typically needs write barriers, so it’s not free either.
I fail to see how would it be deterministic in a highly dynamic program. Like, imagine a game for example where the user can drag'n'drop different things to a "parent" object. Observability is imo an entirely different axis.
> RC with borrow checking can avoid a lot of refcount increments.
That's the same thing as escape analysis - with language support many many objects could be effectively "removed" from the guidance of the GC, decreasing load and greatly improving performance. It is a language-level feature, not inherent in the form of GC we do (RC vs tracing)
Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.
Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends to pointer-chasing, among the slowest of operations on modern systems.
Typical GC passes consist of little else but pointer chasing.
But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.
Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.
Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.
If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores', so it is hard to isolate that activity to one core without wastefully idling those others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.
Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about perforce becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.
This is one of the biggest misconception about RC. You need not to increase the refcount just to read the referred data because you already have a reference whose count has been increased when handed down to you. That’s a semantic that’s very well carried off by Rust’s Arc type: the count is inc’ when the Arc is cloned, and dec’ when the cloned Arc is dropped. But you can still get a regular ref to the data since the compiler will be able to enforce locally the borrow, ownership and lifetime rules.
For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.
There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.
Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.
Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.
One of the problems with this is that reference counting touches all the memory right before you're done with it.
> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.
This is not a typical solution.
Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.
Garbage collection does not need to be used to collect all objects.
> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.
When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."
Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".
All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.
> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.
But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.
This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.
But this is what happens in a modern state-of-the-art tracing GC implementation, isn't it?
You have to do this if you want deterministic deallocation, because your holding a read-only reference to that object might be exactly what keeps it around for longer. So you need to track that.
(Deterministic deallocation also means having to recursively free unreachable objects. That's often described as an arbitrary "pause" behavior in RC systems, but it's actually inherent in the requirement for deterministic behavior. If you don't care about determinism for some class of objects, you can amortize that pause by sending them to a separate cleanup thread.)
My own read on this is that it blurs the line with deferred collection/counting, because you could either use it to complement deferral making it cheaper, or avoid deferral because you're getting enough of the benefits of deferral by proving objects dead instead of discovering that they are dead.
The issue with GC is it is a fluid implementation detail that is often necessary to understand deeply.
This lead to one of the more entertaining C# memory leak stories, where Princeton's entry to the DARPA Grand Challenge ended up failing because every frame they detected obstacles, created a class for each, and subscribed each obstacle object to an event. They missed that the event subscription was keeping those objects alive, and every piece of tumbleweed in the desert helped leak memory until the car just stopped! https://www.codeproject.com/Articles/21253/If-Only-We-d-Used...
The codebase I work with has had many pathological crashes due to this behavior.
So basically in C# when you use += to subscribe to events, in a big system where lifetimes of objects are independent of each other, you're back to a C/C++ mindset where you should check you have a -= call for the subscribed object when the subscribing object is about to run out of scope. Else you get random crashes, when you get events delivered to an object that should have been dead.
This is one of the reasons I don't like "event" and += in C#. It's a leaky abstraction, like you said.
There's WeakEventManager [0] but that's available only in "classic" dotnet framework (and in "new" dotnet but only if you're targeting Windows) since it lives in the WPF namespace. It can be used outside of it, but you still take a dependency on System.Windows.
There are some other bespoke solutions too.
There's an open issue on the dotnet repo to add a weak event manager to the standard libs [1]. It's very well worth reading through it, it also has links to the other bespoke solutions available.
[0] https://docs.microsoft.com/en-us/dotnet/api/system.windows.w...
If you're used to objects being destructed when they go out of scope ala C++ then yeah adapting to the lifecycle of objects in Java/C# takes some doing. But I think there's benefit to be had.
Which pauses you are meaning?
Reference counting is not free, but there are no long pauses (long compare to GC, e. g. in JVM under certain workloads you can get 100ms pauses).
Nim switched from GC to RC and it even increased performance.
Benchmarks show they are within 30 percent of each other: https://www.techspot.com/images2/news/bigimage/2021/03/2021-...
When I can easily replace the deallocator (thus excluding most non-RC production GCs), I can (re)write the code to avoid GC pauses (e.g. by amortizing deallocation, destructors, etc. over several frames - perhaps in a way that returning ownership of some type and its allocations to the type's originating thread, and thus reducing contention while I'm at it!) I have done this a few times. By "coincidence", garbage generation storms causing noticable delays are suprisingly uncommon IME.
As programs scale up and consume more memory, "live data" outscales "garbage" - clever generational optimizations aside, I'd argue the former gets expensive more quickly, and is harder to mitigate.
It's also been my experience that tracing or profiling any 'ole bog standard refcounted system to find performance problems is way more easy and straightforward than dealing with the utter vulgarity of deferred, ambiguously scheduled, likely on a different thread, frequently opaque garbage collection - as found in non-refcounted garbage collection systems.
So, at best, you're technically correct here - which, to be fair, is the best kind of correct. But in practice, I think it's no coincidence that refcounting systems tend to automatically and implicitly amortize their costs and avoid GC storms in just about every workload I've ever touched, and at bare minimum, reference counting avoids GC pauses... in the code I've written... by allowing me easier opportunities to fix them when they do occur. Indirectly causal rather than directly causal.
> if you read a heap-object out of a data structure, you have to increment the object's reference count.
This isn't universal. Merely accessing and dereferencing a shared_ptr in C++ won't touch the refcount, for example - you need to copy the shared_ptr to cause that. Rust's Arc/Rc need to be clone()d to touch the refcount, and the borrow checker reduces much of the need to do such a thing defensively, "in case the heap object is removed out from under me".
Of course, it can be a problem if you bake refcounting directly into language semantics and constantly churn refcounts for basic stack variables while failing to optimize said churn away. There's a reason why many GCed languages don't use reference counting to optimize the common "no cycles" case, after all - often, someone tried it out as an obvious and low hanging "optimization", and found it was a pessimization that made overall performance worse!
And even without being baked into the language, there are of course niches where heavy manipulation of long-term storage of references will be a thing, or cases where the garbage collected version can become lock-free in a context where such things actually matter - so I'll 100% agree with you on this:
> There are no hard lines; this is about performance tradeoffs, and always will be.
It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.
It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.
Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so, it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump a 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.
The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.
Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.
RAII gets you a lot of the way there.
So reference counting works by the programmer knowing the lifetime of each object allowing them to only increment / decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs to this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.
> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?
Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extension to register/unregister memory addresses containing pointers to Python objects for the GC to see.)
Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.
Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.
Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:
Copying Collector 8.9
Reference Counting Collector 21.9
Cycle-collecting Reference Counting Collector 28.7
Mark & Sweep Collector 10.1
Mark & Sweep (separate mark bits) Collector 9.6
Optimized Copying Collector 9.0
I.e for total runtime it is not even close; tracing gc smokes ref-counting out of the water. Other metrics such as number of pauses and maximum pause times may still tip the balance in favor of ref-counting, but those are much harder to measure. Though note the abysmal runtime of the cycle-collecting ref-counter. It suggests that cycle collection could introduce the exact same pause times ref-counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently.In any case, it clearly is about trade-offs; claiming tracing gc always beats ref-counting gc or vice versa is naive.
There is a whole generation of programmers that have come to equate GC with Java's 10 second pauses or generics/typed variables with Java's implementation of them. Even the return to typed systems (Sorbel, pythons' typing, typescript) could be seen as typed languages are great, what we really hated was Java's verbose semantics.
Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.
Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.
[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html
It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.
RC is a performance compromise.
JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.
Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.
Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially bugs may only manifest if the GC runs which it might not in your test.
One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!
But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.
With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.
Moving is indeed a big pain, and I'm about to back out of it for Oil :-/
----
edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:
https://news.ycombinator.com/item?id=32283641
I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.
I've never heard of a reference counting implementation that can handle memory compaction.
Every time you update a reference count, which is every time you touch any object, you're going to have to write to that RAM, which means stealing it from any other threads using it on any other processors. If you share large trees of data between threads, traversing that tree in different threads will always end up with your threads constantly fighting with each other since there's no such thing as read only memory in reference counting.
When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage, while you are at it.
But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh
They use virtual memory machinery as the necessary indirection to implement compaction, with neither changing any pointers nor reliably distinguishing pointers from integers.
Well, that depends in how is the RC done. This is key to understand because if you can control it, the RC become cheaper.
You can see this way on
http://sblom.github.io/openj-core/iojNoun.htm
ie: If instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])` that work great.
When I worked at Facebook, which is structurally and politically incapable of building high-quality client software, I was on a small team of people tasked with making heroic technical fixes to keep the iOS app running despite literally hundreds of engineers working on the same binary incentivized to dump shoddy code to launch their product features that nobody would use as fast as possible (did you know that at one point you could order food through the Facebook app, and that a whole two digit number of people per day used this feature? Etc.)
Objective-C has ARC (automated reference counting) — every pointer is a refcounted strong reference by default unless special annotations are used. What makes it worse is that large, deep hierarchies are common, making reference cycles leaking huge amounts of memory easy to create.
For example, the view controller for a large and complicated page (referencing decoded bitmap images and other large objects) is the root of a large tree of sub-objects, some of whom want to keep a reference to the root. Now imagine the user navigates away and the reference to the view controller goes away, but nothing in the tree is deallocated due to the backlink — congratulations, you just leaked 10 MB of RAM!
It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.
It was totally egregious undefined behavior, one thread tracing the heap with no synchronization with respect to the application threads that were mutating it, but the segfaults this UB caused were so much rarer than the crashes due to OOMs that it prevented that we decided to continue running it.
With allocating as dead, you're basically turning it into a tracing collector for the young generation.
It can be quite "fast."
Very cool stuff!
If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.
I'm not saying that GC is always the best choice, but this article gets the most important argument wrong:
> 1. Updating reference counts is quite expensive. > > No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all.
Yes, it is. Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".
Additionally, every modern GC does generational collection, so for the vast majority of objects, the GC literally does "nothing at all". No matter how little work it does, a RC solution has to do O(garbage) work, while a copying GC can do O(not garbage) work.
Now, that's not to say that GC is automatically better. There are trade-offs here. It depends on the workload, the amount of garbage being created, and the ratio of read to write operations.
The article says:
> I've already stated I'm not going to do benchmarks. I am aware of two orgs who've already run extensive and far-reaching experiments on this: Apple, for use in their mobile phones, and the Python project.
I can counterpoint that anecdata: Google extensively uses Java in high-performance systems, and invented a new GC-only language (Go) as a replacement for (their uses of) Python.
The right answer is to do benchmarks. Or even better yet, don't worry about this and just write your code! Outside of a vanishingly small number of specialized use cases, by the time GC vs RC becomes relevant in any meaningful way to your performance, you've already succeeded, and now you're dealing with scaling effects.
That's not true. Go was invented with the intention of replacing C++ at Google. That didn't really work out, and in practice Go became more of a replacement of Python for some applications at Google.
Also there are some indications that Go didn't gain traction necessarily on the merits of the language itself, but more on the starpower of its authors within Google.
(I mostly agree with the rest of what you wrote.)
Atomic reference count may trigger cache flush in other CPUs/stall waiting for them to do that, so it's not so minimal indeed.
Well, ok, let's go whole hog, we're collecting garbage again, and it sucks, we get all these baby objects, let's try and optimize the GC: we can keep, I dunno, a count of references to new objects, do some allocation sinking to see if we can avoid making them, put the babies in an orphanage, hey look, RC is GC, QED.
If I needed hard-realtime, I would avoid allocation entirely.
This blog post is an answer to: "Tell me you haven't learned about cache coherence without telling me you haven't learned about cache coherence."
[citation needed]
You and the blog post are arguing opposite things, and neither of you have shown any evidence. I get that you're arguing that reference counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), which are both bad for caches. It's not a bad argument. But the counter-argument that the blog posts makes is persuasive as well: it's expensive running a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC is also "stop-the-world" as well unpredictable and jittery in a way reference counting is not.
My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.
https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
Way better than RC.
And.. often GC will be able to use area allocators, before falling back to "proper" GC allocation. Which will be a lot faster than ref counting everything.
And atomics can get very slow, I've had atmics show up regularly in the profiler.
For my project, the combination that works great so far: unbox all types, use area allocators if the compiler can guarantee the value doesn't escape, use GC for data that changes often and ref counting for data that hardly ever changes. (luckily cycles are not possible)
I have an example from early in my career where I accidentally created a memory leak in Python from a cyclic reference between a closure and a function argument
https://stackoverflow.com/questions/54726363/python-not-dele...
> 1. Updating reference counts is quite expensive.
> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all. The comparison is the ongoing garbage collection routine, which is going to be more expensive and occur unpredictably.
First off, updating the reference count invalidates the entire cache line in which the reference count lives. For naive reference counting (which I'm assuming the author is talking about since they give no indication they're familiar with anything else), this generally means invalidating the object's cache line (and with an atomic RMW, to boot, meaning you need a bus lock or LL/SC on most systems). So right away, you have created a potentially significant cacheline contention problem between readers in multiple threads, even though you didn't intend to actually write anything. RC Immix, for example, tries to mitigate this in many creative ways, including deferring the actual reference count updates and falling back to tracing for reclamation when the count gets too high (to avoid using too many bits in the header or creating too many spurious updates).
Secondly, you know what's cheaper than an atomic increment or decrement? Not doing anything at all. The vast majority of garbage in most production tracing garbage collectors (which are, with the exception of Go's, almost exclusively generational) dies young, and never needs to be updated, copied, or deallocated (so no calling destructors and no walking a tree of children, which usually involves slow pointer chasing). Even where the object itself doesn't die young, any temporary references to the object between collections don't have to do any work at all compared to just copying a raw pointer, C style. This and bump allocation (which the author also does not engage with) are the two biggest performance wins that tracing garbage collectors typically have over reference counting ones, and solutions like RC Immix must implement similar mechanisms to even become competitive. You don't even need to go into stuff like the potential benefits of compaction, or a reduction in garbage managing code on the hot path (which are more dubious and harder to show) to understand why tracing has some formidable theoretical advantages over reference counting!
But what about in practice? Surely, the overhead of having to periodically run the tracing GC negates all these benefits? Well, bluntly--no, not even close. At least, not unless you care only about GC latency to the exclusion of everything else, or are using something fancier (like deferred RC). You can't reason backwards from "Rust and C++ are generally faster than languages with tracing GCs on optimized workloads" to conclude that reference counting is better than tracing GC--Rust and C++ both go out of their way to avoid using reference counting at all wherever possible.
None of this is secret information, incidentally. It is very easy to find. The fact that the author is apparently so incurious that they never once bothered to find out why academics talk about tracing GC's performance being superior--and the fact that it was so dismissive about it!--makes me pretty doubtful that people are going to find useful insights in the rest of the article, either.
When the counter is 1, you can do anything with the object without affecting any other references.
Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.
Pretty weird argument for one of the slowest languages out there ...
Like a lot of premature optimization, it isn’t a problem until it is… but solutions aren’t unattainable.
It's nice when the runtime solves a problem you've had to solve yourself, but it also takes a bit of your fun away, even if your coworkers are relieved not to have to deal with 'clever' code anymore.
Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this. But isn't that an argument for reference counting?
Not 80%, but still annoying enough to dump it: https://discord.com/blog/why-discord-is-switching-from-go-to...
It is less boring than building up the skills to fix the plane in mid-flight.
In Python reference counting precedes tracing garbage collection.
So they didn't 'go through the hassle of implementing reference counting' after they already had tracing garbage collection. Instead they went through the hassle of implementing tracing garbage collection after they already had reference counting.
(And as you say for backwards compatibility reasons, they can't get rid of reference counting.)
https://docs.python.org/3/reference/datamodel.html#objects-v...
So, idiomatic Python does not rely on this, and uses with-statements for deterministic cleanup.
But, of course, as with any language, there's plenty of non-idiomatic Python out in the wild.
Programmer, or compiler. In latter case it is automatic reference counting, which I never heard called "manual memory management".
My anecdata indicate that Java apps are not as responsive as ObjC/Swift for the most part.
The real reason why a tracing GC was a failure in Objective-C was due to the interoperability with the underlying C semantics, where anything goes.
The implementation was never stable enough beyond toy examples.
Naturally automating the Cocoa release/retain calls made more sense, given the constraints.
In typical Apple fashion they pivoted into it, gave the algorithm a fancy name, and then in a you're holding it wrong style message, sold their plan B as the best way in the world to manage memory.
When Swift came around, having the need to easily interop with the Objective-C ecosystem naturally meant to keep the same approach, otherwise they would need the same machinery that .NET uses (RCW/CCW) to interop with COM AddRef/Release.
What Apple has is excellent marketing.
Any true GC strategy (== one that collects cycles) will fundamentally touch and allocate more memory than malloc/free, where reference counting is pretty close to malloc/free performance; it doesn’t need to touch any memory not involved. At OS scale, that’s a huge performance advantage. You can scope down the memory involved in GC using advanced, modern GC techniques, but you’re still going to be behind malloc/free in overall efficiency — cache efficiency, memory maintenance overhead, and additional memory required for bookkeeping. And reference counting will be pretty darn close to malloc/free.
[0] https://stackoverflow.com/questions/32262172/how-can-identif...
Reference counting and Garbage Collection have very clear difference: when the referenced objects are destroyed (not deallocated). In RC it happens when the count reaches zero. In GC it happens some time later. That difference is crucial for having or not having deterministic performance in your program.
Things are rarely as clear cut as we would want to believe.
Yes. In languages with destructors/finalizers called from the garbage collector, things can get very complicated. C# and Java have this problem.
Go avoids it by having scope-based "defer" rather than destructors.
They found that the Swift version spent 76% of the time doing reference counting, even slower than Go, which spent 0.5% in the garbage collector.
Also, I fail to see the advantage of RC in case of a presumably mostly immutable language - a tracing GC is even faster there due to no changes to the object graph after allocation, making a generational approach scale very well and be almost completely done in parallel.
I see great reasons for both systems being useful, but both systems also bring their own warts.
Yes, ref counting affects cache and branch prediction, but gc is a whole complete subsystem running in parallel with your main code, constantly cleaning up after you. It will always depend upon the application which will determine what's best for that application.
Some languages lean heavily one way than the other too. Scripting with ref counting would be a nightmare, as would running a garbage collector on an 8bit micro. Since the article's talking C & C++, then of course a pro ref counting stance makes sense.
Not sure if it's entirely deterministic. A variable going out of a scope can trigger deallocation of a large object graph and it's not always clear by just looking at a code what will happen (especially if objects have destructors with side effects, your object graph is highly mutable, and your code is on a hot path). A common trick is to delay deallocation to a later time, but then again you can't be sure when your destructors will be run. Another issue is cycles, if your RC system has cycle detection, your program will behave differently depending on whether a cycle formed at runtime or not.
Why? Old version of Python used ref counting only, and Python still largely relies on reference counting (but has a GC to detect cycles).
I would not be surprised to find that even a naive mark and sweep collector is faster than naive refcounting on some workloads. One obvious thing to consider is that the work is delayed, you can perform the sweeping 'as needed'. Even the marking doesn't have to run on any deterministic schedule.
The thing is that, from my naive perspective, run of the mill tracing collector algorithms are just way more advanced than your typical refcount. Most refcounting is just that - either an integer, atomic integer, or both, that gets incremented and decremented based on a number of operations applied to the underlying type. The naive approach has no delays.
Tracing GCs on the other hand, although perhaps not naive ones (could you link me info on the quickfit algorithm? I can not find anything online), might contain epochs that bump allocate in the majority of cases. That'll be particularly nice for benchmarks where allocations are likely very short lived and may actually never need to get to the mark/sweep phase. Your algorithm isn't really documented and I just really don't feel like looking at C right now.
Although naive refcounting is very common it's not the only game in town. Depending on the language you can group refcounts together - for example, imagine you have:
(assuming all fields are automatically refcounted) struct Foo { bar: Bar, baz: Baz, }
In theory, a "copy" of this type would involve 3 increments, possibly atomic increments. Each increment would also require a heap pointer dereference, and there would be no locality of those integers behind the pointers. That would be the trivial implementation.
But depending on the language you could actually flatten all of those down to 1 RC. This is language dependent, and it requires understanding how these values can be moved, referenced, etc, at compile time. You could also store all reference counts in tables associated with structures, such that you have locality when you want to read/write to multiple counters. The pointer dereference is going to be brutal so having locality there will be a nice win. I'd be curious to run your benchmarks through valgrind to see how much the refcount is just spending time on memory fetches that get invalidated in the cache immediately.
Anyway, an example of a pretty slick refcounting GC is what Pony built: https://tutorial.ponylang.io/appendices/garbage-collection.h... https://www.ponylang.io/media/papers/OGC.pdf
Pony has different types for: 1. Local, Immutable 2. Local, Mutable 3. Shared, Immutable 4. Shared, Mutable
You can read the paper where they discuss how they track local variables vs shared variables, the implementation of counter tables, etc.
So I guess to summarize:
1. The results make sense, or as much sense as anything. I'd be interested in more details on the algorithms involved and your benchmark methodology.
2. "Naive" tracing GCs are actually pretty advanced, and advanced refcount implementations are pretty scarce.
Throughput-wise, it's hard to beat naive tracing gc. The algorithms are just too simple and they don't "interfere" with "normal operations" like ref counting does. Assuming the same allocation pattern (i.e no cheating by stack allocating objects), a tracing gc would likely (again, throughput-wise) beat manual memory management too. The additional benefit tracing gives you is easy heap compaction. Thus future pointer-chasing and memory allocations will be more efficient. With ref counting, compaction is harder.
True, you could delay sweeping, but ime, marking time dominates so you don't gain much. Even with a huge heap of several gigabytes, sweeping is just a linear scan from lowest to highest address.
Quick fit is a memory allocator, see: http://www.flounder.com/memory_allocation.htm Most gcs do not keep the heap contiguous so you need it in a layer below the gc. Quick fit is the algorithm almost everyone uses and it is very good for allocating many small objects of fixed sizes (8, 16, 32, etc.). It could be swapped out with malloc/free pairs instead, at the price of some performance.
I have to disgree with naive tracing being advanced. My mark & sweep implementation is only about 50 lines and that includes comments: https://github.com/bjourne/c-examples/blob/master/libraries/... A copying collector isn't much more complicated. Neither is beyond the reach of most comp sci students. Yes, optimized multi-generational tracing collectors supporting concurrent and resumable tracing makes them very complicated. But the same is true of optimized ref counting schemes. :)
Pony looks very interesting. It looks like it is supposed to have less object churn than very dynamic languages like JavaScript which probably makes ref counting very suitable for it.
As a language, Java's not too bad. It's a bit wordy, in bad need of some syntax sugar, but it's designed to be fairly straightforward and for the most part it does its job well. I don't need a degree in language theory to get started writing it.
Java the programming style, particularly enterprise, is a horrendous over-engineered mess that schools jam down the throats of students who don't know any better. It's designed to (and fails to) enforce a common style that can be written by armies of mediocre developers plodding along inside giant enterprise codebases, so that no matter who wrote the code, some other developer in another department can figure out how to call it.
Java the JVM is a pretty nifty beast. It's made tradeoffs that means it's not always suited for every use case, but put in its element it really shines. The modern GC algos give developers options based on the program's needs. It's currently struggling to overcome some historical decisions that while good back in the old days are now holding it back.
Personally I'm very biased towards Kotlin, which gives me the benefit of the JVM without the barf that is Java. It's not the fastest-executing language out there, but for me it's a perfect balance between development speed, ecosystem of battle-tested libraries, and competitive execution speed.
Anyone who has ever shipped a C# Unity game know the pain that is the garbage collector. It’s effectively impossible to avoid frame hitches with the GC.
I’ve spent a LOT of time going way out of my way to avoid any and all garbage collects. Which somewhat defeats the purpose of using a GC-based language.
I definitely would say “GC used to be bad but now it’s good”. That tale has been spun for 30+ years at this point!
That said, it could also be a function of the same "problem" Java has in its design - Java by default boxes everything and so every memory allocation increases garbage collection pressure. Go, by using escape analysis and favoring stack allocations, doesn't have this problem and has had sub-millisecond pauses for years now.
You rarely hear people complain about Go's GC despite it being much less mature than the JVMs. But due to actual language design, Go's GC does much less work. I wouldn't say “GC used to be bad but now it’s good”, but that "the design of languages like C# and Java were too dependent on using the GC for memory allocations, and there are other GC languaged out there that utilize the GC less which can lead to greater performance"
I believe it's actually the opposite - Java has pretty simple, compact and well defined semantics. Too simple and compact for confort - a little syntatic sugar would have made the language a lot less verbose.
What’s P&L?
> Java's verbose semantics
Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
I would say that isn't that Java's semantics are that verbose, it's that the way Java is traditionally written, with every line actually 3 lines on your screen of
public function makeItalicTextBox(String actualTextIWantToBeItalic) { ItalicTextBox itb = italicTextBoxFactoryGenerator.GenerateFactory().buildItalicTextBox(actualTextIWantToBeItalic); return itb; }
I think this is actually the pernicious work of Martin's _Clean Code_ which trained a whole generation of Java coders to write nonsense one line functions with a million interior function calls like this, not anything forced by the Java language itself, but it makes for really ugly code, in my exceptionally humble but undoubtedly correct opinion.
I meant PL research as in programming language research. Not sure why my brain decided to put an & there.
>Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
`FizzBuzzEnterpriseEdition` and is a meme for a reason. That said I'm sure Java idioms written today is a lot more sane than they were around the time everyone decided to rewrite everything in Ruby. When I say Java here, I'm talking about an era of Java that probably no longer exists (the JVM also doesn't have 10 second pauses today either and is widely considered the best garbage collector ever built).
However we still don’t have something like auto/let from cpp/rust.
This pretty much nails down what I imagine is the main difference between GC and ARC: with the former you sacrifice performance for ease of use, and with the latter you improve performance by placing some additional work on the programmers.
> Secondly, you know what's cheaper... Not doing anything at all.
These techniques are on the level of resetting a stack pointer or calling `sbrk()`. Incorporating them doesn't produce more-advanced GC schemes, it just means you neglected to consider similar allowances for RC.
The line of contention is at traversing the object graph and pausing threads.
Very true. No matter how one looks at it, GC imposes a non-trivial overhead. If I remember correctly, wasn’t there a study that demonstrated an average Swift application spending 40% of its time on RC?
But at the same time hardware can be improved. Apple CPUs are an obvious example. Uncontended atomics are extremely cheap and the memory prefetcher can detect and prefetch pointer chases.
That was a new one for me. I sort of understood the reference but the Jeff Atwood post [0] where it's introduced is as relevant today as it was when this was published in 2008.
It is. It might run when a behaviour finishes running, according to their docs.
The biggest problem is that reference counting is baked into how CPython's C-modules work.
With a bit of generics and lambdas, they can be made to even be more defer like.
Where GCed languages pulled ahead for a while, particularly Java, was that they had mature concurrent memory allocation before most OSes did, so many people found on early two and four core hardware that malloc started to show up as a bottleneck.
Not sure if it's still relevant, though.
One popular physics library years ago went as far as instrumenting compiled bytecode to turn all vector/matrix allocations to fetching preallocated objects from a pool, because a simple math operation could allocate tens/hundreds of vector/matrix objects and GC was slow, but then in newer versions they removed it because Java's GC became fast enough.
When it finally gets destroyed, its Destructor method would be called. At which point the thing at the other end of the network is told that it was talking to a zombie.
Note that network connections can be expensive for the other end, so this is a horrible design. We put a lot of work in to reliably get rid of connections when not needed. But you still need the fallback of being able to correctly handle programmer oversights.
When our observer is also dead (so the pair is out of scope) it will be a dead object receiving events.
Gave me a chuckle.
That's the equivalent of an escape analysis I assume, which optimization exists for tracing GCs as well.
Also, there are hard-real time JVMs -- but hard-realtime really doesn't mean what people often believe. These systems are usually much slower - that's the price they pay for having a fix higher bound on some latency requirement.
At least, that's how most people I knew back in 2006ish were using it.
What’s common though is that hard drives were not getting faster, but network hardware was hitting its stride. Full bandwidth (port speed X port count) routers and multi NIC were recently ubiquitous.
I had just come off a carrier grade software project when memcached finally hit my radar, and we had only spec’ed 3-5 servers for the web tier. That was still enough local cache hit rate to keep us running relatively smoothly. Or at least once we got done being honorary QA members for F5. We had lots of problems on that project with data modeling and so we actually were using too much caching vs precalculation but that’s a different story.
Where is the reference count going to 0 here? Presumably the object you dragged from one parent to another parent just swapped the reference and no ref count ever went to 0 which would have triggered a free.
I'm guessing OP meant deterministic as in, "oh when I profile my code and it hits that big lag spike, this occurs in my particle system. Let's swap that to a fixed size static memory allocation instead of RC". Whereas with GC you say, "oh my program has a big lag spike. I wish I could tell why that is, but the code executing at that time may or may not have been the reason GC was collecting then". The only experience I've had trying to get accurate profiling data on memory usage in Java/C# was aggravating to say the least.
Also, most games these days (and probably since the beginning of gaming) don't allocate and free stuff randomly. They're most likely using RC on big blocks of memory that are shared among several other game objects. Or RC on several fixed size pools that get used for different components like in an ECS. Or they use RC for things like long living assets that have difficult lifetime management.
Honestly, RC and garbage collection are unnecessary in most instances. Most of the time all you need is a unique pointer because you know exactly where the ownership of the object lies. For the edge cases, RC works fine.
I guess my biggest complaint is, why rely on a magical algorithm that you have no control over? I've had so many "fun" times trying to force GC to collect at specific points. It's really nice when you can say, "at this point in the program, this ref count will go to 0 and it will free and I don't have to worry about it lagging anything". Rather than "please GC.collect(), actually do something when I call you right here and don't wait another 5 minutes and ignore my pleading".
Because that's not the type of application for which a non-deterministic GC is suited. The vast majority of applications don't need that determinism. Right tool for the job.
Escape analysis is hard, and GC is typically used to free programmers from assisting it with extra syntax or unsafety. This is why GC languages tend to use generational GC instead of having precise analysis.
Whoa! Marketing? Interop with C is a MASSIVE use-case. And iOS is A LOT faster and more responsive than android. That is, right there, total proof that can't denied.
I do both, and the speed of apple way and the simplicity of bridge the C-abi is not a joke.
My main point actually is that Apple don't pick this route for the fun of it. It HAS a need to be performant on mobiles devices (that have less luxury to get a GC that eat Ram) and that need represent A LOT OF MONEY. You can say any step about this, including making processors, are directly or indirectly related to the need.
P.D: I don't see Rc VS Gc as enemies, but as faces of the same coin. ARC is just a way to apply some Gc lessons to naive Rc. I think naive Rc is truly worse than Gc, but smart Rc is total fine and consider it the best of both in the general case...
It still take time linear in length, but dead nodes are by definition stable so it does not really matter when you free them.
fun replace_head_with_1 (_::xs) = 1::xs;
val a = [2, 3, 4];
val b = replace_head_with_1 a;
Now a and b share the tail.And a linked list is only a simple demonstration of a chain of pointers. It occurs spontaneously outside of containers, when you just write e.g. classes which have objects of other classes as their fields. In Haskell that would be records, or just an algebraic data type. Or just closures.
Yeah, I think it's an inelegant, brute-force solution to a language problem - and that we continue to throw good money after bad improving it.
We should be investing in removing the need for GC through smarter compilers and through languages that allow us to better express our intent - and our object graph. Instead, we continue to improve on a giant linked-list traversal through all of live memory to try and infer the object graph at runtime. Without sufficient information to do it efficiently. The fundamental issue is that we don't have languages that express our goals in a way that allows memory management to be elided efficiently.
That doesn't mean I have some particular affinity for reference counting, though. It has its own issues as you rightly point out. I prefer it for its determinism, nothing more.
Having studied GC, implemented GC, and used it extensively (either as a dev or someone in operations) I'd say that there's just a lot of people out there who don't understand it. That's why people come to the wrong conclusion that it's somehow "inelegant" or "brute-force", when it is definitely neither of those.
There's also a lot of people who formed their opinions on GC from, say, what the JVM was like in the 1990s.
And there's a lot of people who only have a vague idea of how GC works in theory, but no knowledge of deeper theory and no practical knowledge of how real-world GC algorithms work.
I agree. To better understand garbage collection, nothing better than implementation. This article is such a joy to read:
https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...
There is nothing to understand or not understand there, it's hard data.
I really don't think it is a language problem, for several algorithms that one may want to write, GC is a necessity as object lifetime is not decidable at compile time. And it's not just some special parallel algorithm, I believe most business problems fall into this category - that's why we have basically "hacks" like RC, shared ptr, etc in low level languages to let us fake a partial GC (not collecting cycles).
On the plus side of GC, it has been one of the biggest productivity boosters the field has seen since the first non-assembly PL.
The paper mention elsewhere in the comments, https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... "A Unifying Theory of Garbage Collection" put this in context.
For example, a generational garbage collection essentially employs a hybrid approach between tracing and reference counting.
Similarly, depending on your system's trade-offs, reference counting might be better. Eg tracing garbage collection works really great for main memory, but for filesystems we typically use reference counting to decide whether to retire an inode. (The paper explains this in more detail.)
There is nuance here. They claimed that their project is faster than a specific hand optimized project. Not faster than a theoretical peak performance c++ program.
I've run into similar situations where python reimplementation s are faster than java because python is easier to change and fixup algorithms. And %timeit in the ipython shell is way easier than the black magic involved in profiling and benchmarking java.
You also have people on rust subreddit or discourse asking for optimization help when their rust is not as fast as a Go example they wrote. Often you get buffered IO going and it's on par. But to melt faces like ripgrep and friends you often need to drop pretences and work on Vec<u8>s.
Unfair criciticm... first, Java has had a REPL for several years and you can time stuff like in Python as easily... second, profiling tools in Java are some of the best available, and are not blackmagic... quite simple to use, just attach them to the running process and hit "profile".
With that said: yes, I've also seen Java programs that run faster than the Rust or C counterpart. I suspect what OP saw falls into this category: a rare example that you take as a rule (maybe I misinterpreted the claim, I admit, but it does sound OP meant his Haskell program and, I assume, others which you write in the same style, cannot be beaten by the equivalent C++).
You're right that the java tooling is powerful. But getting numbers out of the JVM is not the black magic. The dark arts are setting up an experiment, managing the JIT warm-up, and interpreting the results. In my experience it's just hell trying to turn those numbers into a convincing argument that we have confidence in our performance. Concrete example: convincing openjdk11 to use AES instructions Vs conscrypt using AES out of the box with no warm-up...
The whole song and dance means that we just don't take it as a priority because it's a tiring sink of effort. On the other hand %timeit is so easy and remains consistent that you can use it with unit tests and offer algorithm fixes in PRs.
shared_ptr and unique_ptr have pretty significant overhead and are common practice, even for optimized codebases, so I wouldn't say it's impossible at all.
shared_ptr is expensive and easy to build leaks with, so a lot of code bases avoid it where possible. Though it's only expensive when you copy it, moving it is ~free.
The problem with that in the case of C++ is that you likely only want to leak things used in the end from the main thread, but not "recursively" - the distinction is hard to do.
Furthermore it also allocates like the pest: busy Haskell programs regularly allocate on the order of 1GB/sec. However, this works out fine because the majority of those allocations are short-lived, hence become dead quickly, hence a GC that is designed to only touch live data (like Haskell's GC!) will handle that well.
Interesting; with purity and eager evaluation, and presumably no mutually recursive bindings either, you should indeed get absence of cycles. I do wonder: does absence of mutually recursive bindings come back to bite you at some point? Or does the need for that just not arise in your domain?
Is there stuff that you feel is awkward to express in your language that would be fine in Haskell proper?
(Exception is shared pages, not used for heap memory often.)
https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=86...
From what I've been told, C++ games do another thing and just use arena allocators, which make allocations cheap (just an integer addition, provided you know a reasonable upper bound on the size of an arena, and if you don't, you may reserve a large piece of virtual memory, and commit it later in reasonably small, but also not too large chunks) and free-s even cheaper (either reset the integer, so that the arena can be reused, or a single syscall to unmap the whole arena in one go). That's a different strategy than making a lot of little allocs and frees, and then trying to minimize them by reusing objects, which is also quite hairy.
Which is a very good reason to develop an optimized GC algorithm, the domain experts can crank out code without having to optimize every single memory (de)allocation which sounds like a waste of their time.
It’s funny, people don’t usually doubt that a modern compiler can do a better optimization job than an expert but add a memory management algorithm and that’s a bridge too far.
However, a GC is a lot slower than manual memory management, which contrasts with the fact that most compiler activities are actually pretty low in overhead (now - it didn't used to be this way). Really, the only cost overhead left is the abstraction mismatch, and that is not too bad, when you compare to how bad humans are at writing assembly.
That said, this case looks like one where the C++ experts spent very little time optimizing (mostly writing business logic), and probably made a very poor choice of tools.
There is a triangle of GC performance; througput, latency (i.e. pause length), and memory overhead. Manual memory management will often be slower (in the throughput sense) than a throughput-tuned GC because:
1. Manual memory management typically precludes moving live data
2. Manual memory management often frees data as soon as it is dead
GC will often have faster allocations than manual memory management because #1 makes it possible to just use a pointer-increment for allocation. GC will often have faster freeing of data because of #2; in particular using a nursery with Cheney's algorithm makes it O(1) to free an arbitrary amount of data.
Where a throughput optimized GC falls down is in that any code that allocates may have an unpredictable amount of delay.
Also note that for video games, both typical GC and malloc/free are often too slow for per-frame data, so arena allocators are used, which sidestep #2, and allow a pointer-increment allocation without needing #1. This is specifically because there are a lot of objects with exactly the same bounds on their lifetime. Special-purpose algorithms will almost always trump general-purpose algorithms when run on the workload they are optimized for.
extern void foo(T *p); // some arbitrary function
void bar1(bool cond)
{
..
auto p = std::make_unique<T, your_deleter>();
if (cond) { return foo(p.release()); }
...
}
This requires the compiler to call your_deleter::operator() regardless of whether cond is true or false, even though it's unnecessary (and can thus be slower) in the case where cond is true. Moreover, the obvious way to avoid it is to write it "C-style": void bar2(bool cond)
{
..
auto p = new T();
if (cond) { return foo(p); }
...
your_deleter()(p);
}
which can up being faster when cond is true. But this isn't something an expert would generally want to do, as now the C++ code becomes unidiomatic, fragile, and unmaintainable.In an ideal world, though, you could have an optimizer smart enough to do that transformation automatically. C++ compilers already do that in trivial cases, but they can't do it in general. My impression is that their Haskell compiler exploits the internal knowledge of what your_deleter does (i.e. reference counting) in order to optimize the code in various ways, like optimizing out such code, consolidating refcount updates, etc. And if I understand this correctly, there's no surprise at all that it can be faster than idiomatic C++ code written even by experts.
The question for me isn't the expertise of their programmers. Perhaps in their case they genuinely do need to have lots of objects on the heap, have (say) tight loops where they (for whatever reason) nevertheless cannot avoid the heap allocations, and don't have much of a use for finalizers besides freeing memory. In which case, I'm not surprised their solution clearly delivers better results than the C++ equivalent. The question from me, instead, is how well they think that generalizes, such as to (a) well-written Haskell programs in general, (b) well-written C++ programs in general, and/or (c) other domains. It would be one thing if their solution delivers better results in Haskell than C++ for their use case; it would be another thing if they could claim their solution delivers better results in Haskell than C++ for most use cases.
It's very easy to make a reference counting scheme not stop the world. It's a bit more difficult to make a GC implementation so.
This is a non-issue.
The point is that check itself is an extra instruction (or two, rather) that would otherwise be skipped entirely.
I'm not saying this commonly makes a difference. I'm just saying this might be something that does make a difference for them in their particular use case.
Also note that I was trying to describe the general phenomenon with a simple example, but this obviously isn't limited to std::unique_ptr.
That only works when you have infinite memory (or infinite CPU resources).
> It's very easy to make a reference counting scheme not stop the world. It's a bit more difficult to make a GC implementation so.
Only if by "not stopping the world" you mean your previous suggestion (leaking unbounded amount of memory to free() everything at some later point). When your memory is bounded, you will eventually have to stop/crash once you have run out of it.
AFAIK, the best modern state of art garbage collectors have stop-the-world pauses, proportional to size of root set and/or thread count. I'd love to see an RC implementation, that does not have stop-the-world pauses at all, but that sounds as audacious as claims of perpetual motion machine.
The Linux kernel uses RCU to manage data structures that are concurrently read and updated. When an old value (after an update) is no longer needed, a thread can either:
(a) block for a while to make sure no other thread is using it using synchronize_rcu() and then free the memory (see https://www.kernel.org/doc/html/latest/core-api/kernel-api.h...)
(b) if the thread cannot block, it will use call_rcu to register a callback to free the memory at a later time (https://www.kernel.org/doc/html/latest/core-api/kernel-api.h...). That callback generally runs in some other thread to do the cleanup.
Now, moving the concepts to user space, a typical user space implementation will just launch a dedicated thread to free all memory that is no longer needed by any RCU data structures.
You could go a bit further and have multiple concurrent threads mark the references, then sweep them up in a separate thread, too. Some sort of concurrent sweep and mark reference count system.
There are some cases where allocations are made when they could have been avoided. Iterating over a dictionary creates a single IEnumerator object. Async methods, tuples, delegates, and lambda expressions also allocate memory as do literal strings. It is possible to have struct-based iterators and disposers. There are some recently added mitigations such as a ValueTask, ValueTuple, function pointers, ref structs, conversions of literals to read-only spans, that eliminate allocations.
DateTime is a value type and doesn't allocate memory. Getting the current time does not allocate memory.
With the recent additions to ref types and Span<>, C# provides a lot of type-safe ways to avoid garbage collections. You can always use pointers if need be.
Besides the runtime also does C++ since version 1.0, and any language can tap into it.
Perhaps a nitpick: memory management time, yes, but not GC time.
alloc/free is manual memory management, not garbage collection.
More nitpicking: on x86-64, SMI interrupts can cause arbitrary pauses even without any software control involved. Hard realtime on x86-64 is not possible.
Orrrrr, GC pause just means pauses caused by the GC as part of its implementation's work to manage memory.
Apropos hacks, have a look at this memory management 'thing' that doesn't move, but manages to compact: https://github.com/plasma-umass/Mesh
There appear to be two ways the terms are categorized:
1. "reference counting" and "garbage collection" are two types of automatic memory management/reclamation.
2. "reference counting" and "tracing garbage collection" are two types of garbage collection.
I think mbrodersen is using #1.
(Back in the 90s and 2000s I feel like #1 was much more prevalent. But #2 seems to have gained popularity since then. My theory is that whoever started writing the Wikipedia content for this stuff picked #2.)
Furthermore, pretty much every language has I/O, but that doesn't make the term useless.
Note that this mostly reflects automatic reference counting, like you have in Python or Swift, not so much manual reference counting like std::shared_ptr or Rc.
Java has had "var" since at least Java 11 (current version is 18, with version increasing every 6 months, so you're talking about Java from 4 years ago).
With new features every 6 months, yes, 4 years ago is an eternity in Java world these days... 4 years in the future, almost certainly Java will already have virtual threads (like Go coroutines), full support for pattern matching on records (ADTs like OCaml), true value types (Project Vallhalla) and some other stuff that is making other languages stay behind Java in many aspects.
But things like auto are more verbose semantics, not less!
And isn’t var in Java like auto in C++?
(Though I'm one of the weirdos who likes Java in it's explicitness, which is related directly to its verbosity. I like reading code where I can see what the local variable types are.)
But even then, arrays of value types were available since C# 1.0 (2001).
If you can't understand what's happening when an object gets freed, it may be a sign that your code is too tightly coupled and/or becoming spaghetti. I've found that the more graph-like my data structures become, the more inadvertent complexity I'm adding. That's the whole reason we talk about data normalization, is to avoid these types of couplings.
Do you honestly claim that you know when deallocations happen in any codebase full of conditionals depending on outside effects (user input, network, etc)?
Yes. C programs have been doing this for over 40 years now. A leak free C program has an equivalent free for every malloc, which means they know exactly when everything gets allocated and freed.
GC can throw a spanner in that by deciding that now is the time to do its thing.
The object graph can easily be different from run to run depending on user input, timing, etc. Say, in the first run, the graph has a large subgraph due to a runtime condition (a property is set to a certain value), and on the next run, the graph is more lightweight because the property was not updated. It will take different amounts of time to collect such graphs (especially if they have destructors). I don't see how GC is any different here, it also depends on input, timing, etc. It's true though that a programmer has less control of GC because the rules when it is triggered are not always 100% clear (implementation details of a particular runtime) but imho they are as deterministic as RC; RC is just a form of GC which has very simple rules which are eaiser to understand.
If you think GC pauses are measured in milliseconds, and not microseconds, then you're reading horribly outdated material. You're off by three orders of magnitude for an off-the-shelf GC with no tuning.
Even if you go for a GC that is designed to balance throughput and latency, like G1, you can still configure what pause times it should target and those can easily be ~10-20msec if you want, with the vast majority of pauses being far less than that (less than 3 msec).
"I was recently reading some blog posts of the V8 JS engine team. There is nothing to understand or not understand there"
Look, if your knowledge of GC comes from reading V8 blog posts then you're proving the grandparent's point pretty nicely. You leaped from an extremely basic and remote understanding of GC to "there's nothing to understand or not understand here".
Also yes, pauseless GC tends to have higher overheads than GC that pauses for longer. Whether that matters or not depends a lot on the use cases and actual size of the overheads. For example ZGC is pauseless but not yet generational. Generational ZGC when it launches will improve throughput significantly.
Google haven't done pauseless GC for V8. I don't know why not because they have done one for Android. ART uses a fully concurrent collector iirc. At any rate, although you can't import a JVM GC to V8, you can run JavaScript on the JVM using GraalJS and use the GCs that way. Though I don't recall off hand if Graal supports ZGC yet. There's no deep reason why it couldn't.
IDEs are an enabler for horriblyLongIdentifierNames because they enter them for you automatically without requiring you to type them on a keyboard.
And god forbid you have optional bool args!
If I need to call a monster like that repeatedly, I'm likely to make my own wrapper that's simpler anyway - so why not start out that way.
foo(a=1, b=2, c=3)
is shorter (and, I would argue, clearer) than foo().a(1).b(2).c(3)
and requires only one method declaration and one method call.Also for constructors you don't have to worry about partial or out-of-order initialization. Is the above the same as
foo().c(3).b(2).a(1) ??
What if one call involves opening a file or allocating some resource that is used by another call?One nice thing that method chaining does give you is being able to differentiate external and internal parameter names. Swift (for example) supports this with argument labels for doing things like
insert(something, into: list, at: position)
rather than having to use the variable name: insert(something, list:list, pos:position)
Of course I wish that Swift supported ObjC/Smalltalk-like syntax so you could just do insert a into: list at: position
Note SwiftUI actually uses method chaining for some reason. It's OK but I would have been fine with named parameters.But I do like Smalltalk's method chaining/cascade syntax:
robot walkForward; wave; sitDown.The borrow checker doesn’t know when will the scope end, it only knows that however it happens upholds the invariants it cares about. You might call two entirely different code path inside a method and rust will only execute the dealloc logic at the end at a non-deterministic time. But maybe we just use different terminology here.
I do see what you're trying to get at, I think, but it's also worth noting that the use of stuff like arenas and vectors to absorb the cost of the repeated reallocations goes a long way here towards making deallocation times predictable in practice (if not in theory). It is certainly the case that you can mostly reduce the deallocation overhead to ~ zero for any particular part of your Rust program without that much effort, unless you are writing an interpreter for a different language that expects GC semantics (at least, that's been my experience).
The ONLY reason to reference count is when you need GC-like behavior on a few objects, but do not want to impose a GC on all objects. It is a very valuable tool in performance code not because it is fast, but because it allows you to make other things fast. Suggesting that you should reference count the world is patently ridiculous. This is the reason for Rust's Arc and C++ shared_ptr, not that they are faster than a GC.
The blog post completely brushes aside the costs of _atomic_ counter increments and decrements, calling them "just increments." The "atomic" is the key performance problem, not the increment.
Reference counting also makes objects larger so small objects cannot fit inside a machine register.
Modern GC algorithms are very efficient. They do not need to pause the world, and they do not need to be jittery. However, someone who doesn't understand how expensive atomic ops are probably wouldn't understand this either.
Though yes reference counting does increase the size of small objects. I find the performance edge depends on the use case. In some scenarios tracing GCs can be faster than non-atomic RCs, but sometimes they're slower. Sometimes I swap between the two just to see. Generally though RCs tend to use less RAM which can be very valuable.
To be fair, Python reference counts the world. And old Python only had reference counting, they added tracing garbage collection later.)
Python is not a fast language, of course. But neither is it a ridiculous language.
Also, when you have a global interpreter lock, you don't have to do anything inside that interpreter with atomics. Reference counting would be blazing fast in most cases where it can be done without atomic ops.
1. It recommends 8 byte counters. This implies making every object in your programming language 8 bytes bigger which is pretty much a non-starter. Almost everyone that actually uses reference counting uses more like 2-4 bits.
2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).
3. The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)
4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.
Or you can do what Erlang does: Erlang has neither mutation nor laziness, so you can't create cycles. The GC also lays out object in topological order in memory, so that you can detect garbage without tracing every life object.
I just prefer to write cycle free code when I can and turn off the cycle collector. Though Nim’s cycle collector performs well and does some tricks using the RCs.
For a simple example, take a text editor (sure, you would likely allocate a much bigger buffer in practice) that allocates for each line of text an object, and adds the buffer’s pointer to a list to be freed - this freeing happens when the user closes the open text file window. While you do know that every allocation will be freed and know their relative order, you don’t know anything more specific - will the user open multiple such files first, close them in some random order, etc.
> Do you honestly claim that you know when deallocations happen in any codebase full of conditionals depending on outside effects (user input, network, etc)?
And the answer to that is yes. You even admitted that here:
> While you do know that every allocation will be freed and know their relative order
This is a lot more specific than "the GC will free this memory at some indeterminate point that it deems acceptable".
Additionally, in your specific example:
> this freeing happens when the user closes the open text file window
So you can plan for that. You can pop up a saving screen if you run tests and realize that the deallocations take a bit of time. With GC, it's luck of the draw. I'm speaking from experience.
I wrote a tool that used Roslyn to do some transpiling of a custom format in our company. It was very important that it ran fast, since this algorithm was going to be run in a time sensitive situation. And it had to free it's memory as soon as it was done using it, since it was hot swapping the DLLs and I needed to make sure I wasn't getting name collisions from DLLs that were waiting for the GC to run to fully unload the old DLLs. I tried so many different ways to tell C# GC to collect the object tree that I knew was no longer necessary, but it was seemingly impossible.
Microsoft even has a page dedicated to debugging why an assembly won't unload and it reads:
> The difficult case is when the root is a static variable or a GC handle.[0]
Now you may say that this while problem only arises because of my weird specific use case, which is true but I couldn't change my requirements since those were hard requirements from my company. This could all easily be solved if a GC allowed you to define the concept of ownership. I had references hanging around that didn't matter because the object holding the references didn't own that memory.
All that to say, yes you can know exactly when you're memory will be deallocated in a language like C. In a language like C#, your left to the whims of the GC which can be a deal breaker in lots of cases.
What really gets me too, is languages like C# end up creating DI frameworks with the "novel" concept of lifetime requirements to make sure object lifetimes are properly scoped. What the heck? It's got a GC. Why did they go through all that trouble if the GC just cleans it up for you? If I have to think about object lifetimes, I may as well switch to a language that makes that explicit rather than a language that obfuscated it as much as possible.
[0]: https://docs.microsoft.com/en-us/dotnet/standard/assembly/un...
But I don’t think that giving up the comfort/performance of a good GC is a good tradeoff in all the other cases. The same way Rust et alia can opt into some form of GC with (A)RC, GC languages can have escape hatches as well.
https://source.android.com/devices/tech/dalvik/gc-debug#art_...
It also does mixed AOT / JITC, amongst other tricks.
The saving grace for the reference counter is that this is deterministic. The pause is an exact function of the allocation / deallocation pattern which is under the programmers control. So the code can be written in such way that it avoids big delays at inopportune times.
Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:
1. Rust, in Rc<T> and Arc<T>
2. C++, in std::shared_ptr<T>
3. Objective-C, in NSObject's retainCount
4. Swift, because of Objective-C
5. Python, where reference count is Py_ssize_t
These were literally the first languages i thought of, they all use 64-bit types. I would argue since reference counting is much rarer than GC, these make up the bulk of reference counting use in the real world. "Almost everyone", huh? It's a bit rich accusing the author of being "almost incoherent" and then say this.
> Reference counting can have arbitrarily long pauses as well.
Deallocating can take arbitrarily long, but deallocation does not stop-the-world. It stops the current thread, which is a huge difference. Malloc can take arbitrarily long as well, it's not like it's wait-free.
In addition, the GC pauses in regular programming languages are frequently orders of magnitude longer than deallocation. You would have to deallocate an enormous tree of object at the root for this to be an issue. And GCs have to do that as well, in addition to their regular stop-the-world pauses. This argument is just irrelevant.
> The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)
First off, this is not why Python has a GIL, but lets leave that aside.
Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.
> you still need a GC anyway to deal with objects that get too many references.
This is just silly. In languages which has both reference counting and traditional garbage collection (e.g. Python), they do it to avoid reference cycles, not because objects get "too many references". That is a ridiculous statement.
In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:
https://instagram-engineering.com/dismissing-python-garbage-...
I think it's still an open question if reference counting is always faster than GC, but I do not believe you have the technical expertise to evaluate such a claim. Your comment is four paragraphs long and riddled with factual errors. If you want to be convincing, show data that is better than that Instagram case study.
This is actually part of why Python still has the GIL. A GILECTOMY was attempted and multithreaded atomic refcounting made things a lot slower (going up with the number of threads) and even other methods were not sufficient for performance.
And modern GCs only have to stop the current thread to check for which thread-local objects are alive. The alice ones are moved to another generation, making the space reusable for free.
And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.
Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.
If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).
> In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:
...because generations in the GC are represented as linked lists, and modifying those linked lists reduces the number of page faults, in turn because pages are shared by copy-on-write between process. That representation isn't the best idea already, and shared structure should be in the oldest generation anyway, as you don't want to collect it (as SBCL does for images, which are loaded with copy-on-write).
[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4
In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.
Whether that is actually important or not depends on the use case. Personally, I think that GC is plenty good enough for most GUI apps other than games, and allows for non-contorted modelling of said GUI (e.g. with backreferences where they make sense).
Do you mean that the user will know? Well sure, that's the pain point to avoid in this case. Anyone who has tried to quit certain versions of various browsers after a long session with many tabs, etc. will know this pain when closing a window. Server side applications can have similar issues.
Or do you mean the code will "know"? That is, the code will need to predict, at runtime, that a code path will be expensive and choose a memory release strategy based on some criteria?
Or do you mean the designer of the code will know, and avoid RC before implementing?
Honest question, I'd like to understand your perspective. Thanks.
But it does mean that C++ is not a good example here. I would argue that Rust isn't, either, since ARC is also an occasional opt-in there rather than the default.
> you know when deallocations happen in any codebase full of conditionals depending on outside effects
What I'm saying is that the author of the code, and anyone else who can read and understand it, will know that, if the user does X, Y, and Z, it'll trigger a deeply nested release of an object graph that'll cause a period of non-responsiveness visible to the user.
Whether the author will consider this acceptable or not is a different question.
Though I’ve been hacking on a UI framework written in Nim. Its ui nodes form a tree structure that only tracks parent => child. When it needs the parent info it makes a stack of parents during processing.
Essentially it moves the cycle collector / GC work to extra work during processing. But it’s cheaper since you’re already accessing the memory.
It's pretty hilarious to me that you first say "they have to move it to another generation" and then you say "it's free!" It's like saying "I paid for my dinner, and now I get to eat it for free!"
Also: what do you think `free()` does? All modern memory allocators do this trick, keeping thread-local caches. This is not an advantage of GCs when reference counting does it as well.
Almost all modern GCs are stop-the-world in at least phases, and for good reason: stop-the-world GCs are higher performance. You pay in other ways for skipping that stop.
> And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.
Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc
18.5 nanoseconds fits under no reasonable definition of "crazy expensive", not when a regular increment clocks in at 5.9 nanoseconds. And there is extremely few situations where you increment a reference count more than, like, 5 times. It's just not an issue.
This is like cargo cult programming: you've been told these things and never tested them in the real world, and you have all these preconceived notions that just don't stand up to two minutes of scrutiny.
> Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.
Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.
I'm done with this thread now, unless anybody can show me any actual data to back anything you say up. It's really tiring. I started this by saying "I don't actually know", and everyone replying to me has been so darn certain of everything they say while being outright incorrect about most things, and without any actual data to back up the rest.
Congratulations. You tested a construct meant for multicore/threading in a single threaded benchmark and then marvel at the low overhead.
Of course you will only start seeing the cost if there is actually contention to operate on the value between multiple threads running simultaniously. See.: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
Indeed the issue with measuring barriers is that measuring the barrier doesn't suffice; one wants to measure how the barrier affects the rest of execution. This entails coming up with programs that are much less trivial than repeatedly incrementing a counter.
> > Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form. > Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.
I am talking about performance exactly. Java's GC will smoke the hell out of C++'s shared pointers and Rust's (A)RC. Noone said anything about productivity/ease of usage.
And as mentioned by another commenter - your benchmark didn't take into account anything related to parallel execution, which would be the point.
You will see that the counter increment is about 2-5 cycles, a few hundred ps, and the atomic is on the order of 10 ns uncontended.
If you then introduce contention and a multi-socket setup, the atomic will slow down significantly. Only one thread can touch it at a time, so they have to take turns.