Reference count, don't garbage collect

Reference count, don't garbage collect (kevinlawler.com)

189 points by kcl 3 years ago | 406 comments

shwestrick 3 years ago |

This debate has gone round and round for decades. There are no hard lines; this is about performance tradeoffs, and always will be.

Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.

Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...

One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.

deterministic 3 years ago | |

I am the maintainer of a very high-performance JIT compiler for a Haskell-like rules programming language used by large enterprises around the world. It uses reference counting + a global optimisation step to reduce the reference count updates to an absolute minimum. The result is compiled code that runs faster than C++ code carefully hand optimised by C++ experts over a 10 year period. There are zero GC pauses, unless you claim that a C++ alloc/free call is “garbage collection”, which is not common terminology. It also (BTW) scales linearly the more cores you throw at it.

brabel 3 years ago | | |

You've gone from claiming reference-counting is faster than tracing GC to claiming it's even faster than hand optimized C++, which is quite honestly unbelievable - whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code, so that's just literally impossible. But anyway, it's a completely fruitless discussion here unless you provide data that we can look at and scrutinize. OP hasn't provided any. You haven't provided any (and I do believe you may think you're right, but I've been in the position of being very confident of something just to be proven completely wrong by giving all my data to others to scrutinize... it's disheartening but necessary to get to the bottom of what's real). It's like the V language saying it can do memory management magically and it's much faster than Rust or whatever when they don't even have a working system yet.

yakubin 3 years ago | | |

What happens in your language when a linked list is freed? Doesn't running its destructor (or its equivalent) take a linear amount of time relative to the length of the list?

pclmulqdq 3 years ago | | |

The global optimization step is often what people commonly refer to as "garbage collection." Putting it inside a framework to RC as few times as possible is pretty cool.

However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.

viraptor 3 years ago | | |

How do you collect cycles without a pause?

tsimionescu 3 years ago | | |

free() calls that have to run for a data-dependent amount of time are more or less equivalent to GC pauses (assuming a concurrent GC that doesn't need to stop the world, like Java's). The most typical example is free()-ing a linked list, which takes O(n) free() calls with a simple RC mechanism.
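To make the O(n) cascade concrete, here is a toy model (the `Node`, `decref` and `freed` names are hypothetical, and no real allocator is involved): dropping the single reference to the head of a singly linked list triggers one "free" per node, all at that moment.

```python
class Node:
    """Toy refcounted list node; not any real runtime's object layout."""
    def __init__(self, next_node=None):
        self.refcount = 0
        self.next = next_node
        if next_node is not None:
            next_node.refcount += 1  # this node holds a reference to its successor

freed = []  # each append stands in for one free() call

def decref(node):
    """Drop one reference; when a count hits zero, free the node and release
    the reference it held on its successor (the cascade)."""
    while node is not None:
        node.refcount -= 1
        if node.refcount > 0:
            return
        freed.append(node)   # one free() per node
        node = node.next     # iterative, so long lists don't overflow the stack

def build_list(n):
    head = None
    for _ in range(n):
        head = Node(head)
    head.refcount += 1  # the single external reference, e.g. a local variable
    return head

head = build_list(10_000)
decref(head)       # dropping one root frees all 10,000 nodes right here
print(len(freed))  # 10000
```

The pause is paid by whichever thread drops the last reference, and its length depends on the shape of the data, not on anything visible at the call site.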

naasking 3 years ago | | |

> There are zero GC pauses. Unless you claim that a C++ alloc/free call is “garbage collection”.

Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time spent doing bookkeeping for resources rather than running your code counts as GC time.

refulgentis 3 years ago | | |

I'm not sure what's being asserted here; could you explain more? This sounds like you're describing a non-stopping GC, and it's well understood that reference counting is garbage collection. I'm not sure how the rest applies; you're correct, it is possible to write software with just malloc and free.

Bolkan 3 years ago | | |

Pics or gtfo.

hamstergene 3 years ago | |

For a unifying term I prefer Automatic Memory Management.

One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.

Another is that it suits the job much better, because not every AMM technique works by producing garbage then collecting it, you know.

kibwen 3 years ago | | |

If you want to get even more precise, call it automatic dynamic memory management. Automatic static memory management would be something like Rust's scope-based memory reclamation via ownership.

Someone 3 years ago | | |

> not every AMM technique works by producing garbage then collecting it

And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.

GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.

RC determines an object has become garbage when its reference count goes to zero, and then collects it.

That difference also is one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, and RC G. Even if GC spends more time per object visited than RC, it still can come out faster if L ≪ G.

That also means that GC gets faster if you give it more memory so that it runs with a smaller L/G ratio (the large difference in speed in modern memory cache hierarchies makes that not quite true, but I think it still holds, ballpark).
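A minimal sketch of the L-versus-G argument, assuming a copying-style tracing collector (one kind of tracing GC; mark-sweep additionally sweeps) and a toy heap modeled as a dict from object id to the ids it points to. `copy_collect` is a hypothetical name. Only live objects are visited; the old heap, garbage included, is dropped wholesale.

```python
def copy_collect(heap, roots):
    """Copy every object reachable from `roots` into a fresh heap.
    Returns the new heap and how many objects were visited (= L)."""
    new_heap, stack, visits = {}, list(roots), 0
    while stack:
        obj = stack.pop()
        if obj in new_heap:
            continue  # already copied
        visits += 1
        new_heap[obj] = heap[obj]
        stack.extend(heap[obj])
    return new_heap, visits

# 10 live objects in a chain from root 0, plus 1,000 unreachable ones.
heap = {i: [i + 1] if i < 9 else [] for i in range(10)}
heap.update({i: [] for i in range(10, 1010)})

new_heap, visits = copy_collect(heap, roots=[0])
print(visits, len(new_heap))  # 10 10
```

The tracing pass did 10 units of work; an RC system releasing the same garbage would have performed one free per dead object, i.e. G = 1,000 here.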

omginternets 3 years ago | | |

What are these other techniques and what can a technically-literate newcomer like myself read to get acquainted?

pjmlp 3 years ago | | |

I just keep feeding them the respective CS literature.

kaba0 3 years ago | |

One great example would be a C++ program that runs fast, and then just spends 10s of seconds doing “nothing” while it deallocates shared pointers’ huge object graphs at the end. They really are two sides of the same coin, with tracing GCs being actually correct (you need cycle detection for a correct RC implementation), and having much better throughput. It’s not an accident that runtimes with thousands of dev hours are polishing their GC solutions instead of using a much more trivial RC.

hinkley 3 years ago | | |

I don't know what the current state of the art is, but at one point the answer to GC in a realtime environment was to amortize free() across malloc(). Each allocation would clear up to 10 elements from a queue of freeable memory locations. That gives a reasonably tight upper bound on worst case alloc time, and most workflows converge on a garbage-free heap. Big malloc after small free might still blow your deadlines, but big allocations after bootstrapping are generally frowned upon in realtime applications, so that's as much a social problem as a technical one.
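A minimal sketch of that scheme, with memory blocks modeled as integers. `AmortizedAllocator` and its fields are hypothetical, and the budget of 10 simply mirrors the number above. free() only enqueues; each malloc() reclaims a bounded number of queued blocks before serving the request.

```python
from collections import deque

class AmortizedAllocator:
    """free() is O(1); each malloc() pays down at most `budget` deferred frees."""

    def __init__(self, budget=10):
        self.budget = budget
        self.pending = deque()  # released by the program, not yet reclaimed
        self.free_list = []     # reclaimed and ready for reuse
        self.next_block = 0     # stand-in for fresh memory from the OS

    def free(self, block):
        self.pending.append(block)  # defer the real work

    def malloc(self):
        # Bounded work per call: this is what caps worst-case alloc time.
        for _ in range(min(self.budget, len(self.pending))):
            self.free_list.append(self.pending.popleft())
        if self.free_list:
            return self.free_list.pop()
        block, self.next_block = self.next_block, self.next_block + 1
        return block

alloc = AmortizedAllocator(budget=10)
blocks = [alloc.malloc() for _ in range(100)]
for b in blocks:
    alloc.free(b)           # a burst of 100 frees costs nothing up front
alloc.malloc()              # reclaims only 10 of them
print(len(alloc.pending))   # 90
```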

pklausler 3 years ago | | |

It's not marketed as a GC, but exit(2) is fast and effective when used as one.

im3w1l 3 years ago | | |

Fun fact: if you don't do anything important in the destructors, you can avoid that delay by intentionally leaking the memory. The OS will clean it up when the program exits, and it does a better job since it frees the pages rather than looking at your objects one by one.

pornel 3 years ago | |

Saying that both have pauses is a false equivalence to me.

It overlooks the difference in how likely this is to occur (without large enough object graphs freed from the top it may never be an issue), when it occurs (any time vs on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism lets you profile this and apply mitigations).

RC with borrow checking can avoid a lot of refcount increments.

Tracing GC typically needs write barriers, so it’s not free either.

kaba0 3 years ago | | |

> and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).

I fail to see how it would be deterministic in a highly dynamic program. Imagine, for example, a game where the user can drag'n'drop different things onto a "parent" object. Observability is imo an entirely different axis.

> RC with borrow checking can avoid a lot of refcount increments.

That's the same thing as escape analysis - with language support many, many objects could be effectively "removed" from the guidance of the GC, decreasing load and greatly improving performance. It is a language-level feature, not inherent in the form of GC we use (RC vs tracing).

im3w1l 3 years ago | |

If you use refcounted pointers for everything then you'd be better off with a proper GC. But at least in the programs I see, 99% of objects are not refcounted, and refcounting is reserved for a tiny minority of objects with especially tricky lifetimes.

ncmncm 3 years ago | | |

This is the key.

Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.

Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends to lead to pointer-chasing, among the slowest of operations on modern systems.
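A minimal sketch of the indices-instead-of-pointers layout (the `Graph` class is hypothetical). Python lists won't deliver the cache-line benefits being described, since they store pointers to boxed objects; in C++ this would be a `std::vector` of plain structs, but the pattern is the same: a node's handle is an integer, and a cycle is just a few integers with no ownership question attached.

```python
from collections import deque

class Graph:
    """Nodes live in one list; edges store integer indices, not references."""

    def __init__(self):
        self.labels = []     # node payloads, stored together
        self.adjacency = []  # adjacency[i] = indices of i's neighbours

    def add_node(self, label):
        self.labels.append(label)
        self.adjacency.append([])
        return len(self.labels) - 1  # the handle is just the index

    def add_edge(self, a, b):
        self.adjacency[a].append(b)

    def reachable(self, start):
        """Breadth-first search over indices; returns reachable labels, sorted."""
        seen, queue = {start}, deque([start])
        while queue:
            i = queue.popleft()
            for j in self.adjacency[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        return sorted(self.labels[i] for i in seen)

g = Graph()
a = g.add_node("a")
b = g.add_node("b")
c = g.add_node("c")
g.add_edge(a, b)
g.add_edge(b, c)
g.add_edge(c, a)       # a cycle, and nothing to leak
print(g.reachable(b))  # ['a', 'b', 'c']
```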

Typical GC passes consist of little else but pointer chasing.

But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.

Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.

Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.

If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores', so it is hard to isolate that activity to one core without wastefully idling those others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.

Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about performance becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.

eru 3 years ago | | |

Nowadays good compilers can handle many of these non-tricky lifetimes statically, too.

tuetuopay 3 years ago | |

> One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count.

This is one of the biggest misconceptions about RC. You need not increase the refcount just to read the referred data, because you already hold a reference whose count was increased when it was handed down to you. That’s a semantic that Rust’s Arc type captures very well: the count is inc’d when the Arc is cloned, and dec’d when the cloned Arc is dropped. But you can still get a regular ref to the data, since the compiler can enforce the borrow, ownership and lifetime rules locally.

For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
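A toy Python model of that request flow, assuming a hypothetical `Arc` class loosely mirroring Rust's (in Rust the counterparts are `Arc::clone`, dropping the handle, and `Arc::strong_count`). It is not thread-safe and only illustrates where the count is touched: once on clone, once on drop, and never on the thousand reads in between. The config keys are made up.

```python
class _ArcBox:
    """Shared allocation: the value plus its strong count."""
    def __init__(self, value):
        self.value = value
        self.strong = 1

class Arc:
    """Toy shared handle; only clone() and drop() write to the count."""
    def __init__(self, box):
        self._box = box

    @classmethod
    def new(cls, value):
        return cls(_ArcBox(value))

    def clone(self):
        self._box.strong += 1   # the write happens here...
        return Arc(self._box)

    def drop(self):
        self._box.strong -= 1   # ...and here

    def get(self):
        return self._box.value  # reading is a plain load

    def strong_count(self):
        return self._box.strong

config = Arc.new({"timeout_s": 30, "max_body_bytes": 1_048_576})

def handle_request(cfg):
    total = 0
    for _ in range(1_000):      # many reads, zero refcount traffic
        total += cfg.get()["timeout_s"]
    during = cfg.strong_count()
    cfg.drop()                  # end of the handler
    return total, during

total, during = handle_request(config.clone())  # one increment per request
print(during, config.strong_count())            # 2 1
```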

arcticbull 3 years ago | |

RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.

RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.

And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.

There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.

klodolph 3 years ago | | |

> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from time to time.

Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.

Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.

One of the problems with this is that reference counting touches all the memory right before you're done with it.

> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.

This is not a typical solution.

Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.

Garbage collection does not need to be used to collect all objects.

> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.

When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."

Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".

tsimionescu 3 years ago | | |

> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.

All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.

bitcharmer 3 years ago | | |

> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.

This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.

garethrowlands 3 years ago | | |

Pointer chasing is expensive because fetching a random location defeats locality and therefore defeats caches and the CPU's stream detection.

But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.
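To illustrate, here is a toy version of Cheney's copying algorithm, swapped in as one classic example of a compacting collector (real GCs differ in many details). The heap is a list of nodes whose `refs` are indices into that list; `compact` is a hypothetical name. Live nodes end up contiguous in breadth-first order, and garbage is never touched.

```python
def compact(heap, roots):
    """Copy live nodes into a new contiguous list in BFS order and
    rewrite their refs to the new indices."""
    forward = {}   # old index -> new index (forwarding table)
    to_space = []

    def copy(old):
        if old not in forward:
            forward[old] = len(to_space)
            node = heap[old]
            # refs still hold old indices until this node is scanned
            to_space.append({"name": node["name"], "refs": list(node["refs"])})
        return forward[old]

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):   # to_space doubles as the BFS queue
        node = to_space[scan]
        node["refs"] = [copy(r) for r in node["refs"]]
        scan += 1
    return to_space, new_roots

# Live objects scattered among garbage in the old heap.
heap = [
    {"name": "garbage1", "refs": []},
    {"name": "child_b", "refs": []},
    {"name": "garbage2", "refs": [0]},
    {"name": "root", "refs": [4, 1]},
    {"name": "child_a", "refs": []},
]
to_space, roots = compact(heap, roots=[3])
print([n["name"] for n in to_space])  # ['root', 'child_a', 'child_b']
```

After the pass, `root` and its children sit next to each other, which is the locality benefit the application then enjoys on every later traversal.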

int_19h 3 years ago | | |

Most GC implementations that I can think of know which bits in memory are pointers and which aren't, and only scan the pointers.

waterhouse 3 years ago | |

> Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.

This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
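A minimal sketch of the pause-and-resume idea, assuming objects are toy dicts with an `rc` count and a `children` list, and a hypothetical `IncrementalFreer` class. When a count hits zero the object goes on a plain worklist; `step()` frees at most `budget` objects per call, so the cascade can stop at any point and resume later, with no barrier needed because the program can no longer reach anything on the list.

```python
from collections import deque

class IncrementalFreer:
    """Frees garbage in bounded slices instead of one long cascade."""

    def __init__(self, budget):
        self.budget = budget
        self.worklist = deque()  # unreachable by the program, so no barriers
        self.freed = 0

    def release(self, obj):
        obj["rc"] -= 1
        if obj["rc"] == 0:
            self.worklist.append(obj)  # defer instead of freeing now

    def step(self):
        """Free up to `budget` objects; return how many remain queued."""
        done = 0
        while self.worklist and done < self.budget:
            obj = self.worklist.popleft()
            for child in obj["children"]:
                self.release(child)    # may enqueue more garbage
            self.freed += 1            # stand-in for the actual free()
            done += 1
        return len(self.worklist)

def make_chain(n):
    node = {"rc": 1, "children": []}   # tail, held by its predecessor
    for _ in range(n - 1):
        node = {"rc": 1, "children": [node]}
    return node                        # head's rc of 1 is the program's reference

freer = IncrementalFreer(budget=100)
freer.release(make_chain(1_000))       # the last external reference goes away
calls = 0
while freer.worklist:
    freer.step()                       # bounded work per call, e.g. once per frame
    calls += 1
print(freer.freed, calls)              # 1000 10
```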

"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.

kaba0 3 years ago | | |

> It's much more complex and expensive to do this for tracing live data

But this is what happens in a modern state-of-the-art tracing GC implementation, isn't it?

pjmlp 3 years ago | | |

And so it becomes a simple tracing GC implementation, while one keeps calling it RC to feel good.

zozbot234 3 years ago | |

You have to do this if you want deterministic deallocation, because your holding a read-only reference to that object might be exactly what keeps it around for longer. So you need to track that.

(Deterministic deallocation also means having to recursively free unreachable objects. That's often described as an arbitrary "pause" behavior in RC systems, but it's actually inherent in the requirement for deterministic behavior. If you don't care about determinism for some class of objects, you can amortize that pause by sending them to a separate cleanup thread.)
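A minimal sketch of the separate cleanup thread, using only the standard `queue` and `threading` modules; `Reaper` and its methods are hypothetical names. Handing an object off is O(1) for the caller; the cleanup order is preserved, but its timing is no longer deterministic.

```python
import queue
import threading

class Reaper:
    """Background thread that performs deferred cleanup in FIFO order."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self):
        self._q = queue.Queue()
        self.destroyed = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def defer(self, obj):
        self._q.put(obj)  # cheap for the releasing thread

    def _run(self):
        while True:
            obj = self._q.get()
            if obj is Reaper._STOP:
                return
            # Stand-in: a real system would run destructors / free() here.
            self.destroyed.append(obj)

    def shutdown(self):
        self._q.put(Reaper._STOP)
        self._thread.join()

reaper = Reaper()
for i in range(5):
    reaper.defer(f"buffer-{i}")
reaper.shutdown()
print(reaper.destroyed)  # ['buffer-0', 'buffer-1', 'buffer-2', 'buffer-3', 'buffer-4']
```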

pkolaczk 3 years ago | |

Pausing a single thread to release memory is not what is considered a "pause". Even if pausing a single thread could be a problem, you can trivially offload releasing to another thread. A pause is when all application threads get paused. Hence, reference counting does not have a problem of pauses, and tracing often does.

hinkley 3 years ago | |

What is your read on the lack of discussion of escape analysis?

My own read on this is that it blurs the line with deferred collection/counting, because you could either use it to complement deferral making it cheaper, or avoid deferral because you're getting enough of the benefits of deferral by proving objects dead instead of discovering that they are dead.

pjmlp 3 years ago | | |

Likewise, it means that on a tracing GC the heap allocation never took place and the object was allocated on the stack or, if small enough, in registers.

osigurdson 3 years ago | |

One thing that is particularly strange in C# is that objects can respond to events / delegates after they have gone out of scope. It can be quite a while before the object is actually collected - especially if it ends up on the large object (85K+) heap. This seems like an incredibly leaky abstraction to me. The whole "using" concept is a bit of an abomination as well, though it is getting better.

The issue with GC is it is a fluid implementation detail that is often necessary to understand deeply.

cpitman 3 years ago | | |

That doesn't sound quite right. When an object attaches a handler to an event, that creates a reference from the event source to the event listener. Until the handler is detached, the event listener shouldn't be collected by GC, since it is technically still alive.

This led to one of the more entertaining C# memory leak stories, where Princeton's entry to the DARPA Grand Challenge ended up failing because every frame they detected obstacles, created an object for each, and subscribed each obstacle object to an event. They missed that the event subscription was keeping those objects alive, and every piece of tumbleweed in the desert helped leak memory until the car just stopped! https://www.codeproject.com/Articles/21253/If-Only-We-d-Used...

Matthias247 3 years ago | | |

I don't get that part. If it can receive an event/delegate, it means the object needs to be referenced by another object which invokes the event. If that is the case, how would it be eligible for GC at all?

olvy0 3 years ago | | |

It's a "feature" of the language, like others said below.

The codebase I work with has had many pathological crashes due to this behavior.

So basically in C#, when you use += to subscribe to events in a big system where lifetimes of objects are independent of each other, you're back to a C/C++ mindset where you should check that you have a -= call for the subscribed object when the subscribing object is about to go out of scope. Otherwise you get random crashes when events are delivered to an object that should have been dead.

This is one of the reasons I don't like "event" and += in C#. It's a leaky abstraction, like you said.

There's WeakEventManager [0] but that's available only in "classic" dotnet framework (and in "new" dotnet but only if you're targeting Windows) since it lives in the WPF namespace. It can be used outside of it, but you still take a dependency on System.Windows.

There are some other bespoke solutions too.

There's an open issue on the dotnet repo to add a weak event manager to the standard libs [1]. It's very well worth reading through it, it also has links to the other bespoke solutions available.

[0] https://docs.microsoft.com/en-us/dotnet/api/system.windows.w...

[1] https://github.com/dotnet/runtime/issues/18645
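For comparison, a Python analogue of the strong-versus-weak subscription problem (this is not .NET's WeakEventManager; `StrongEvent`, `WeakEvent` and `Obstacle` are hypothetical names). Storing bound methods directly pins each subscriber, like C# `+=`; storing them through the standard `weakref.WeakMethod` does not. The exact count below relies on CPython freeing objects as soon as their refcount reaches zero.

```python
import weakref

class StrongEvent:
    """Like C# `+=`: the event source keeps every subscriber alive."""
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def fire(self, *args):
        for h in list(self._handlers):
            h(*args)

class WeakEvent:
    """Holds handlers via weakref.WeakMethod; dead ones are pruned on fire()."""
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(weakref.WeakMethod(handler))

    def fire(self, *args):
        alive = []
        for ref in self._handlers:
            h = ref()             # None once the subscriber is gone
            if h is not None:
                h(*args)
                alive.append(ref)
        self._handlers = alive

class Obstacle:
    live = 0                      # number of Obstacle instances alive

    def __init__(self):
        type(self).live += 1

    def __del__(self):
        type(self).live -= 1

    def on_frame(self, frame):
        pass

strong, weak = StrongEvent(), WeakEvent()
for _ in range(100):
    strong.subscribe(Obstacle().on_frame)  # leaks: the event pins each obstacle
for _ in range(100):
    weak.subscribe(Obstacle().on_frame)    # obstacle is freed immediately
print(Obstacle.live)  # 100 on CPython
```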

admax88qqq 3 years ago | | |

On the other hand such behavior can be a blessing in some situations. Maybe I do just want to hang an object off of some pubsub without having to decide the one true "owner" of the object.

If you're used to objects being destructed when they go out of scope à la C++ then yeah, adapting to the lifecycle of objects in Java/C# takes some doing. But I think there's benefit to be had.

torginus 3 years ago | | |

Considering that C# has a major role in desktop development, and interacts with platform APIs and objects a lot as a result, these kind of weird behaviors coming from conflicting ideas about object lifetimes happen a lot - it's weird they chose a GC for the language.

citrin_ru 3 years ago | |

> reference counting has pauses while tracing garbage.

Which pauses do you mean?

Reference counting is not free, but there are no long pauses (long compared to GC, e.g. in the JVM under certain workloads you can get 100ms pauses).

goodpoint 3 years ago | |

It's not always a tradeoff:

Nim switched from GC to RC and it even increased performance.

cwaffles 3 years ago | |

I disagree with the statement that modern memory architectures have much higher read bandwidth vs write bandwidth.

Benchmarks show they are within 30 percent of each other: https://www.techspot.com/images2/news/bigimage/2021/03/2021-...

https://www.anandtech.com/show/2525/5

yxhuvud 3 years ago | | |

While perhaps true, what needs to be compared here is read vs read + write, no? Just writing isn't enough. And then we are at a factor above 2, assuming no thread contention. If there is contention, it can be a lot higher.

adrianN 3 years ago | | |

While "write bandwidth" is probably not the right term, writes are more expensive because you need to update caches. If you forked you might need to copy-on-write the page before you can write to it.

MaulingMonkey 3 years ago | |

> Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true.

When I can easily replace the deallocator (thus excluding most non-RC production GCs), I can (re)write the code to avoid GC pauses (e.g. by amortizing deallocation, destructors, etc. over several frames - perhaps by returning ownership of some type and its allocations to the type's originating thread, and thus reducing contention while I'm at it!) I have done this a few times. By "coincidence", garbage generation storms causing noticeable delays are surprisingly uncommon IME.

As programs scale up and consume more memory, "live data" outscales "garbage" - clever generational optimizations aside, I'd argue the former gets expensive more quickly, and is harder to mitigate.

It's also been my experience that tracing or profiling any 'ole bog standard refcounted system to find performance problems is far easier and more straightforward than dealing with the utter vulgarity of deferred, ambiguously scheduled, likely on a different thread, frequently opaque garbage collection - as found in non-refcounted garbage collection systems.

So, at best, you're technically correct here - which, to be fair, is the best kind of correct. But in practice, I think it's no coincidence that refcounting systems tend to automatically and implicitly amortize their costs and avoid GC storms in just about every workload I've ever touched, and at bare minimum, reference counting avoids GC pauses... in the code I've written... by allowing me easier opportunities to fix them when they do occur. Indirectly causal rather than directly causal.

> if you read a heap-object out of a data structure, you have to increment the object's reference count.

This isn't universal. Merely accessing and dereferencing a shared_ptr in C++ won't touch the refcount, for example - you need to copy the shared_ptr to cause that. Rust's Arc/Rc need to be clone()d to touch the refcount, and the borrow checker reduces much of the need to do such a thing defensively, "in case the heap object is removed out from under me".

Of course, it can be a problem if you bake refcounting directly into language semantics and constantly churn refcounts for basic stack variables while failing to optimize said churn away. There's a reason why many GCed languages don't use reference counting to optimize the common "no cycles" case, after all - often, someone tried it out as an obvious and low hanging "optimization", and found it was a pessimization that made overall performance worse!

And even without being baked into the language, there are of course niches where heavy manipulation of long-term storage of references will be a thing, or cases where the garbage collected version can become lock-free in a context where such things actually matter - so I'll 100% agree with you on this:

> There are no hard lines; this is about performance tradeoffs, and always will be.

jerf 3 years ago |

From what I can see, the myth that needs to be debunked isn't that garbage collection is super fast and easy with no consequences, it's the myth that garbage collection always automatically means your program is going to be spending 80% of its time doing it and freezing for a second every five seconds the instant you use a language with garbage collection. I see far more "I'm writing a web server that's going to handle six requests every hour but I'm afraid the garbage collector is going to trash my performance" than people who believe it's magically free.

It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.

It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.

Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.

flohofwoe 3 years ago |

Such a claim really needs hard data to back it up. Reference counting can be very expensive, especially if the refcount update is an atomic operation. It's hard to capture in profiling tools because the performance overhead is smeared all over the code base instead of centralized in a few hot spots, so most of the time you don't actually know how much performance you're losing because of refcounting overhead.

The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.

okennedy 3 years ago | |

This. Exactly this.

Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.

arcticbull 3 years ago | |

> The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.

RAII gets you a lot of the way there.

jsnell 3 years ago |

> Basically, you attach the reference to the object graph once, and then free it when you're done with it.

So reference counting works by the programmer knowing the lifetime of each object, allowing them to only increment / decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.

> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?

Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extensions to register/unregister memory addresses containing pointers to Python objects for the GC to see.)

Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.

yyyk 3 years ago |

Reference counting is garbage collection, just a different strategy - and all these strategies tend to blur into the same methods, eventually ending up as either a latency-optimized GC or a throughput-optimized GC.

Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.

smasher164 3 years ago |

For how strongly worded this article is, you'd think the author would provide some substance in their reasoning. Reference counting, even atomic, is quite expensive. Not only because it can invalidate the cache line, but depending on the architecture (looking at you x86), the memory model will deter reordering of instructions. On top of this, reference counting has a cascading effect, where one destructor causes another destructor to run, and so on. This chain of destructor calls is more or less comparable to a GC pause.
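A minimal sketch of that cascade in Python, assuming CPython (which frees an object as soon as its count hits zero; other implementations may not): dropping the only reference to the head of a chain finalizes every node inside the single `del` statement. `Node` and `build_chain` are illustrative names, not from the comment.

```python
class Node:
    """One link in a singly linked chain; `freed` records finalization order."""

    freed = []

    def __init__(self, ident, next_node=None):
        self.ident = ident
        self.next = next_node

    def __del__(self):
        # Runs as soon as this node's reference count drops to zero.
        Node.freed.append(self.ident)


def build_chain(length):
    """Build nodes 0 -> 1 -> ... -> length-1 and return the head."""
    head = None
    for ident in reversed(range(length)):
        head = Node(ident, head)
    return head


head = build_chain(5)
del head           # a single decrement at the root...
print(Node.freed)  # ...finalizes the whole chain: [0, 1, 2, 3, 4]
```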

bjourne 3 years ago |

Time to toot my own horn. I made a project comparing different types of garbage collectors (I still prefer the original terminology; both ref-counting and tracing garbage collection collect garbage, so they are both garbage collectors) a few years ago: https://github.com/bjourne/c-examples

Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:

    Copying Collector                                8.9
    Reference Counting Collector                    21.9
    Cycle-collecting Reference Counting Collector   28.7
    Mark & Sweep Collector                          10.1
    Mark & Sweep (separate mark bits) Collector      9.6
    Optimized Copying Collector                      9.0

I.e., for total runtime it is not even close; tracing gc smokes ref-counting out of the water. Other metrics such as number of pauses and maximum pause times may still tip the balance in favor of ref-counting, but those are much harder to measure. Note, though, the abysmal runtime of the cycle-collecting ref-counter. It suggests that cycle collection could introduce the exact same pause times ref-counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently.

In any case, it clearly is about trade-offs; claiming tracing gc always beats ref-counting gc or vice versa is naive.

nemothekid 3 years ago |

It's my theory that Java, unintentionally, did a lot of damage to PL research. I write a lot of Rust, and while the borrow checker is great, I've come to really admire the work that was put into the Go GC even if it's not as fast as Java's.

There is a whole generation of programmers that have come to equate GC with Java's 10-second pauses, or generics/typed variables with Java's implementation of them. Even the return to typed systems (Sorbet, Python's typing, TypeScript) could be seen as a recognition that typed languages are great and that what we really hated was Java's verbose semantics.

slavboj 3 years ago |

People have been managing garbage collection schedules for decades now. It's quite possible for many systems to have completely deterministic performance, with the allocation/deallocation performance made extremely fast, gc restricted to certain times or a known constant overhead, etc. Ironically, from a programming perspective it's incredibly easy in a language like Java to see exactly what allocates and bound those cases.

Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.

Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.

imtringued 3 years ago |

Not only does the author ignore the huge progress in conventional garbage collected languages like Java, he also dismisses GC as inherently flawed despite the fact that the common strategy of only having one heap per application has nothing to do with garbage collection. In Pony each actor has its own isolated heap which means the garbage collector will only interrupt a tiny portion of the program for a much shorter period of time. Hence the concept of a stop the world pause is orthogonal to whether you have a GC or not. One could build a stop the world pause into an RC system through cycle detection if desired.

Waterluvian 3 years ago | |

I’m out of my league so this may be dumb, but does any language or VM or whatnot have a combined system where each thread has its own heap, and you can talk by passing messages, but they also have a common heap for larger data that’s too expensive to pass around, but at the cost that you have to be much more careful with lifetimes or have to manage it manually or something?

tkhattra 3 years ago | | |

In the Erlang VM, each Erlang "process" has its own garbage collected heap [1]. There's also an ETS module for more efficiently storing and sharing large amounts of data, but data stored in ETS is not automatically garbage collected [2].

[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html

jefftk 3 years ago | | |

If you squint at it right you could say C works that way: you have many processes each with their own heap and they can pass messages, but if they want larger data that's too expensive to pass around you can use shared memory.

Jweb_Guru 3 years ago | | |

I believe Nim works this way, among others.

jeffmurphy 3 years ago | | |

I believe Erlang works this way.

Bolkan 3 years ago | | |

Dart

viktorcode 3 years ago | |

> he also dismisses GC as inherently flawed

It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.

RC is a performance compromise.

ridiculous_fish 3 years ago |

There's a lot of discussion of comparative performance, but most software isn't performance sensitive so it just doesn't matter. But there's another major facet: FFIs. The choice of memory management has huge implications for how you structure your FFI.

JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. V8 is different: it uses a moving collector, so references to heap objects are held behind a double indirection that lets the GC move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.

Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.

Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially since bugs may only manifest if the GC runs, which it might not in your tests.

One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!

chubot 3 years ago | |

I agree that the API and interop with C/C++ is a huge issue and something I haven't seen good articles on.

But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.

With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.

Moving is indeed a big pain, and I'm about to back out of it for Oil :-/

----

edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:

https://news.ycombinator.com/item?id=32283641

I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.

kaba0 3 years ago | |

It’s a good point that it is a topic that doesn’t get enough coverage, but let’s just add that it has good solutions for most use cases: GCs can pin objects that might be used from some other language.

knome 3 years ago |

Reference counting can also have unpredictable hits if you release any large data structures. Whoever drops the last reference suddenly gets to sit through the entire deep set of items to release (unless you can hand off the release cascade to a background thread).

I've never heard of a reference counting implementation that can handle memory compaction.

Every time you update a reference count, which is every time you touch any object, you're going to have to write to that RAM, which means stealing it from any other threads using it on any other processors. If you share large trees of data between threads, traversing that tree in different threads will always end up with your threads constantly fighting with each other since there's no such thing as read only memory in reference counting.

When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this is just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
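One common answer is to release iteratively rather than recursively. A minimal sketch, using a hypothetical toy object model (`RcObject`, `release`, `make_list`) rather than any real runtime: pending decrements go on an explicit worklist, so freeing a 100,000-node chain uses heap memory instead of call-stack frames. (CPython's own deallocator uses a related "trashcan" mechanism to bound recursion when freeing deeply nested containers.)

```python
class RcObject:
    """Toy object with a manual reference count (hypothetical model)."""

    def __init__(self, children=()):
        self.count = 1  # the creator holds the first reference
        self.children = list(children)  # references this object owns
        self.freed = False


def release(obj):
    """Drop one reference to obj and free whatever becomes garbage.

    Pending decrements go on an explicit worklist instead of the call
    stack, so a long chain cannot overflow it. Returns the number freed.
    """
    pending = [obj]
    freed = 0
    while pending:
        current = pending.pop()
        current.count -= 1
        if current.count == 0:
            current.freed = True  # stand-in for handing memory back to malloc
            freed += 1
            pending.extend(current.children)  # each owned child loses one ref
            current.children = []
    return freed


def make_list(length):
    """Build a singly linked chain where each node owns the next one."""
    head = tail = RcObject()
    for _ in range(length - 1):
        node = RcObject()  # its initial reference is handed to the parent
        tail.children.append(node)
        tail = node
    return head


print(release(make_list(100_000)))  # 100000, with no recursion at all
```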

eru 3 years ago | |

> I've never heard of a reference counting implementation that can handle memory compaction.

It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage, while you are at it.

But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh

They use virtual memory machinery as the necessary indirection to implement compaction, with neither changing any pointers nor reliably distinguishing pointers from integers.

mamcx 3 years ago | |

> hits if you release any large data structures.

Well, that depends on how the RC is done. This is key to understand, because if you can control it, RC becomes cheaper.

You can see this approach here:

http://sblom.github.io/openj-core/iojNoun.htm

i.e., if instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])`, that works great.

kaba0 3 years ago | | |

How is that not the exact same for tracing GC?

umanwizard 3 years ago |

Case 3.: You don’t “want” the constraints of Case 2, but are in practice forced into them due to a huge, poorly-educated developer base being incapable of writing correct refcounted code or even knowing what a weak pointer is.

When I worked at Facebook, which is structurally and politically incapable of building high-quality client software, I was on a small team of people tasked with making heroic technical fixes to keep the iOS app running despite literally hundreds of engineers working on the same binary incentivized to dump shoddy code to launch their product features that nobody would use as fast as possible (did you know that at one point you could order food through the Facebook app, and that a whole two digit number of people per day used this feature? Etc.)

Objective-C has ARC (automated reference counting) — every pointer is a refcounted strong reference by default unless special annotations are used. What makes it worse is that large, deep hierarchies are common, making reference cycles leaking huge amounts of memory easy to create.

For example, the view controller for a large and complicated page (referencing decoded bitmap images and other large objects) is the root of a large tree of sub-objects, some of whom want to keep a reference to the root. Now imagine the user navigates away and the reference to the view controller goes away, but nothing in the tree is deallocated due to the backlink — congratulations, you just leaked 10 MB of RAM!

It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
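For illustration, here is the weak back-link pattern sketched in Python, whose `weakref` module plays roughly the role of Objective-C's `weak` annotation; `ViewController` and `ChildView` are hypothetical names echoing the example above, and the sketch assumes CPython's reference counting. With the cycle collector switched off, dropping the controller still frees the whole tree, because the children's back-references are weak:

```python
import gc
import weakref


class ViewController:
    """Root of a tree of sub-objects (names are illustrative)."""

    def __init__(self):
        self.children = [ChildView(self) for _ in range(3)]


class ChildView:
    def __init__(self, controller):
        # A strong back-reference here would form a cycle that plain
        # refcounting never frees; a weak one does not keep the root alive.
        self._controller = weakref.ref(controller)

    @property
    def controller(self):
        return self._controller()  # None once the controller is gone


gc.disable()  # show that refcounting alone frees the tree
try:
    vc = ViewController()
    probe = weakref.ref(vc)
    child = weakref.ref(vc.children[0])
    del vc  # the last strong reference to the root goes away
finally:
    gc.enable()

print(probe() is None, child() is None)  # True True
```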

To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.

It was totally egregious undefined behavior, one thread tracing the heap with no synchronization with respect to the application threads that were mutating it, but the segfaults this UB caused were so much rarer than the crashes due to OOMs that it prevented that we decided to continue running it.

carry_bit 3 years ago |

You can optimize reference counting: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23...

With allocating as dead, you're basically turning it into a tracing collector for the young generation.

hayley-patton 3 years ago | |

https://users.cecs.anu.edu.au/~steveb/pubs/papers/lxr-pldi-2... is the most recent publication in this lineage of high-performance RC systems.

agentultra 3 years ago |

A paper I quite enjoyed on automatic reference counting for pure, immutable functional programming: https://arxiv.org/abs/1908.05647

It can be quite "fast."

miloignis 3 years ago | |

Indeed, and this line of research has been continuing to improve in follow on work on Perceus in Koka: https://xnning.github.io/papers/perceus.pdf and https://www.microsoft.com/en-us/research/uploads/prod/2021/1...

Very cool stuff!

eru 3 years ago | |

If you have pure, immutable and strict, you can't create cycles. (That's what Erlang does for example.) That makes a lot of memory management techniques much simpler. Both tracing garbage collection and reference counting.

If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.

dpryden 3 years ago |

This article is naive to the point of being flat-out wrong, since it makes extremely naive assumptions about how a garbage collector works. This is basically another C++-centric programmer saying that smart pointers work better than the Boehm GC -- which is completely true but also completely misleading.

I'm not saying that GC is always the best choice, but this article gets the most important argument wrong:

> 1. Updating reference counts is quite expensive.

> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all.

Yes, it is. Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".

Additionally, every modern GC does generational collection, so for the vast majority of objects, the GC literally does "nothing at all". No matter how little work it does, an RC solution has to do O(garbage) work, while a copying GC can do O(not garbage) work.
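To make the O(live) point concrete, a toy sketch of a Cheney-style semispace copying collector in Python (`Obj` and `cheney_collect` are hypothetical; real collectors work on raw memory, not Python objects). It only visits objects reachable from the roots; the unreachable ones are reclaimed wholesale by abandoning the old space, without ever being touched:

```python
class Obj:
    """A heap object: outgoing pointers plus a forwarding slot used during GC."""

    def __init__(self, *fields):
        self.fields = list(fields)
        self.forward = None


def cheney_collect(roots):
    """Copy everything reachable from roots into a fresh to-space.

    Work is proportional to the live objects copied; garbage left in
    from-space is never visited.
    """
    to_space = []

    def copy(obj):
        if obj.forward is None:  # not yet evacuated
            clone = Obj(*obj.fields)  # fields still point into from-space
            obj.forward = clone
            to_space.append(clone)
        return obj.forward

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):  # to_space doubles as the BFS queue
        obj = to_space[scan]
        obj.fields = [copy(f) for f in obj.fields]
        scan += 1
    return new_roots, to_space


# 10 live objects forming a cycle, plus 1,000 unreachable ones.
live = [Obj() for _ in range(10)]
for a, b in zip(live, live[1:] + live[:1]):
    a.fields.append(b)
garbage = [Obj() for _ in range(1_000)]

roots, to_space = cheney_collect([live[0]])
print(len(to_space))  # 10: only the live objects were touched
```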

Now, that's not to say that GC is automatically better. There are trade-offs here. It depends on the workload, the amount of garbage being created, and the ratio of read to write operations.

The article says:

> I've already stated I'm not going to do benchmarks. I am aware of two orgs who've already run extensive and far-reaching experiments on this: Apple, for use in their mobile phones, and the Python project.

I can counter that anecdata: Google extensively uses Java in high-performance systems, and invented a new GC-only language (Go) as a replacement for (their uses of) Python.

The right answer is to do benchmarks. Or even better yet, don't worry about this and just write your code! Outside of a vanishingly small number of specialized use cases, by the time GC vs RC becomes relevant in any meaningful way to your performance, you've already succeeded, and now you're dealing with scaling effects.

eru 3 years ago | |

> [...] and invented a new GC-only language (Go) as a replacement for (their uses of) Python.

That's not true. Go was invented with the intention of replacing C++ at Google. That didn't really work out, and in practice Go became more of a replacement of Python for some applications at Google.

Also there are some indications that Go didn't gain traction necessarily on the merits of the language itself, but more on the starpower of its authors within Google.

(I mostly agree with the rest of what you wrote.)

kgeist 3 years ago | |

>Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".

An atomic reference count update may force a cache flush in other CPUs, or stall waiting for them to do so, so it's not so minimal indeed.

samatman 3 years ago |

Definitely use reference counting, it's better! Now you're avoiding cyclic data structures and it sucks, so maybe just for a few objects we'll put them on a linked list, maybe mark them, definitely sweep from time to time to see if anything is unreachable. I'm told there's an algorithm by Boehm.

Well, ok, let's go whole hog, we're collecting garbage again, and it sucks, we get all these baby objects, let's try and optimize the GC: we can keep, I dunno, a count of references to new objects, do some allocation sinking to see if we can avoid making them, put the babies in an orphanage, hey look, RC is GC, QED.

habibur 3 years ago |

It's interesting how we have come full circle from "Reference counting is the worst of two worlds [manual and GC] and will always be slower" to now "Well, we all know it's actually faster." in like 10 years.

flohofwoe 3 years ago | |

Except that "refcounting is faster than a GC" is mostly a myth, both are equally bad if predictable performance matters.

klodolph 3 years ago | | |

Looking at metrics from Go's garbage collector, what else do you want? GC pauses are damn low, and you'll see numbers in the sub-500μs range.

If I needed hard-realtime, I would avoid allocation entirely.

pclmulqdq 3 years ago | |

It's actually usually slower than both manual memory management and GC. It's only coming back now because people are finally learning how to make memory allocations large and rare.

This blog post is an answer to: "Tell me you haven't learned about cache coherence without telling me you haven't learned about cache coherence."

OskarS 3 years ago | | |

> It's actually usually slower than both manual memory management and GC

[citation needed]

You and the blog post are arguing opposite things, and neither of you has shown any evidence. I get that you're arguing that reference counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), which are both bad for caches. It's not a bad argument. But the counter-argument that the blog post makes is persuasive as well: it's expensive running a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC is also "stop-the-world" as well as unpredictable and jittery in a way reference counting is not.

My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.

rwmj 3 years ago | | |

Hiring would certainly be a lot easier if more people were to make bold, completely wrong blog postings like these. I could immediately give my negative recommendation without the time and hassle of a phone interview.

hinkley 3 years ago | | |

Has anyone done a good paper on how memory bank affinity for processors affects these costs?

spullara 3 years ago |

This is a modern GC:

https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html

Way better than RC.

viktorcode 3 years ago | |

See that last graph with memory overhead? Not everyone's definition of "better" allows for that.

danybittel 3 years ago |

He fails to mention that Apple added support for ref counting in silicon.

And.. often GC will be able to use arena allocators, before falling back to "proper" GC allocation. Which will be a lot faster than ref counting everything.

And atomics can get very slow; I've had atomics show up regularly in the profiler.

For my project, the combination that works great so far: unbox all types, use arena allocators if the compiler can guarantee the value doesn't escape, use GC for data that changes often and ref counting for data that hardly ever changes (luckily cycles are not possible).
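A minimal sketch of the bump-pointer arena idea, assuming a toy `Arena` class over a `bytearray` (hypothetical, not a real allocator). Allocation is a bounds check plus an addition, and everything is freed at once by resetting the pointer, which is also why the compiler must prove that nothing allocated there escapes:

```python
class Arena:
    """Toy bump-pointer arena over a fixed buffer (illustrative only)."""

    def __init__(self, capacity):
        self.buffer = bytearray(capacity)
        self.top = 0  # offset of the next free byte

    def alloc(self, size):
        """Return a writable view of `size` fresh bytes, or raise MemoryError."""
        if self.top + size > len(self.buffer):
            raise MemoryError("arena exhausted")
        start = self.top
        self.top += size  # allocation is just a bounds check and an add
        return memoryview(self.buffer)[start:start + size]

    def reset(self):
        """Free every allocation at once; no per-object bookkeeping."""
        self.top = 0


arena = Arena(64)
a = arena.alloc(16)
b = arena.alloc(16)
a[0] = 7
print(arena.top)  # 32
arena.reset()     # a and b now dangle: later allocations reuse their bytes
print(arena.top)  # 0
```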

assbuttbuttass 3 years ago |

In practice, I don't see any reference counting approaches that do cycle detection.

I have an example from early in my career where I accidentally created a memory leak in Python from a cyclic reference between a closure and a function argument

https://stackoverflow.com/questions/54726363/python-not-dele...

malkia 3 years ago |

Clearly this person hasn't tried this on NUMA CPUs. It's quite expensive to do these atomic inc/decs there, or even without NUMA... caches must be synced and flushed because of this.

brobinson 3 years ago | |

Yeah, surprised no one is mentioning this. (A)RC is awesome... for flushing caches. :-(

Jweb_Guru 3 years ago |

I'm sorry, but this is a very poorly reasoned article that does not engage with any of the serious work that's been underway to get reference counting competitive with tracing GC. This is evident from the very first point:

> 1. Updating reference counts is quite expensive.

> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all. The comparison is the ongoing garbage collection routine, which is going to be more expensive and occur unpredictably.

First off, updating the reference count invalidates the entire cache line in which the reference count lives. For naive reference counting (which I'm assuming the author is talking about since they give no indication they're familiar with anything else), this generally means invalidating the object's cache line (and with an atomic RMW, to boot, meaning you need a bus lock or LL/SC on most systems). So right away, you have created a potentially significant cacheline contention problem between readers in multiple threads, even though you didn't intend to actually write anything. RC Immix, for example, tries to mitigate this in many creative ways, including deferring the actual reference count updates and falling back to tracing for reclamation when the count gets too high (to avoid using too many bits in the header or creating too many spurious updates).

Secondly, you know what's cheaper than an atomic increment or decrement? Not doing anything at all. The vast majority of garbage in most production tracing garbage collectors (which are, with the exception of Go's, almost exclusively generational) dies young, and never needs to be updated, copied, or deallocated (so no calling destructors and no walking a tree of children, which usually involves slow pointer chasing). Even where the object itself doesn't die young, any temporary references to the object between collections don't have to do any work at all compared to just copying a raw pointer, C style. This and bump allocation (which the author also does not engage with) are the two biggest performance wins that tracing garbage collectors typically have over reference counting ones, and solutions like RC Immix must implement similar mechanisms to even become competitive. You don't even need to go into stuff like the potential benefits of compaction, or a reduction in garbage managing code on the hot path (which are more dubious and harder to show) to understand why tracing has some formidable theoretical advantages over reference counting!

But what about in practice? Surely, the overhead of having to periodically run the tracing GC negates all these benefits? Well, bluntly--no, not even close. At least, not unless you care only about GC latency to the exclusion of everything else, or are using something fancier (like deferred RC). You can't reason backwards from "Rust and C++ are generally faster than languages with tracing GCs on optimized workloads" to conclude that reference counting is better than tracing GC--Rust and C++ both go out of their way to avoid using reference counting at all wherever possible.

None of this is secret information, incidentally. It is very easy to find. The fact that the author is apparently so incurious that they never once bothered to find out why academics talk about tracing GC's performance being superior--and the fact that they were so dismissive about it!--makes me pretty doubtful that people are going to find useful insights in the rest of the article, either.

benibela 3 years ago |

An advantage of RC is that you can also use it to verify ownership.

When the counter is 1, you can do anything with the object without affecting any other references.

Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.
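A toy Python sketch of that idea, with hypothetical `Buffer` and `CowList` classes: copying a handle only bumps a shared count, and a write copies the storage first unless the count shows the handle is the sole owner. It assumes CPython, where `__del__` runs promptly when a handle is dropped, and it copies the element list shallowly:

```python
class Buffer:
    """Shared storage with an explicit, hypothetical reference count."""

    def __init__(self, items):
        self.items = list(items)  # shallow copy of the elements
        self.count = 1


class CowList:
    """List handle that copies its storage only when it is shared."""

    def __init__(self, items=(), _buffer=None):
        self._buf = _buffer if _buffer is not None else Buffer(items)

    def copy(self):
        """Lazy copy: share the buffer and bump its count."""
        self._buf.count += 1
        return CowList(_buffer=self._buf)

    def __getitem__(self, index):
        return self._buf.items[index]

    def __setitem__(self, index, value):
        if self._buf.count > 1:  # shared: detach onto a private copy first
            self._buf.count -= 1
            self._buf = Buffer(self._buf.items)
        self._buf.items[index] = value  # count == 1: safe to mutate in place

    def __del__(self):
        self._buf.count -= 1  # this handle no longer shares the buffer


a = CowList([1, 2, 3])
b = a.copy()       # O(1): no elements copied yet
b[0] = 99          # b is shared, so it copies before writing
b[1] = 42          # b now owns its buffer, so this writes in place
print(a[0], b[0])  # 1 99
```

The count == 1 check is what licenses in-place mutation; everything else falls back to copying.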

dgan 3 years ago |

>> "The Python case is more inarguable. If GC is so good, why wouldn't Python just garbage collect everything,... ? It is because RC outperforms garbage collecting in all these standard cases"

Pretty weird argument for one of the slowest languages out there ...

cosmotic 3 years ago |

> This is about as minimal as you can get short of nothing at all.

With GC, you can do nothing at all. In a system with lots of garbage, you can do a GC by copying everything accessible from the GC root, then de-allocating all the garbage in a single free.

pizlonator 3 years ago |

Atomic inc/dec is hella expensive relative to not doing it. It’s true that CPUs optimize it, but not enough to make it free. RC as a replacement for GC means doing a lot more of this expensive operation - which the GC will do basically zero of in steady state - so this means RC just costs more. Like 2x slowdown more.

The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.

So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).

manuelabeledo 3 years ago | |

> So, GC is usually faster.

GC is way faster if there is little collection.

In memory or cache intensive applications, garbage collection as a whole can be significantly slower.

pizlonator 3 years ago | | |

GC is faster even if you collect a lot. GCs create better cache locality especially for recently allocated objects, and their cache behavior is not generally worse than malloc (but there are many GCs and many mallocs and some try harder than others to make caches happy).

The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.

Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beat malloc/free in object churn workloads, it’s likely that the total overhead of counting refs, calling free(), and using a malloc() that isn’t a GC malloc is higher than 2x, i.e. more than 50% of time spent in memory management operations (inc, dec, malloc, free).

It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.

That also implies that if you ask the GC to run with lower memory overhead, it’ll use more than 30% of your execution time. It’s true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you’re going to have a slow GC. But that’s not how most GC users run their GCs.

pjmlp 3 years ago |

Another RC advocate that misses the point about RC being a GC algorithm from CS point of view.

https://gchandbook.org/

byefruit 3 years ago |

This article provides very little evidence for its claims and seems to have only a superficial understanding of modern GCs.

"Increments and decrements happen once and at a predictable time. The GC is running all the time and traversing the universe of GC objects. Probably with bad locality, polluting the cache, etc."

This is only the case with a mark-sweep collector, usually most of your allocations die young in the nursery. With reference counting you pay the counting cost for everything.

"In object-oriented languages, where you can literally have a pointer to something, you simply mark a reference as a weak reference if it might create a cycle."

As someone who has tried to identify memory leaks in production where someone has forgotten to "simply" mark a reference in some deep object graph as weak, this is naive.

"With an 8-byte counter you will never overflow. So...you know...just expand up to 8-bytes as needed? Usually you can get by with a few bits."

So now my "about as minimal as you can get short of nothing at all" check has an unpredictable branch in it?

"If you must overflow, e.g., you cannot afford an 8-byte counter and you need to overflow a 4-byte counter with billions of references, if you can copy it, you create a shallow copy."

I don't even know where to begin with this.

"If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?"

This probably has more to do with finalising resources and deterministic destruction than anything else.

Anyone who is interested in actually studying this area would probably find https://courses.cs.washington.edu/courses/cse590p/05au/p50-b... interesting. Also https://gchandbook.org/

omginternets 3 years ago |

Whenever I read something like this, I wonder what kind of programming the author is doing. I’m getting a strong whiff of embedded and/or real-time systems.

UltraViolence 3 years ago |

Reference counting forces the developer to think about memory management.

Apple has a nice talk on ARC [1] but it got me thinking: if I have to think about reference counting this much I might just as well manage memory all by myself.

The true joy of Garbage Collection is that you can just create objects left and right and let the computer figure out when to clean them up. It's a much more natural way of doing things and lets computers do what they're best at: taking tedious tasks out of the hands of humans.

[1]: https://developer.apple.com/videos/play/wwdc2021/10216/

dfox 3 years ago |

One great advantage of garbage collection is that it removes the need for thread synchronization in cases where it is only needed to make sure that the object you are going to call free/decref on is not in use in another thread. As a corollary, GC is what you need for many lock-free data structures to be practical rather than only of theoretical interest.

It might seem that this simply pushes your synchronization problems onto the GC, but the synchronization issue that GC solves internally is different and usually more coarse-grained, so in the end you have significantly less synchronization overhead.

melony 3 years ago |

You are in for a fun time when you need to make circular data structures with ARC.

freecodyx 3 years ago |

What if programming languages started offering both? In my opinion, RC is GC in disguise. Go's GC, for example, at least has the merit of running in a separate thread (it still has to stop the world when reclaiming memory, and its memory allocator model helps it achieve great GC performance).

eru 3 years ago | |

Python does both reference counting and garbage collection.

Btw, GC is also often RC in disguise. What I mean is that generational garbage collectors are basically a hybrid of tracing GC and RC. See https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... for the details.

samsquire 3 years ago |

My understanding of Python's Global Interpreter Lock is that reference counting cannot be done efficiently across threads, so the GIL cannot be removed while Python relies on reference counting.

Java's GC is concurrent, runs at safepoints, and stops the world, so it avoids this problem.

nikolay 3 years ago |

When I wrote a Lisp interpreter in the '90s, that's how I did it and I'm ashamed to admit that I have no idea how modern GC is done - I've always assumed (naively!) that it was like Lisp's!

eru 3 years ago | |

Modern Lisps likely have modern GCs. I mean there's no reason for them not to.

Racket probably has a state-of-the-art garbage collector. (I don't actually know, but that's where I would start looking.) Clojure obviously has the same garbage collector as any other JVM language.

gus_massa 3 years ago | | |

Racket has something like 5 GCs, perhaps more.

At one extreme, you can build Racket with the Senora GC, which is conservative and non-moving and is used only for bootstrapping.

At the other extreme, both of the normal versions of Racket have custom moving incremental GCs. The docs, with some high-level explanations, are at https://docs.racket-lang.org/reference/garbagecollection.htm...

The implementation details of the main "CS" version are in https://github.com/racket/racket/blob/master/racket/src/Chez... It's a replacement for Chez Scheme's default GC with better support for some features used in Racket, but I have never looked that deeply into the details.

_8j50 3 years ago |

Pardon the ignorance but I thought refcount was a GC strategy?

viktorcode 3 years ago | |

Academia tends to call RC a form of GC. For programmers experienced in languages with manual memory management, those are very different beasts.

exabrial 3 years ago |

> I've already stated I'm not going to do benchmarks.

Yikes

glouwbug 3 years ago |

Isn’t garbage collection needed to solve circular reference counts?

arcticbull 3 years ago | |

Nope, you can just mark the back-reference as weak.

GC is only required if you as a programmer (or programming language) do not provide sufficient information to the compiler or runtime to understand the object graph.
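Here is a minimal sketch of marking the back-reference as weak, using Python's `weakref` module. The `Parent`/`Child` classes are hypothetical. The cycle collector is disabled only to show that reference counting alone frees the parent:

```python
import gc
import weakref

class Parent:  # hypothetical owner
    def __init__(self):
        self.children = []                  # forward references own the children

class Child:   # hypothetical owned object
    def __init__(self, parent):
        self._parent = weakref.ref(parent)  # back-reference does not keep the parent alive
        parent.children.append(self)

    @property
    def parent(self):
        return self._parent()               # None once the parent has been freed

gc.disable()                 # rule out the tracing cycle collector
p = Parent()
c = Child(p)
probe = weakref.ref(p)
del p                        # parent's count drops to zero and it is freed immediately
print(probe() is None, c.parent is None)   # True True
gc.enable()
```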

klodolph 3 years ago | | |

It's not always obvious which reference to mark as weak, and there's not necessarily a clear indication of which reference is a back-reference.

You can find various algorithms in journals or whatnot written with the assumption that there's GC. Algorithms designed with this assumption may not have clear ownership for objects, and those objects may have cyclic references.

It's easy to say, "objects should have clear ownership relationships", but that kind of maxim, like most maxims, doesn't really survive if you try to apply it 100% of the time. Ownership is a tool that is very often useful for managing object lifetimes; it's not always the tool that you want.

amiga1200 3 years ago |

Just allocate and then deallocate manually; use neither automatic method.

eru 3 years ago | |

What do you mean by 'manually'? Malloc and free still do lots of work.

Or do you want to manually assign memory addresses to your objects?

mirekrusin 3 years ago |

What about - garbage collect by reference counting, like Python?

eru 3 years ago | |

Python has reference counting for historical reasons, and added tracing garbage collection for dealing with cycles.
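In CPython you can tell the two mechanisms apart by disabling the automatic collector and watching weak references. Here is a sketch, with `Box` as a hypothetical class:

```python
import gc
import weakref

class Box:  # hypothetical heap object
    pass

gc.disable()   # stop the tracing collector from running on its own

acyclic = Box()
acyclic_ref = weakref.ref(acyclic)
del acyclic
freed_by_rc = acyclic_ref() is None          # freed by reference counting alone

cyclic = Box()
cyclic.me = cyclic                           # the object refers to itself
cyclic_ref = weakref.ref(cyclic)
del cyclic
leaked_until_gc = cyclic_ref() is not None   # its count never reaches zero

gc.collect()                                 # explicit collection still works while disabled
freed_by_tracing = cyclic_ref() is None      # the cycle collector reclaimed it
gc.enable()

print(freed_by_rc, leaked_until_gc, freed_by_tracing)   # True True True
```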

If you wanted performance these days, you wouldn't want to go for that architecture. It's a historical accident that they can't really free themselves from because of backwards compatibility.

jayd16 3 years ago |

Is there such a thing as a compacting RC?

hayley-patton 3 years ago | |

Backup compaction can be useful, like backup tracing can be, but you can also use all the initial increments in a coalescing RC collector to determine which pointers need to be fixed up for copying, without tracing. See, for example, http://users.cecs.anu.edu.au/~steveb/pubs/papers/rcix-oopsla... pages 8 and 9 on "Defragmentation with Opportunistic Copying".

rtfeldman 3 years ago | |

There was a great talk at Strange Loop about a drop-in malloc replacement which compacts.

Apparently it actually led to memory usage improvements in industrial projects like Redis:

https://youtu.be/c1UBJbfR-H0

eru 3 years ago | | |

See https://github.com/plasma-umass/Mesh for the code and a link to the paper.

titzer 3 years ago |

I read most of the article and it's just a lot of the same tired old arguments and an extremely simplified worldview of both GC and reference counting. I wish I had the author's address, because I'd like to mail them a copy of the Garbage Collection Handbook. They clearly have a very naive view of both garbage collection and reference counting. And there isn't a single dang measurement anywhere, so this can be completely dismissed IMHO.

cogman10 3 years ago | |

Agreed. What I particularly disliked is how absent of nuance it is. RC is a form of GC and all GC algorithms make tradeoffs. RC trades throughput for latency. Compacting mark and sweep trade latency (and usually memory) for throughput.

The rant at the end can be boiled down to "I use confirmation bias [1] to make my engineering decisions". The OP has already decided that "GC" is slow, so I'm sure every time a runtime with it misbehaves it's "Well, that darn GC, I knew it was bad!" and every time RC misbehaves it's likely "Oh, well you should have nulled out your link here to break the cycle dummy!"

I really don't like such absolutist thinking in software dev. All of software dev is about making tradeoffs. RC and GC aren't superior or inferior to each other, they are just different and either (or both) could be valid depending on the circumstance.

[1] https://en.wikipedia.org/wiki/Confirmation_bias

titzer 3 years ago | | |

> absent of nuance

Yes, this is a good point. It makes overly general claims.

E.g. a GC proponent could claim "well, tracing collectors do no work for dead objects, so they have no overhead!" Which is a good point, but not the whole story. Tracing collectors may need to repeatedly traverse live objects. Sure. But then generational collectors only traverse modified live objects that point to new objects. True. And concurrent collectors can trace using spare CPU resources, incremental collectors can break marking work up into small pauses, on and on. There are zillions of engineering tradeoffs and the GC Handbook covers most of them really well.
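Here is a toy illustration of "tracing collectors do no work for dead objects". It is a sketch of a Cheney-style copying collector, and `Obj` and `copy_collect` are hypothetical names. It visits only what is reachable from the roots, so garbage is never touched, even a garbage cycle. The sweep phase of a mark-sweep collector, by contrast, does touch dead objects.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # identity semantics, like real heap objects
class Obj:                    # hypothetical heap object
    name: str
    refs: list = field(default_factory=list)

def copy_collect(roots):
    """Toy Cheney-style copy: only objects reachable from roots are visited."""
    forwarding = {}           # id(old object) -> its copy in to-space
    to_space = []             # also serves as the breadth-first work queue

    def forward(old):
        if id(old) not in forwarding:
            new = Obj(old.name, list(old.refs))   # refs still point at old objects for now
            forwarding[id(old)] = new
            to_space.append(new)
        return forwarding[id(old)]

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):                   # fix up refs of each copied object
        obj = to_space[scan]
        obj.refs = [forward(r) for r in obj.refs]
        scan += 1
    return new_roots, to_space

a, b = Obj("a"), Obj("b")
a.refs.append(b)
g1, g2 = Obj("g1"), Obj("g2")
g1.refs.append(g2)
g2.refs.append(g1)            # unreachable cycle: never visited
new_roots, to_space = copy_collect([a])
print([o.name for o in to_space])   # ['a', 'b']
```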

bitwize 3 years ago |

Boy, I can't wait for theangeryemacsshibe (posts here as hayley-patton) to tear into this one.

But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index.
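Here is a minimal sketch of the index-based graph described above, using a hypothetical `Graph` class in Python. The graph itself can contain cycles. Because edges refer to vertices by index rather than by pointer, though, no object references form a cycle, so there is nothing for RC to leak:

```python
from dataclasses import dataclass, field

@dataclass
class Graph:  # hypothetical container; owns all vertex and edge storage
    vertices: list = field(default_factory=list)   # vertex payloads, addressed by index
    edges: list = field(default_factory=list)      # (src, dst) index pairs, not object pointers

    def add_vertex(self, label):
        self.vertices.append(label)
        return len(self.vertices) - 1              # the index is the vertex's handle

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def neighbors(self, v):
        return [dst for src, dst in self.edges if src == v]

g = Graph()
a, b, c = (g.add_vertex(name) for name in "abc")
g.add_edge(a, b)
g.add_edge(b, c)
g.add_edge(c, a)   # a cycle in the graph, but not among the objects that own memory
print([g.vertices[i] for i in g.neighbors(c)])   # ['a']
```

`neighbors` scans every edge, which is O(E). Per-vertex adjacency lists of indices would be the usual next step and keep the same no-cycles property.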

If you use a language with GC, you're probably just contributing to global warming.

kaba0 3 years ago | |

Why not just write embedded programs with fixed size memory allocation then if we are that okay with restricting the programs we write?

bitwize 3 years ago | | |

Because maybe we're not okay with restricting the programs we write that much.

eru 3 years ago | |

Doesn't RAII only work when your lifetimes are in a nested hierarchy?

(Basically, your lifetimes have to be the same as your scopes, which are in a simple tree structure only.)