Parallel garbage collection for SBCL [pdf](applied-langua.ge) |
Parallel garbage collection for SBCL [pdf](applied-langua.ge) |
Steel Bank Common Lisp 2.3.8 released: “a mark-region parallel GC is available” - https://news.ycombinator.com/item?id=37295611
As an aside, thanks for your efforts. Adding a new GC is a major effort, which improves things for all of us! :D
Can I ask a newbish question: will this new GC be available on all OS:s and CPU architectures, or only on some? I don't see anything in the paper about being limiting to some certain platform, so my hopes are high :).
The old gencgc was pretty cool for the single core era, and it sounds like it still holds up well. If I recall correctly, it was based on the Bartlett Mostly Copying paper, which is an elegant and practical GC design. https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-12.pdf. I miss these old papers that described this stuff in a way you didn’t have to be a math major to understand. I think the first version of that paper had the C++ code as an appendix: https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-88-2.pdf.
Clarity in your technical communications matters. The Immix papers are similarly well written and clear. I don’t think it’s a surprise that both GC designs have also been independently implemented over and over. The Chaitin-Briggs register allocator is another example where I’d attribute at least some of the success in widespread industrial implementation to Briggs’ excellent and approachable PhD thesis describing the algorithm: https://www.cs.utexas.edu/users/mckinley/380C/lecs/briggs-th...
Article: https://wingolog.org/archives/2023/02/07/whippet-towards-a-n...
Indeed; it does.
I wish more people watched that Steels talk, where he speaks about importance to use accessible language. One-syllabus words all the way would be perhaps a bit tedious read, but he makes a very good point about clarity in connection to simplicity.
Thanks for all the links Raiyner.
- SBCL is callable as a shared library (see sbcl-librarian)
- sb-simd
- prebuilt binaries for Android (termux, unofficial)
- better image compression using zstd
- I'll add https://github.com/sionescu/sbcl-goodies, binaries with "goodies" inside (OpenSSL, libfixposix)
yay!
[0]: https://lisp-journey.gitlab.io/blog/these-years-in-common-li...
bonus from 2021: true static binaries are coming https://www.timmons.dev/posts/static-executables-with-sbcl-v...
This is impressive progress, especially for a language so many have written off as moribund. And that's just one implementation of the spec. ECL, ABCL, and Clasp are all progressing at a brisk pace too. Maybe the only good language is a dead language.
Correct me if I'm wrong but this has not happened yet.
Does anyone know how difficult implementing a Real-Time GC would be for SBCL or ECL. I know of that paper by Rodney Brooks (L -- A Common Lisp for Embedded Systems)
From a technical (i.e. useless) point of view SBCL very nearly already has a real-time GC, with a few modifications it would qualify since already:
1. The amount of work in a GC operation is bounded by the heap size
2. SBCL has a fixed maximum heap size
Two things would need to be done:
1. Ensure the areas where GC is inhibited are bounded
2. Call GC operations on a timer, and when they are done, ensure there is sufficient free space; RTGC cannot exist without specific bounds on the mutator, so you could almost certainly invent bounds on the mutator that would make the gencgc qualify as real-time for.
#1 would need to be done for any actually useful RTGC anyways.
A slightly less snarky answer is that a non-toy GC is a lot of work. Different choices in GC will affect code-generation (e.g. read and/or write barriers GC, and forwarding-pointers for incremental GC[1]).
There's a reason the gencgc has been around as long as it has, and it's because it's "good enough" for a lot of people and the work needed to replace it (with any non stop-the-world GC anyways) is rather large. Even TFA is neither incremental nor concurrent, just parallel.
1: Stop the World GCs may also use forwarding pointers, but codegen doesn't need to know about them because they are fully resolved before the mutator continues.
That there is no pauses happening during the runtime?
As such, it will greatly increase throughput if given multiple cores, and will lead to lower pause times based on this improvement in throughput.
If your technical (theoretical) definitions are useless, you need to use different definitions. A practical definition of “real-time” GC is a minimum mutator utilization bound for any program given sufficent RAM, but the only GC I’m aware of that does this (despite the abundance of papers on GC wirh “real-time” in the title) is Klock’s regional collector[1]. Unfortunately, it seems to be impractically complex. [To be fair, I don’t think any implementation of malloc() in common use would satisfy this constraint either.]
I mean, they release on a monthly basis... pretty much active I would argue? https://www.sbcl.org/all-news.html
With that said, the upcoming immix collector is really friggin exciting.
If given many cores or perhaps in a later revision with improved efficiency. Today throughput on a four core CPU seems to be lower.
* I expect latency to be better with compacting, but I haven't benchmarked, so I haven't a clue.
I'm not sure I agree with the comment about "up to the task for real-time games", but maybe my standards are low for video games, and I have an abnormally fast machine and am likely biased there too.
This latency problem is why I’m dabbling with raylib and LFE of all things. The BEAM cannot hold a candle to SBCL for performance code, but the GC latency can be managed better.
Anyway, I don’t want to go on a tangent, since I really appreciate your efforts; there are plenty of domains that the new SBCL GC will be a big deal for.
And the new SBCL release aids this approach by expanding the cases where the compiler will stack allocate.
The question of "what is a real-time GC" is academic because nobody actually wants a real-time GC. They want a real-time system. And some GC algorithms can be used to construct a real-time system. Indeed, as others have noted, SBCL can be used to construct a real-time system if you don't do any allocations.
A soft real time system is one in which every instruction that the programmer writes has a known well-defined maximum time bound in which it executes. This can be achieved in a GC system if the GC pause time also has a known well-defined time bound, and if it is well defined which instructions can or can't be paused by the GC for that (max) amount of time.
This constraint is important for building systems because it allows the programmer to guarantee that certain parts of their program will fit within a given time bound, such as ensuring that they can fit in the window for rendering a frame to achieve 60FPS or per-packet processing to achieve 1G PPS etc.
The commercial Java implementation JamaicaVM claims to have such a soft real time guarantee, with maximum GC pause times in the order of microsec. They have a paper about it presented at the ACM [0]. I haven't read the paper myself or used their product in any way, I'm only presenting their claims.
Hard real time systems require a fixed time per operation, which would allow one to guarantee that, say, a given loop executes some operation at a precise frequency, which may be critical for things like analog signal processing. This is typically much harder to achieve using a GC system.
I'm going to push back against that definition since:
1) RAM is usually bounded
2) A "never free" allocation strategy meets this requirement.
3) The "minimum mutator utilization bound" needs to be at a given time-slice anyways; e.g. "No more than 50% utilization in any 10ms window" rather than just "No more than 50% utilization." Throughput optimized GCs already give a very low utilization over a very long window; it's narrow windows that they struggle with.
I think a practical definition gives minimum bounds for:
M: fraction mutator utilization over a narrow time window T (where narrow depends on the problem domain; the orders of magnitude from 10us to 100ms are all reasonable depending on the domain)
O: heap size overhead (as a greater than unity multiple of live data)
And conditions them on peak allocation rate R, and heap size S, which each may have some reasonable maximum value.
If you pick non-reasonable values for R, T, and S then "obviously not realtime" GCs can be made to qualify. If you don't condition your GC on R, T, and S then actual shipping real-time systems using GC don't qualify (e.g. Metronome/Stacatto based systems).
> A "never free" allocation strategy meets this requirement.
Yeah, I could have put that better. “Sufficient” probably needs to be defined as a reasonable function of the program’s working set—I’d have said “independent of the program” instead of “reasonable”, except that’d make any non-compacting malloc necessarily non-real-time, as the (Robson) bound[1] there has to depend not only on the working set but also on the ratio between the smallest and largest allocations. On the other hand, maybe forbidding non-compacting mallocs is actually reasonable, because that bound is impractically large anyway.
> The "minimum mutator utilization bound" needs to be at a given time-slice anyways; e.g. "No more than 50% utilization in any 10ms window" rather than just "No more than 50% utilization."
No, the “minimum” in the name refers to a minimum (proportion of time) over all windows. See link in GP and the thesis referenced there for a discussion of the dependency on the window size (which is a bit funny for all allocators).
That's just false; I can't think of a GC that can maintain more than 0% utilization in a time-window the length of a compare-and-swap operation (which could be 100s of cycles on some CPUs)
Just pushing triangles isn't a problem (so far), but I'm still unclear how much of the engine logic will need to live in a low level language versus within LFE. The BEAM also presents certain architectural possibilities (especially for simulation), that are completely atypical for games.
It's promising enough that I'll continue pursuing this; we'll see where I end up in a few more months. Sorry I can't give more meat than that!
Also problems with GC basically don't exist when running Erlang/LFE code, at least you have to put a lot of effort into getting problems with GC. A lot! And, of course, the BEAM does automatic load balancing over multiple cores which is another problem gone away. We never have to think about the class of problems like "I have my system set up to run on 2 cores, how should I modify it to run on 4?".
Do you have a git repo i can star and check back for updates ?
Are you developing on Linux or MacOSX ?
I'm keeping a list of awesome lfe projects , even if you make it to a 'demo' stage, I'd love to list it, please consider letting me know.
I will let you know. It's extra motivation to me too! Someone else is interested!