Reverting the incremental GC in Python 3.14 and 3.15(discuss.python.org) |
Reverting the incremental GC in Python 3.14 and 3.15(discuss.python.org) |
A couple of services looked like they had a memory leak. Memory was continuously increasing over time. Thanks to Python 3.14, we were able to use memray to understand what was going on. Those services were recreating HTTP clients (aiohttp) for every inbound request, and memory allocated by the downstream SSL lib was growing faster than it was being released.
We ended up rolling back to 3.13, which fixed the issue. I'll try again with 3.14.5.
The reference cycle httpx creates is kind of a worst-case scenario for the incremental GC issue. Both the generational (3.13 and older) and incremental GC are triggered by the net new "container" objects (objects that have references to others, like lists and not like ints and floats). The short summary is that you need to create more container objects before the incremental GC triggers. In the case of the httpx reference cycle, you have a relatively small number of container objects hanging on to a lot of memory, due the SSL context data (which is a big memory hog).
Reverting back to the generational GC was the wise thing to do, even though it's a bit scary to do in a bugfix release. The incremental GC works for most people but in the minority of cases it doesn't, it uses quite a lot more memory. I'm pretty sure with some additional tuning, the incremental GC would be fine too but it just didn't get that tuning. The generational GC has literal decades of real-world use (Guido merged my patch on Jun 2000, Tim Peters did a bunch of tuning after that to optimize it).
Unfortunately, you may be the wrong gender to contribute to Encode repositories like httpx:
> I've closed off access to issues and discussions.
> I don't want to continue allowing an online environment with such an absurdly skewed gender representation. I find it intensely unwelcoming, and it's not reflective of the type of working environments I value.
— https://github.com/encode/httpx/discussions/3784
Discussed on Hacker News here: https://news.ycombinator.com/item?id=47193563
A fork discussed here: https://news.ycombinator.com/item?id=47514603
It's annoying that somehow talking to S3 etc requires so much churn. We have been trying to cache session objects and the like but clearly are still missing something.
Chasing this down has also made me realize how little Python libs use `weakref`, and just will build up so many circular references. The other day I figured out Django request's session infrastructure creates a circular reference meaning that requests have to get GC'd to get cleaned up in CPython.
I have a suspicion that the 3.14 problems are heavily linked to "real" workloads being almost entirely filled with cyclical objects.
Is there a way to make all this much easier to debug and to prevent memory issues in the first place? Is the abstraction level not quite right?
We’ve decided to revert it in both 3.14 and 3.15, and go back to the generational GC from 3.13."
Sounds the right move for me
I thought that by now dynamic garbage collection was a known quantity so that making changes, outside of out right bugs, is fairly safe and predictable?
So any change to GC starts with massive .Net MSFT code base so they get extremely good telemetry back about any downsides and might be able to fix it in time.
There is almost no dog fooding on Windows development since version 8, Typescript team rather rewrite the compiler in Go, Azure has plenty of Go, Rust and Java projects alongside .NET.
I’ll confess the reason it hit us so hard is because the code quality was so low and wasteful on allocations that it didn’t hide the problem as well as previous versions.
So I think it was not a big problem for .Net because it gave you enough control over GC, and because people tested their code before putting it in production.
One of heuristics to find memory leaks can be stated as follows: if you instantiate any HTTP or connection objects inside the request handler, it's likely that you've made a mistake.
I hope Meta switches Instagram to PHP/Hack so they leave Python alone.
Free-threading actually uses its own, separate GC: https://labs.quansight.org/blog/free-threaded-gc-3-14
You are free to switch language but you still need to understand it.
Also, even if it looks like that to you, there are still people that write code with their own hands.
PyPy doesn't have the support it needs and is stuck on 3.11.
Obviously it's not easy to move the whole language of a big codebase, but I feel a lot of this stuff (fiddling with GC, JITing, type hints, and I'm dubious about the free-threading stuff) tries to take Python somewhere it isn't really good at, and if that's what you want, you really want a different language.
It's predictable vs Rust, C#, F#, Elixir, Go, etc.?
Java? Nope, you're getting a fundamental change in Valhalla C++? Nope, new language edition every few years with fundamental changes C? C23 has a number of fairly fundamental changes, expect more in the next language revision
I think your sense of causality is backwards here. These languages are getting fundamental changes because they're being widely used. That is what motivates and drives the change. Languages with no users don't need to change.
So for example
class Request
class Session
request.session exists, and the session is "part" of the request. but session.request often exists as a facility. That's a reference cycle which prevents the request (and anything it's pointed at!) from being deallocated at the end of a request.
But in this case, you could easily do something like:
session._request = weakref.ref(request) # on session creation
and then have session.request call session._request() (and maybe assert session._request() is not None if you want to be certain). If you're confident that the session is a "child" of the request, and that you would _never_ have a hold of the session after the request is done, this is a cheap trick that makes session.request cost a little bit more but not much.
I think most Python libraries just don't do memory perf analyses here, and also "believe" in the garbage collector. When GC runs, both request and session will get deallocated, after all! But the long term effects of everyone relying on the GC are that GC is expensive when it doesn't need to be, and when looking through memory you just have more stuff to dig through
But most such languages handle much better the compatibility with legacy applications.
Python is the main culprit in most cases when I see conflicts between various software packages that insist to use only a specific version of their dependencies. This is why I have to keep installed many versions of Python, and the Linux distribution that I use must take care to prevent interference between those Python versions.
That's fine, but that's clearly not what I'm talking about.
Languages like F#, Elixir, etc. don't undergo fundamental changes. Yes, every language evolves. But for Python, we're talking about grafting literally fundamental stuff on top of a language not designed for any of these things.
For example, if someone went and redesigned Python to solve its warts, you'd basically end up with F#.
People parrot to use OpenJDK without understanding it is mostly Oracle employees working on it.
And if you dislike Oracle, the other minor contributors are Red-Hat, IBM, SAP, Microsoft, Alibaba, Azul,... which for many HNers are the same.
jython went EOL.with python 2 going EOL.
We are just different. That's not something to be mad about.
Go is verbose partly for that reason, but a silly loop is a silly loop. The constraints are clear, you only have to do the logic.
Python has a different problem: it is slow as f---. I did a micro benchmark comparison against 5 other languages in preparation for my python replacement language. Outside of dictionary lookups, it is 50-600 times slower than C depending on the workload.
Go, Rust etc are fine. They land at 1.25-3x slower than C. But I prefer the readability of python minus its dynamic nature.
Many early programming languages handled errors by providing multiple return addresses to functions.
Thus a function returned to the place of invocation in the normal case, otherwise it returned to one of the error handlers that were grouped at the end of the function that invoked it, away from the happy path.
This was more efficient than the modern way of returning an error code, as it eliminated the superfluous testing of the error code and the expensive conditional jump based on the result of the testing.
Moreover, in my opinion in the majority of the cases maximum information about an error is inside the function that has invoked the function where the error happened and neither in an exception handler several levels above it, nor inside the invoked function, so the place where the error handlers had to be put in this old method is actually optimal for meaningful error messages.
From the point of view of the source text, this old error handling method is equivalent with using exceptions with the constraint of placing the exception handlers in the function that has invoked the function that generates the exception. This limitation enabled a more efficient implementation than for exceptions where the handler can be placed anywhere above the invoked function and a complex stack unwinding may be necessary.
Not to mention that there are differences in ecosystem, familiarity, and ergonomics that may make a team want to stick with Python.
“Just use Go” is not really actionable advice in most cases.
$ time ./a.out
real 0m0.002s
user 0m0.000s
sys 0m0.002s
A do-nothing Go program: $ time ./tmp
real 0m0.002s
user 0m0.000s
sys 0m0.003s
I don't believe Go has any optimizations to not start its runtime if it isn't necessary, but when I added spawning a goroutine that immediately blocks on a channel read that will never come the numbers didn't change. That doesn't really time the runtime. Probably the program terminated before the goroutine was scheduled to run anything. It just makes it so there definitely wasn't an early exit because the compiler or the runtime "realized" it didn't need to start the runtime.I'm sure the Go program is somewhat slower to start and end than C, and that we're running into the limits of how quickly processes can be spawned and other timing overhead which is obscuring the difference. However for practical purposes, "it starts up in less than the overhead for starting a process in the shell" is the same speed for most purposes.
Not even a "do nothing" Python program, no Python program at all:
$ time python3 -c 1
real 0m0.012s
user 0m0.008s
sys 0m0.004s
If you had a Go program that was slow to start up, it was your program, not Go. By contrast, Python, and the dynamic scripting languages in general, can be quite slow to start up, just in the reading and compiling of the code. (Even .pyc files, IIRC, take processing, just less processing than Python source code... it's still nowhere near "memory map it in and go" as it is for statically-compiled languages.)I suppose Go programs are slower than the equivalent thing in C or C++, but I'm not sure that's a very relevant comparison in most cases today (how many new things being written would choose those languages).
Windows Development is not "We are not dogfooding", it's that incentives are misaligned with customer wants.
.Net team incentives are aligned with customer wants, provide a language that is highly performant and easy enough to write.
I have my WinRT 8, UAP 8.1, UWP 10, Project Reunion, .NET Native, C++/CX, C++/WinRT, XAML Islands, XAML Direct, WinUI 2.0, WinUi 3.0, WinAppSDK and what not scars to prove how they aren't dog fooding any piece of it in any meaningful manner.
Heck they keep talking about C++ support in WinUI 3, as if the team hasn't left the project and is now playing with Rust instead.
They managed that plenty of early WinRT advocates became their hardest critics, while not believing anything else they put out, like now this Windows K2 project.
Go is, essentially, nearly perfect at what it does - even if the language itself leaves much to be desired and would ideally be much safer.
Microsoft should up their game. They have a few research languages in development.
They've always been great with languages. Hopefully, they rise to the occassion.
Now we're stuck with it in anything CNCF related.
Sure, Rust is completely different beast with different target system.
> (Mocking) Yes, that's why we should go back to Y with even worse static analysis.
Sure
-- Rob Pike
Go fits well close to Oberon released in 1987, or Limbo in 1995, when exceptions and generics were still esoteric features.
Instead they had to reach out to Phil Wadler to help them, as he did previously with Java almost a decade earlier, panic/recover is clunky way to do exceptions, instead of doing enumerations like Pascal in 1976, it needs a a iota/const code pattern, hardcoded urls for source repos, if err all over the place like last century programming, many errors are plain strings, ah and nil interfaces what a great gotcha.
Lately, they seems to work with CRIU, various heuristics, multi-stage in-process bytecode compilation ..
Java is a mess, they are working hard to avoid fixing their issue (that nobody else have, so fixes are available)
Compared to Python's, all of them are beyond perfect. And 99.9% of the time you don't even need to use anything but the default.
I somehow understand the situation less after reading this.
Is Python's GC bad, or are there cyclic reference issues? Is it possible to detect cyclic references perfectly? What does beyond perfect mean? If we have 7 and 0.1% of the time you need one of the 6 that is non-default, how do we choose? Is the understated version of "Compared to Python's, all of them are beyond perfect" "I think Java's are great"? If not, what about Python's impl makes it so lackluster to any of 7 of Java's?
Not sure what you mean by this, as this has nothing to do with GC, and Java has had a multi-tier optimising compiler for 15 years now.
> that nobody else have, so fixes are available
Go has much worse problems with GC than Java does these days, and nobody else is able to achieve similar performance in large programs with heavy workloads. So everyone else lives with less sophisticated compilers and memory management simply by accepting worse performance.
Your last phrase is probably is joke ? If not, please share reproductible numbers: some things per watt, some things per MB of memory.
Yes. The GCs in Java, .NET, V8, and Go do it.
> If we have 7 and 0.1% of the time you need one of the 6 that is non-default, how do we choose?
Java's GC are optimised for different workloads and environments, and when the choice matters, they're easy to choose among:
1. Parallel GC: Maximal throughput when latency doesn't matter (batch processing).
2. Serial GC: Very small machines.
3. ZGC: low latency (<<1ms maximal pause, i.e. effectively pauseless)
4. G1 (the current default): A balanced mix of throughput and latency.
These are all the standard GCs (the seven you mentioned include a GC similar to Go's that was removed years ago, an "no op" GC for benchmarking hidden behind a development flag, and alternative implementations by different companies to some of the ones above).
It's possible that either Serial or Parallel will be removed when G1 is able to fully replace them.
Now, why do users need options? Because Java runs most of the world's finance, manufacturing, shipping and logistics, telecommunication, travel, healthcare, retail, defence, and government. We're talking large, complex software that handles huge workloads, and the needs vary. What works well enough for a CLI dev tool or a simple website is often not good enough to handle the world's credit card transaction processing or mobile phone networks.
> If not, what about Python's impl makes it so lackluster to any of 7 of Java's?
Java's GCs are moving collectors, which offer advantages not just compared to Python's GC but to all memory management strategies. Memory management (even in C) imposes a CPU/RAM tradeoff. Moving collectors (used in Java, .NET, and V8) give you a knob for controlling the tradeoff, i.e. they're able to convert RAM to CPU (i.e. use RAM chips as a hardware accelerator) and vice-versa.
Unless you're being pedantic and including reference counting without cycle detection as GC, if your GC has cyclic reference issues, your GC is bad.
> Is it possible to detect cyclic references perfectly?
Yes? That's the entire point of tracing GC. You have some set of root objects that you start with (globals, objects on thread stacks, etc.) and then you mark every object that's reachable from them. Anything that's not reachable is garbage, even if there are cycles within them.
Both can be true. The first can even be wholly or partly due to the second.
On addition, the way it does it via RC causes fragmentation, poor locality for caches, and general slowness for mass allocations. And it's one-size-fits-all.
Java has a much larger selection to pick to finetune specific use cases, which each being far greater for that use case. And the default no-need-to-think one (G1 iirc), is already faster and better than Python's.
Memory allocator: tcmalloc, jemalloc, they are concerned with fetching (and releasing) pages of memory from the OS and allocating objects for the program
GC is only responsible for saying to the memory allocator "this object is no longer used"
(please stay focused on java)
This is simply not true. Not only are Java's GCs responsible for allocation (which they do simply by bumping a pointer, similar to stack allocation), unlike Python's refcounting collector or Go's nonmoving tracing collector, they have no free operation of any kind and never free objects. Moving collectors don't even know when an object is freed. The way they work is that they compact the live objects, and because the dead objects are invisible to them, they happen to write over them when compacting the live ones. Refcounting collectors and nonmoving tracing collectors do use a free-list-based allocator that they use for allocation and deallocation, but moving collectors work completely differently.
Moving collectors can be so efficient that in the eighties, there was a famous paper about them called "Garbage Collection Can Be Faster Than Stack Allocation" [1] showing that the cost of managing an object's lifetime with a moving collector can, in principle, be less than a single machine instruction. That's why the cost of heap memory management in Java (and other runtimes that employ moving collectors) cannot be compared to the cost of memory management in languages using free lists, be it C or Python. Their operation is just too different (e.g. in Java, assigning a value to an object field or setting an array cell could sometimes be more expensive than allocating a brand new object/array).
No, you're missing the fact that the allocation of memory and the GC go hand in hand, because you need it so for optimizations. They are designed together to cooperate in modern runtimes.
While the JDK's don't work quite like this, a simple way to picture it is as a a contiguous memory buffer, say, 100MB large. The GC compacts the live objects to the bottom of that buffer. At that point they do know where the end of the used memory is. Say that after compaction, the live objects occupy the bottom 20MB of the buffer (overwriting any dead objects that may have been there). At that point, the GC can choose to uncommit the top 30MB of the buffer and return it to the OS.
With malloc/free there's also this separation. A free operation marks the memory of a freed object in a data structure called a free list. If all the objects in a certain page have been freed, the allocator can choose to uncommit it and return it to the OS (although many modern allocators choose not to do this promptly).
The operation of moving collectors and where there cost is is completely different from free-list approaches. With free list approaches there's some work done to allocate an object and some work done to free it. With moving collectors, there's very little work to allocate an object (typically, just a pointer bump) and no work to free it; there is, however work to keep the object alive. That's why, especially with generational collectors, moving collectors do very little work to manage the memory of objects that don't live for long.
This is why, when working in Java, it's important not to think about the heap as we do in C. For short-lived objects, the cost of allocatng and "deallocating" them is often not significantly higher than the cost of allocating and deallocating an object on the stack in C. On the other hand, mutating an existing long-lived object could sometimes require some bookkeeping work by the GC, and could be much more costly than allocating (and "deallocating") a new object. That's why Java programmers are discouraged from pooling objects to "help the GC", something that Go developers often do when they run into issues with their non-moving, non-generational collector.