Free-threaded CPython is ready to experiment with

Free-threaded CPython is ready to experiment with (labs.quansight.org)

532 points by ngoldbaum 1 year ago | 378 comments

eigenvalue 1 year ago |

Really excited for this. Once some more time goes by and the most important python libraries update to support no GIL, there is just a tremendous amount of performance that can be automatically unlocked with almost no incremental effort for so many organizations and projects. It's also a good opportunity for new and more actively maintained projects to take market share from older and more established libraries if the older libraries don't take making these changes seriously and finish them in a timely manner. It's going to be amazing to saturate all the cores on a big machine using simple threads instead of dealing with the massive overhead and complexity and bugs of using something like multiprocessing.

pizza234 1 year ago | |

> using simple threads instead of dealing with the massive overhead and complexity and bugs of using something like multiprocessing.

Depending on the domain, the reality can be the reverse.

Multiprocessing in the web serving domain, as in "spawning separate processes", is actually simpler and less bug-prone, because there is considerably less resource sharing. The considerably higher difficulty of writing, testing and debugging parallel code is evident to anybody who's worked on it.

As for the overhead, this again depends on the domain. It's hard to quantify, but generalizing to "massive" is not accurate, especially for app servers with COW support.

bausgwi678 1 year ago | | |

Using multiple processes is simpler in terms of locks etc, but python libraries like multiprocessing or even subprocess.Popen[1], which make using multiple processes seem easy, are full of footguns which cause deadlocks because fork safety is not well understood. I’ve seen this lead to code ‘working’ and being merged but then triggering sporadic deadlocks in production after a few weeks.

The default for multiprocessing is still to fork (fortunately changing in 3.14), which means all of your parent process’ threaded code (incl. third party libraries) has to be fork-safe. There are no static analysis checks for this.

This kind of easy to use but incredibly hard to use safely library has made python for long running production services incredibly painful in my experience.

[1] Some arguments to subprocess.Popen look handy but actually cause python interpreter code to be executed after the fork and before the execve, which has caused production logging-related deadlocks for me. The original author was very bright but didn’t notice the footgun.
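
A minimal sketch of one way to sidestep the fork-safety problem described above: ask for the "spawn" start method explicitly instead of relying on the platform default. The names `square`, `make_spawn_context` and `run_demo` are illustrative, not from the thread.

```python
import multiprocessing as mp


def square(x):
    # Illustrative worker; it must live at module level so that a
    # spawned child can import it by name.
    return x * x


def make_spawn_context():
    # "spawn" starts a fresh interpreter for each worker instead of
    # fork()ing the parent, so children do not inherit locks that
    # parent threads happened to hold at fork time.
    return mp.get_context("spawn")


def run_demo():
    ctx = make_spawn_context()
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, range(5))
```

In a script, `run_demo()` should be called under an `if __name__ == "__main__":` guard, because spawned children re-import the main module. The trade-off is slower worker startup than with fork.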

skissane 1 year ago | | |

Just the other day I was trying to do two things in parallel in Python using threads - and then I switched to multiprocessing - why? I wanted to immediately terminate one thing whenever the other failed. That’s straightforwardly supported with multiprocessing. With threads, it gets a lot more complicated and can involve things with dubious supportability
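
A minimal sketch of why this is more work with threads: a thread cannot be terminated from outside, so the long-running side has to poll a shared flag. The function names are hypothetical and the "work" is a placeholder.

```python
import threading


def failing_task(stop):
    # Simulates the task that fails; it signals the other thread to stop.
    try:
        raise RuntimeError("task failed")
    except RuntimeError:
        stop.set()


def long_task(stop, log):
    # Cooperative cancellation: check the flag between units of work.
    while not stop.wait(timeout=0.01):
        pass  # one unit of real work would go here
    log.append("stopped")


def run_both():
    stop = threading.Event()
    log = []
    worker = threading.Thread(target=long_task, args=(stop, log))
    worker.start()
    failing_task(stop)
    worker.join(timeout=5)
    return log, worker.is_alive()
```

With `multiprocessing.Process`, the equivalent is a single `terminate()` call, and it works even if the worker is stuck inside code that never checks a flag.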

phkahler 1 year ago | |

I feel like most things that will benefit from moving to multiple cores for performance should probably not be written in Python. OTOH "most" is not "all" so it's gonna be awesome for some.

wongarsu 1 year ago | | |

I often reach for python multiprocessing for code that will run $singleDigit number of times but is annoyingly slow when run sequentially. I could never justify the additional development time for using a more performant language, but I can easily justify spending 5-10 minutes making the embarrassingly parallel stuff execute in parallel.
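
A minimal sketch of that 5-10 minute change, assuming the work splits into independent items. `slow_step` and `run_parallel` are made-up names; the executor classes are from the standard library.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def slow_step(n):
    # Stand-in for one independent, CPU-bound unit of work.
    return sum(i * i for i in range(n))


def run_parallel(items, executor_cls=ThreadPoolExecutor, workers=4):
    # Executor.map preserves input order, so results line up with items.
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(slow_step, items))
```

Passing `executor_cls=ProcessPoolExecutor` gives the multiprocessing behaviour used today. On a free-threaded build the thread pool version could in principle use multiple cores for pure-Python work without the pickling step.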

eigenvalue 1 year ago | | |

I personally optimize more for development time and overall productivity in creating and refactoring, adding new features, etc. I'm just so much faster using Python than anything else, it's not even close. There is such an incredible world of great libraries easily available on pip for one thing.

Also, I've found that ChatGPT/Claude3.5 are much, much smarter and better at Python than they are at C++ or Rust. I can usually get code that works basically the first or second time with Python, but very rarely can do that using those more performant languages. That's increasingly a huge concern for me as I use these AI tools to speed up my own development efforts very dramatically. Computers are so fast already anyway that the ceiling for optimization of network oriented software that can be done in a mostly async way in Python is already pretty compelling, so then it just comes back again to developer productivity, at least for my purposes.

jillesvangurp 1 year ago | | |

Right now you are right. This is about taking away that argument. There's no technical reason for this to stay true. Other than that the process of fixing this is a lot of work of course. But now that the work has started, it's probably going to progress pretty steadily.

It will be interesting to see how this goes over the next few years. My guess is that a lot of lessons were learned from the python 2 to 3 move. This plan seems pretty solid.

And of course there's a relatively easy fix for code that can't work without a GIL: just do what people are doing today and just don't fork any threads in python. It's kind of pointless in any case with the GIL in place so not a lot of code actually depends on threads in python.

Preventing the forking of threads in the presence of things still requiring the GIL sounds like a good plan. This is a bit of meta data that you could build into packages. This plan is actually proposing keeping track of what packages work without a GIL. So, that should keep people safe enough if dependency tools are updated to make use of this meta data and actively stop people from adding thread unsafe packages when threading is used.

So, I have good hopes that this is going to be a much smoother transition than python 2 to 3. The initial phase is probably going to flush out a lot of packages that need fixing. But once those fixes start coming in, it's probably going to be straightforward to move forward.

jodrellblank 1 year ago | | |

https://www.servethehome.com/wp-content/uploads/2023/01/Inte...

AMD EPYC 9754 with 128-cores/256-threads, and EPYC 9734 with 112-cores/224-threads. TomsHardware says they "will compete with Intel's 144-core Sierra Forest chips, which mark the debut of Intel's Efficiency cores (E-cores) in its Xeon data center lineup, and Ampere's 192-core AmpereOne processors".

What about in 5 years? 10? 20? How long will "1 core should be enough for anyone using Python" stand?

Derbasti 1 year ago | | |

A thought experiment:

A piece of code takes 6h to develop in C++, and 1h to run.

The same algorithm takes 3h to code in Python, but 6h to run.

If I could thread-spam that Python code on my 24 core machine, going Python would make sense. I've certainly been in such situations a few times.
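
The arithmetic behind the thought experiment, as a small sketch. It assumes ideal linear scaling across cores, which real workloads rarely reach.

```python
def total_hours(dev_hours, run_hours, cores=1):
    # Development time plus run time, with the run split evenly over cores.
    return dev_hours + run_hours / cores


cpp_total = total_hours(6, 1)               # 6h to write + 1h to run
python_single = total_hours(3, 6)           # 3h to write + 6h to run
python_24_cores = total_hours(3, 6, cores=24)
```

Under that assumption C++ costs 7h in total, single-threaded Python 9h, and Python on 24 cores 3.25h. The break-even point is 3 + 6/n < 7, so anything better than 1.5x effective speedup already favours Python for a one-off run.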

DanielVZ 1 year ago | | |

Usually performance critical code is written in cpp, fortran, etc, and then wrapped in libraries for Python. Python still has a use case for glue code.

tho34234234 1 year ago | | |

It's not just about "raw-flop performance" though; it affects even basic things like creating data-loaders that run in the background while your main thread is doing some hard ML crunching.

Every DL library comes with its own C++ backend that does this for now, but it's annoyingly inflexible. And dealing with GIL is a nightmare if you're dealing with mixed Python code.

MBCook 1 year ago | | |

But it would give you more headroom before rewriting for performance would make sense right? That alone could be beneficial to a lot of people.

paulddraper 1 year ago | | |

> should not be written

IDK what should and shouldn't be written in Python, but there are a very large # of proud "pure Python" libraries on GitHub and HN.

The ecosystem seems to even prefer them.

fastasucan 1 year ago | | |

I never understand this sentiment, which shows up in every topic on python. Who decides why something should or should not be written in Python?

Why shouldn't someone who prefers writing in python benefit from using multiple cores?

wokwokwok 1 year ago | |

> there is just a tremendous amount of performance that can be automatically unlocked with almost no incremental effort for so many organizations and projects

This just isn’t true.

This does not improve single threaded performance (it’s worse) and concurrent programming is already available.

This will make it less annoying to do concurrent processing.

It also makes everything slower (arguable where that ends up, currently significantly slower) overall.

This is way overhyped.

At the end of the day this will be a change that (most likely) makes the existing workloads for everyone slightly slower and makes the lives of a few people a bit easier when they implement natively parallel processing like ML.

It’s an incremental win for the ML community, and a meaningless/slight loss for everyone else.

At the cost of a great. Deal. Of. Effort.

If you’re excited about it because of the hype and don’t really understand it, probably calm down.

Most likely, at the end of the day, it’s a change that is totally meaningless to you, won’t really affect you other than making some libraries you use a bit faster, and others a bit slower.

Overall, your standard web application will run a bit slower as a result of it. You probably won’t notice.

Your data stack will run a bit faster. That’s nice.

That’s it.

Over hyped. 100%.

anwlamp 1 year ago | | |

Yes, good summary. My prediction is that free-threading will be the default at some point because one of the corporations that usurped Python-dev wants it.

The rest of us can live with arcane threading bugs and yet another split ecosystem. As I understand it, if a single C-extension opts for the GIL, the GIL will be enabled.

Of course the invitation to experiment is meaningless. CPython is run by corporations, many excellent developers have left and people will not have any influence on the outcome.
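
For anyone who wants to check the GIL claim on their own install, a minimal sketch. `gil_status` is a made-up helper; `sys._is_gil_enabled()` is a private function that exists on CPython 3.13 and later, so the code falls back to assuming the GIL is on when it is missing.

```python
import sys
import sysconfig


def gil_status():
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Older interpreters lack the probe and always run with the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }
```

Calling `gil_status()` before and after importing a C extension is one way to see whether that import switched the GIL back on in a free-threaded interpreter.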

Uptrenda 1 year ago | | |

Why would it make single threaded performance slower? Sorry, but that's kind of ridiculous. You're just making shit up at this point.

quietbritishjim 1 year ago | |

If you're worried about performance then much of your CPU time is probably spent in a C extension (e.g. numpy, scipy, opencv, etc.). Those all release the GIL so already allow parallelisation in multiple threads. That even includes many functions in the standard library (e.g. sqlite3, zip/unzip). I've used multiple threads in Python for many years and never needed to break into multiprocessing.

But, for sure, nogil will be good for those workloads written in pure Python (though I've personally never been affected by that).
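
A minimal sketch of the standard-library case, using `hashlib`, which is documented to release the GIL while hashing larger buffers. The helper names are illustrative and no speedup figure is claimed here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block):
    # The hashing itself runs in C; for large inputs the GIL is released,
    # so several of these calls can overlap across threads.
    return hashlib.sha256(block).hexdigest()


def hash_blocks(blocks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blocks))
```

The pure-Python glue around each call still takes the GIL on a standard build, which is why this pattern only pays off when the C call dominates the runtime.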

Demiurge 1 year ago | |

Massive overhead of multiprocessing? How have I not noticed this for tens of years?

I use coroutines and multiprocessing all the time, and saturate every core and all the IO, as needed. I use numpy, pandas, xarray, pytorch, etc.

How did this terrible GIL overhead go completely unnoticed?

viraptor 1 year ago | | |

> I use numpy, pandas, xarray, pytorch, etc.

That means your code is using python as glue and you do most of your work completely outside of CPython. That's why you don't see the impact - those libraries drop the GIL when you use them, so there's much less overhead.

coldtea 1 year ago | |

>using simple threads instead of dealing with the massive overhead and complexity and bugs of using something like multiprocessing

I've never heard threading described as "simple", even less so as simpler than multiprocessing.

Threads means synchronization issues, shared memory, locking, and other complexities.

quotemstr 1 year ago | |

What about the pessimization of single-threaded workloads? I'm still not convinced a completely free-threaded Python is better overall than a multi-interpreter, separate-GIL model with explicit instead of implicit parallelism.

Everyone wants parallelism in Python. Removing the GIL isn't the only way to get it.

Galanwe 1 year ago | |

> It's going to be amazing to saturate all the cores on a big machine using simple threads instead of dealing with the massive overhead and complexity and bugs of using something like multiprocessing.

I'm saturating 192cpu / 1.5TBram machines with no headache and straightforward multiprocessing. I really don't see what multithreading will bring more.

What are these massive overheads / complexity / bugs you're talking about ?

saurik 1 year ago | |

FWIW, I think the concern though is/was that for most of us who aren't doing shared-data multiprocessing this is going to make Python even slower; maybe they figured out how to avoid that?

eigenvalue 1 year ago | | |

Pretty sure they offset any possible slowdowns by doing heroic optimizations in other parts of CPython. There was even some talk about keeping just those optimizations and leaving the GIL in place, but fortunately they went for the full GILectomy.

simonw 1 year ago |

I got this working on macOS and wrote up some notes on the installation process and a short script I wrote to demonstrate how it differs from non-free-threaded Python: https://til.simonwillison.net/python/trying-free-threaded-py...
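
A rough sketch of the kind of comparison such a script can make (this is not the script from the linked write-up): run the same pure-Python loop sequentially and in threads, then compare wall-clock times on each build. All names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy(n):
    # Pure-Python CPU-bound loop; on a standard build it holds the GIL.
    total = 0
    for i in range(n):
        total += i
    return total


def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare(n=200_000, jobs=4):
    sequential, t_seq = timed(lambda: [busy(n) for _ in range(jobs)])
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        threaded, t_thr = timed(lambda: list(pool.map(busy, [n] * jobs)))
    # Same answers either way; only the timings are expected to differ.
    return sequential == threaded, t_seq, t_thr
```

On a GIL build the two timings should be similar. On a free-threaded build with spare cores the threaded time would be expected to drop, though by how much depends on the machine.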

vanous 1 year ago | |

Thanks for the example and explanations Simon!

nine_k 1 year ago |

Python 3 progress so far:

  [x] Async.
  [x] Optional static typing.
  [x] Threading.
  [ ] JIT.
  [ ] Efficient dependency management.

vegabook 1 year ago |

Clearly the Python 2 to 3 war was so traumatising (and so badly handled) that the core Python team is too scared to do the obvious thing, and call this Python 4.

This is a big fundamental and (in many cases breaking) change, even if it's "optional".

blumomo 1 year ago | |

Did Python as the language change which justified that version bump?

mixmastamyk 1 year ago | | |

When on, there are incompatibilities yes.

There were a lot of smaller breaking changes over the years, especially 3.10 that probably should have been a 4.0.

Sparkyte 1 year ago |

My body is ready. I love python because of the ease of writing and logic. Hopefully the more complicated free-threaded approach is comprehensive enough to write it like we traditionally write python. Not saying it is or isn't; I just haven't dived enough into python multithreading because it is hard to put those demons back once you pull them out.

ameliaquining 1 year ago | |

The semantic changes are negligible for authors of Python code. All the complexity falls on the maintainers of the CPython interpreter and on authors of native extension modules.

stavros 1 year ago | | |

Well, I'm not looking forward to the day when I upgrade my Python and suddenly I have to debug a ton of fun race conditions.

hot_gril 1 year ago | |

What are the common use cases for threading in Python? I feel like that's a lower level tool than most Python projects would want, compared to asyncio or multiprocessing.Pool. JS is the most comparable thing to Python, and it got pretty darn far without threads.

BugsJustFindMe 1 year ago | | |

Working with asyncio sucks when all you want is to be able to do some things in the background, possibly concurrently. You have to rewrite the worker code using those stupid async await keywords. It's an obnoxious constraint that completely breaks down when you want to use unaware libraries. The thread model is just a million times easier to use because you don't have to change the code.
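
A minimal sketch of the "don't have to change the code" point: an ordinary blocking function is handed to a thread pool as-is. `blocking_fetch` is a hypothetical stand-in for a call into a library that knows nothing about asyncio.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(name):
    # Plain synchronous code; no async/await keywords anywhere.
    time.sleep(0.05)
    return f"done:{name}"


def run_in_background(names):
    # The worker is submitted unchanged and runs concurrently with its peers.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = [pool.submit(blocking_fetch, n) for n in names]
        return [f.result() for f in futures]
```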

kstrauser 1 year ago | | |

It’s hard to say because we’ve come up with a lot of ways to work around the fact that threaded Python has always sucked. Why? Because there’d been no demand to improve it. Why? Because no one used it. Why? Because it sucked.

I’m looking forward to seeing how people use a Python that can be meaningfully threaded. While it may take a bit to build momentum, I suspect that in a few years there’ll be obvious use cases that are widely deployed that no one today has even really considered.

bongodongobob 1 year ago | | |

Same as any other language. Separating UI from calculations is my most common need for it.

ZhongXina 1 year ago | |

Precisely, ease of writing, not ease of reading (the whole project, not just a tiny snippet of code) or supporting it long-term.

mihaic 1 year ago |

Does anyone know if there is more serious single threaded performance degradation (more than a few percent for instance)? I couldn't find any benchmarks, just some generic reassurance that everything is fine.

ngoldbaum 1 year ago | |

Right now there is a significant single-threaded performance cost. Somewhere from 30-50%. Part of what my colleague Ken Jin and others are working on is getting back some of that lost performance by applying some optimizations. Expect single-threaded performance to improve for Python 3.14 next year.

arp242 1 year ago | | |

To be honest, that seems a lot. Even today a lot of code is single-threaded, and this performance hit will also affect a lot of code running in parallel today.

There have been patches to remove the GIL going back to the 90s and Python 1.5 or thereabouts. But the performance impact has always been the show-stopper.

andmkl 1 year ago | | |

That would be in the order of previous GIL-removal projects, which were abandoned for that reason.

imtringued 1 year ago | | |

That kind of negates the whole purpose of multi threading. An application running on two cores might end up slower, not faster. We know that the python developers are kind of incompetent when it comes to performance, but the numbers you are quoting are so bad they probably aren't correct in the first place.
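
A back-of-envelope model of when two cores end up slower, not a benchmark. It reads the quoted 30-50% cost as a 1.3x to 1.5x runtime multiplier (an assumption; the thread does not say how it was measured) and applies an Amdahl-style estimate.

```python
def relative_runtime(slowdown, parallel_fraction, threads):
    # Runtime relative to a single-threaded GIL build, which is 1.0.
    serial_fraction = 1.0 - parallel_fraction
    return slowdown * (serial_fraction + parallel_fraction / threads)


fully_parallel = relative_runtime(1.5, 1.0, 2)   # 0.75: faster on 2 cores
half_parallel = relative_runtime(1.5, 0.5, 2)    # 1.125: slower on 2 cores
```

So under this model the answer depends on how much of the program actually parallelizes: at the pessimistic 1.5x multiplier, fully parallel work still wins on two threads, while a half-serial program loses.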

ngoldbaum 1 year ago | | |

Clarifying a few days later: single-threaded performance in the normal ABI with the GIL does not have the same performance degradation. You only see the performance hit if you’re testing the experimental 3.13 free-threaded release.

deschutes 1 year ago | |

To my understanding there is and there isn't. The driving force behind this demonstrated that it was possible to speed up the existing CPython interpreter by more than the performance cost of free threading with changes to the allocator and various other things.

So the net is actually a small performance win but lesser than if there was no free threading. That said, many of the techniques he identified were immediately incorporated into CPython and so I would expect benchmarks to show some regression as compared with the single threaded interpreter of the previous revision.

nhumrich 1 year ago | |

Irrelevant, because even if there was, you would use the normal GIL python for it.

discreteevent 1 year ago |

I remember back around 2007 all the anxious blog posts about the free lunch (Moore's law) being over. Parallelism was mandatory now. We were going to need exotic solutions like software transactional memory to get out of the crisis (and we could certainly forget about object orientation).

Meanwhile what takes the crown? - Single threaded python.

(Well, ok Rust looks like it's taking first place where you really need the speed and it does help parallelism without requiring absolute purity)

jeremycarter 1 year ago | |

Takes what crown? Python is horrifically slow even single threaded. It's by far the slowest and most energy inefficient of the major choices available today.

pansa2 1 year ago | | |

Popularity

farhanhubble 1 year ago |

It remains to be seen how many subtle bugs are now introduced by programmers who have never dealt with real multithreading.

jmward01 1 year ago |

I know, I know, 'not every story needs to be about ML' but.... I can only imagine how unlocking the GIL will change the nature of ML training and inference. There is so much waste and complexity in passing memory around and coordinating processes. I know that libraries have made it (somewhat) easier and more efficient but I can't wait to see what can be done with things like pytorch when optimized for this.

ipsum2 1 year ago | |

It'll mostly help for debugging and lowering RAM (not VRAM) usage. Otherwise it won't impact ML much.

jmward01 1 year ago | | |

Pretty universally I have seen performance improvements in code when complexity is reduced and this could drop complexity considerably. I wouldn't be surprised to see a double digit percent improvement in tokens per sec when an optimized pytorch eventually comes out with this. There may even be hidden gains on GPU memory usage that come out of this as people clean up code and start implementing better tricks because of it.

imtringued 1 year ago | | |

Yeah, one of the dumbest things about Dataloaders running in a different process is that you are logging into the void.

veber-alex 1 year ago | |

huh?

Any python library that cares about performance is written in C/C++/Rust/Fortran and only provides a python interface.

ML will have 0 benefit from this.

jmward01 1 year ago | | |

Have you done any multi-gpu training? Generally every GPU gets a process. Coordinating between them and passing around data between them is complex and can easily have performance issues since normal communication between python processes requires some sort of serialization/de-serialization of objects (there are many * here when it comes to GPU training). This has the potential to simplify all of that and remove a lot of inter-process communication which is just pure overhead.
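
A minimal sketch of the serialization overhead being described, using `pickle` as a stand-in for whatever a given framework actually uses between processes. The helper names are made up.

```python
import pickle
import threading


def roundtrip(obj):
    # Roughly what crossing a process boundary involves:
    # serialize on one side, rebuild a separate copy on the other.
    return pickle.loads(pickle.dumps(obj))


def seen_by_thread(obj):
    # A thread in the same process just receives a reference.
    seen = []
    worker = threading.Thread(target=lambda: seen.append(obj))
    worker.start()
    worker.join()
    return seen[0]
```

The round-tripped value is equal to the original but is a different object, so its memory and the copy time are paid per transfer. The thread sees the very same object, which is also why shared state then needs synchronization.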

KeplerBoy 1 year ago | | |

Of course ML will benefit from it. Soon you will be able to run your dataloaders/data preprocessing in different threads which will not starve your GPUs of data.

bdd8f1df777b 1 year ago | | |

If you have done ML with PyTorch or Tensorflow you will know how much multithreading can improve data loading performance. Currently multiprocessing provides the necessary parallelization of data loading but it is painful and riddled with bugs.

westurner 1 year ago |

Will there be an effort to encourage devs to add support for free-threaded Python like for Python 3 [1] and for Wheels [2]?

Is there a cibuildwheel / CI check for free-threaded Python support?

Is there already a reason not to have Platform compatibility tags for free-threaded cpython support? https://packaging.python.org/en/latest/specifications/platfo...

Is there a hame - a hashtaggable name - for this feature to help devs find resources to help add support?

Can an LLM almost port in support for free-threading in Python, and how should we expect the tests to be insufficient?

"Porting Extension Modules to Support Free-Threading" https://py-free-threading.github.io/porting/

[1] "Python 3 "Wall of Shame" Becomes "Wall of Superpowers" Today" https://news.ycombinator.com/item?id=4907755

[2] https://pythonwheels.com/

(Edit)

Compatibility status tracking: https://py-free-threading.github.io/tracking/

westurner 1 year ago | |

(2021) https://news.ycombinator.com/item?id=29005573#29009072 :

python-feedstock / recipe / meta.yml: https://github.com/conda-forge/python-feedstock/blob/master/...

pypy-meta-feedstock can be installed in the same env as python-feedstock; https://github.com/conda-forge/pypy-meta-feedstock/blob/main...

westurner 1 year ago | |

Install commands from https://py-free-threading.github.io/installing_cpython/ :

  sudo dnf install python3.13-freethreading

  sudo add-apt-repository ppa:deadsnakes
  sudo apt-get update
  sudo apt-get install python3.13-nogil

  conda create -n nogil -c defaults -c ad-testing/label/py313_nogil python=3.13

  mamba create -n nogil -c defaults -c ad-testing/label/py313_nogil python=3.13

TODO: conda-forge ?, pixi

elijahbenizzy 1 year ago |

I'm really curious to see how this will work with async. There's a natural barrier (I/O versus CPU-bound code), which isn't always a perfect distinction.

I'd love to see a more fluid model between the two -- E.G. if I'm doing a "gather" on CPU-bound coroutines, I'm curious if there's something that can be smart enough to JIT between async and multithreaded implementations.

"Oh, the first few tasks were entirely CPU-bound? Cool, let's launch another thread. Oh, the first few threads were I/O-bound? Cool, let's use in-thread coroutines".

Probably not feasible for a myriad of reasons, but even a more fluid programming model could be really cool (similar interfaces with a quick swap between?).
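
A small sketch of how close the existing API already gets to "similar interfaces with a quick swap": `asyncio.to_thread` (Python 3.9+) wraps a plain function so it can sit in the same `gather` as real coroutines. The choice of which work goes to a thread is still manual here, and the function names are illustrative.

```python
import asyncio


def cpu_bound(n):
    # Plain synchronous function; it runs in a worker thread below.
    return sum(i * i for i in range(n))


async def io_bound(delay):
    # Cooperative coroutine; it stays on the event loop thread.
    await asyncio.sleep(delay)
    return "io"


async def main():
    # One gather call mixes an in-loop coroutine with thread-offloaded work.
    return await asyncio.gather(
        io_bound(0.01),
        asyncio.to_thread(cpu_bound, 10),
        asyncio.to_thread(cpu_bound, 100),
    )
```

On a GIL build the offloaded calls mostly keep the event loop responsive; on a free-threaded build they could also run in parallel with each other.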

grandimam 1 year ago |

How is the no-gil performance compared to other languages like - javascript (nodejs), go, rust, and even java? If it's bearable then I believe there is enormous value that could be generated instead of spending time porting to other languages.

pansa2 1 year ago | |

No-GIL Python is still interpreted - single-threaded performance is slower than standard Python, which is in turn much slower than the languages you mentioned.

Maybe if you’ve got an embarrassingly parallel problem, and dozen(s) of cores to spare, you can match the performance of a single-threaded JIT/AOT compiled program.

vulnbludog 1 year ago | | |

How do companies like Instagram/OpenAI scale with a majority python codebase? Like I just kick it on HN idk much about computers or coding (think high school CS) why wouldn’t they migrate can someone explain like I’m five

thebigspacefuck 1 year ago | |

Here’s a benchmark https://github.com/lip234/python_313_benchmark

It’s much worse in everything but a threaded test

VagabundoP 1 year ago |

Highly recommend the core.py podcast if you're interested in the background, there are a few episodes that focus on the GILectomy:

-Episode 2: Removing the GIL[1]

-Episode 12: A Legit Episode[2]

[1]https://www.youtube.com/watch?v=jHOtyx3PSJQ&list=PLShJCpYUN3...

[2]https://www.youtube.com/watch?v=IGYxMsHw9iw&list=PLShJCpYUN3...

vldmrs 1 year ago |

Great news ! It would be interesting to see performance comparison for IO-bound tasks like http requests between single-threaded asyncio code and multi-threaded asyncio

pansa2 1 year ago |

PEP 703 explains that with the GIL removed, operations on lists such as `append` remain thread-safe because of the addition of per-list locks.

What about simple operations like incrementing an integer? IIRC this is currently thread-safe because the GIL guarantees each bytecode instruction is executed atomically.
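
A minimal sketch of the `append` guarantee described above: several threads append to one shared list and no items are lost. The names are illustrative.

```python
import threading


def append_many(shared, count):
    for i in range(count):
        shared.append(i)  # a single list operation per call


def run_appends(threads=4, count=10_000):
    shared = []
    workers = [
        threading.Thread(target=append_many, args=(shared, count))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return len(shared)
```

The final length is `threads * count`. The order of the items is not deterministic, only their number.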

pansa2 1 year ago | |

Ah, `i += 1` isn’t currently thread-safe because Python does (LOAD, +=, STORE) as 3 separate bytecode instructions.

I guess the only things that are a single instruction are some modifications to mutable objects, and those are already heavyweight enough that it’s OK to add a per-object lock.
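
A sketch that makes the separate instructions visible with `dis` and shows the usual fix, an explicit lock around the read-modify-write. Exact opcode names vary between CPython versions, so the code only collects them rather than assuming a fixed sequence.

```python
import dis
import threading

counter = 0
lock = threading.Lock()


def unsafe_increment():
    global counter
    counter += 1  # compiles to a load, an add, and a store


def opnames(func):
    # List the bytecode instruction names for inspection.
    return [ins.opname for ins in dis.get_instructions(func)]


def locked_increment(times):
    global counter
    for _ in range(times):
        with lock:  # makes the whole read-modify-write atomic
            counter += 1


def run_counter(threads=4, times=10_000):
    global counter
    counter = 0
    workers = [
        threading.Thread(target=locked_increment, args=(times,))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter
```

`opnames(unsafe_increment)` shows a `LOAD_GLOBAL` followed later by a `STORE_GLOBAL`; another thread can run between them, with or without the GIL. The locked version always ends at `threads * times`.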

jillesvangurp 1 year ago | | |

That sounds like the kind of thing that a JIT compiler should be optimizing. The problem with threading isn't stuff like this but people doing a lot of silly things like having global mutable state or stateful objects that are being passed around a lot.

I've done quite a bit of stuff with Java and Kotlin in the past quarter century and it's interesting to see how much things have evolved. Early on there were a lot of people doing silly things with threads and overusing the, at the time, not so great language features for that. But a lot of that stuff has been replaced by better primitives and libraries.

If you look at Kotlin these days, there's very little of that silliness going on. It has no synchronized keyword. Or a volatile keyword, like Java has. But it does have co-routines and co-routine scopes. And some of those scopes may be backed by thread pools (or virtual thread pools on recent JVMs).

Now that python has async, it's probably a good idea to start thinking about some way to add structured concurrency similar to that on top of that. So, you have async stuff and some of that async stuff might happen on different threads. It's a good mental model for dealing with concurrency and parallelism. There's no need to repeat two decades of mistakes that happened in the Java world; you can fast forward to the good stuff without doing that.

gnatolf 1 year ago |

Good to hear. The authors are touching on the journey it is to make Cython continue to work. I wonder how hard it'll be to continue to provide bdist packages, or within what timeframe, if at all, Cython can transparently ensure correctness for a no-gil build. Anyone got any insights?

codethief 1 year ago |

Yesterday someone presented preliminary benchmarks here at EuroPython 2024, comparing no-GIL to sub-interpreters and to multiprocessing. Upshot: This gon' be good!

earthnail 1 year ago |

Oh how much this would simplify torch.DataLoader (and its equivalents)…

Really excited about this.

throwaway5752 1 year ago |

GVR, you are sorely missed, though I hope you are enjoying life.

nas 1 year ago |

Very encouraging news!

OutOfHere 1 year ago |

It has been ready for a few months now, at least since 3.13.0 beta 1 which released on 2024-05-08, although alpha versions had it working too. I don't know why this is news now.

With it, the single-threaded case is slower.

TylerE 1 year ago | |

FTA: "Yesterday, py-free-threading.github.io launched! It's both a resource with documentation around adding support for free-threaded Python, and a status tracker for the rollout across open source projects in the Python ecosystem."

OutOfHere 1 year ago | | |

Before the article came the misleading title: "Free-threaded CPython is ready to experiment with".

The link should have been to https://py-free-threading.github.io/tracking/

JBorrow 1 year ago | |

This release coincides with the SciPy 2024 conference and a number of other things. I would suggest reading the article to learn more.

OutOfHere 1 year ago | | |

> This release

What release? The last release of CPython was 3.13.0b3 on 2024-06-27.

SciPy is irrelevant to the title.

anacrolix 1 year ago |

Was ready for this 15 years ago when I loved Python and regularly contributed. At the time, nobody wanted to do it and I got bored and went to Go.