Parallel Programming with Python

Parallel Programming with Python (chryswoods.com)

218 points by uaaa 7 years ago | 141 comments

In response to the multiple comments here complaining that multithreading is impossible in Python without using multiple processes, because of the GIL (global interpreter lock):

This is just not true, because C extension modules (i.e. libraries written to be used from Python but whose implementations are written in C) can release the global interpreter lock while inside a function call. Examples of these include numpy, scipy, pandas and tensorflow, and there are many others. Most Python processes that are doing CPU-intensive computation spend relatively little time actually executing Python, and are really just coordinating the C libraries (e.g. "multiply these two matrices together").
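To make that concrete without any third-party dependency, here is a minimal sketch using hashlib from the standard library, which is implemented in C and documented to release the GIL while hashing large buffers. The `digest` helper and the buffer contents are made up for illustration; whether the threaded version is actually faster depends on how many cores you have.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block: bytes) -> str:
    # The hashing loop runs in C; for large inputs hashlib releases the GIL,
    # so several of these calls can run on separate cores at the same time.
    return hashlib.sha256(block).hexdigest()


# Four 4 MiB buffers of made-up data (illustrative only).
blocks = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]

# Plain threads in a single process; no multiprocessing involved.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, blocks))

# Same work done one call at a time, for comparison.
sequential = [digest(block) for block in blocks]
```

The Python layer here only hands buffers to the C code and collects the results, which is the "coordinating" role described above.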

The GIL is also released during IO operations like writing to a file or waiting for a subprocess to finish or send data down its pipe. So in most practical situations where you have a performance-critical application written in Python (or more precisely, the top layer is written in Python), multithreading works fine.
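A rough sketch of the I/O case, assuming `time.sleep` as a stand-in for a blocking read or a wait on a subprocess (the `wait_for_io` name is hypothetical). Five waits of 0.2 seconds each overlap, so the total should come out close to 0.2 seconds rather than 1 second.

```python
import threading
import time


def wait_for_io(delay: float) -> None:
    # time.sleep stands in for a blocking I/O call; the GIL is released
    # while the thread is waiting, so other threads can run.
    time.sleep(delay)


start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io, args=(0.2,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start  # wall-clock time for all five waits
```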

If you are doing CPU intensive work in pure Python and you find things are unacceptably slow, then the simplest way to boost performance (and probably simplify your code) is to rewrite chunks of your code in terms of these C extension modules. If you can't do this for some reason then you will have to throw in the Python towel and re-write some or all of your code in a natively compiled language (if it's just a small fraction of your code then Cython is a good option). But this is the best course of action regardless of the threads situation, because pure Python code runs orders of magnitude slower than native code.

chrisseaton 7 years ago | |

> complaining that multithreading is impossible in Python without using multiple processes, because of the GIL ... this is not true

I think some people's opinion is that if you're writing in C then you're not really writing a Python program, so they think it is impossible in Python. Which seems a reasonable point to make to me.

Your argument is that Python is fine for multithreading... as long as you actually write C instead of Python.

quietbritishjim 7 years ago | | |

Let's say I write this:

    def add_and_mult(a, b, c):
        return a + b @ c

If a, b and c are numpy arrays then this function releases the GIL and so will run in multiple threads with no further work and with little overhead (if a, b and c are large). I would describe this as a function "written in Python", even though numpy uses C under the hood. It seems you describe this snippet as being "written in C instead of Python"; I find that odd, but OK.

But, if I understand you right, you are also suggesting that the other commenters here that talk about the GIL would also describe this as "written in C". They realise that this releases the GIL and will run on multiple threads, but the point of their comments is that a proper pure Python function wouldn't. I disagree. I think that most others would describe this function as "written in Python", and when they say that functions written in Python can't be parallelised they do so because they don't realise that functions like this can be.
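For illustration, the add_and_mult function above can be handed to an ordinary thread pool as is. This is only a sketch: it assumes numpy is installed, and the matrix size, job count and random inputs are arbitrary choices of mine. How much the threads actually overlap depends on the numpy build and the BLAS library behind it.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def add_and_mult(a, b, c):
    return a + b @ c


# Made-up inputs: four independent jobs of three 300x300 matrices each.
rng = np.random.default_rng(0)
n = 300
jobs = [tuple(rng.standard_normal((n, n)) for _ in range(3)) for _ in range(4)]

# Each call spends nearly all of its time inside numpy's C code.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(lambda args: add_and_mult(*args), jobs))

# Reference results computed one job at a time.
sequential = [add_and_mult(*args) for args in jobs]
```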

ubernostrum 7 years ago | | |

A whole lot depends on what exactly it is that someone wants to get out of using threading.

The GIL means that a single Python interpreter process can execute at most one Python thread at a time, regardless of the number of CPUs or CPU cores available on the host machine. The GIL also introduces overhead which affects the performance of code using Python threads; how much you're affected by it will vary depending on what your code is doing. I/O-bound code tends to be much less affected, while CPU-bound code is much more affected.

All of this dates back to design decisions made in the 1990s which presumably seemed reasonable for the time: most people using Python were running it on machines with one CPU which had one core, so being able to take advantage of multiple CPUs/cores to schedule multiple threads to execute simultaneously was not necessarily a high priority. And most people who wanted threading wanted it to use in things like network daemons, which are primarily I/O-bound. Hence, the GIL and the set of tradeoffs it makes. Now, of course, we carry multi-core computers in our pockets and people routinely use Python for CPU-bound data science tasks. Hindsight is great at spotting that, but hindsight doesn't give us a time machine to go back and change the decisions.

Anyway. This is not the same thing as "multithreading is impossible". This is the same thing as "multithreading has some limitations, and for some cases the easiest way to work around them will be to use Python's C extension API". Which is what the parent comment seemed to be saying.

Rotareti 7 years ago | |

Does anyone know how well Python and Rust team up compared to Python and C in practice?

ikornaselur 7 years ago | | |

I've yet to play with it beyond just experimenting a little bit, but it seems it works very well.

I've mainly been looking at these resources:

https://github.com/rochacbruno/rust-python-example

https://github.com/PyO3/pyo3

Though I have not done rust <-> python in real practice

jwandborg 7 years ago | | |

Subjectively I'm really impressed by PyO3.

If you care about speed, Rust is supposedly as fast as C. The Rust ecosystem also has a lot of supposedly safe(!) tools for parallelism.

gpm 7 years ago | | |

I've done it once, converting about 15 lines of python to rust. It was completely painless and resulted in a large speedup (changed a hotspot that was taking approximately 90% of execution time in a scientific simulation to approximately 0%).

The type system and expressive macros seem like a big win over C to me.

charlescearl 7 years ago | |

Also a nice short talk by Caleb Hattingh https://www.youtube.com/watch?v=NfnMJMkhDoQ

quietbritishjim 7 years ago | | |

> talk [about Cython]

That was interesting, thanks!

I really wish he had shown his numpy code. He said at 13:46 "Numpy actually doesn't help you at all because the calculation is still getting done at the Python level". But his function could be vectorised with numpy using functions like numpy.maximum or numpy.where, in which case the main loop will be in C not Python. I can't figure out from what he said whether his numpy code did that or not.

But either way, it's interesting that in this case the numpy version is arguably harder to write than the Cython version: rather than just adding a few bits of metadata (the types), you have to permute the whole control flow. If there's only a small amount of code you want to convert, I would still say it's better to use numpy though (if it actually is fast enough), because getting the build tools onto your computer for Cython can be a pain. And for some matrix computations there are speed improvements beyond the fact that it's implemented in C, e.g. matrix multiplication is faster than the naive O(n^3) version.

woolvalley 7 years ago | |

Because we want first class python multithreading, like many other languages have. If we have to drop down into C, might as well use another language with first class multi-threading like java, kotlin, golang or swift and avoid all the other issues that come with slow GIL languages.

TheCondor 7 years ago | |

The thread workers pool circumvent the GIL if you carefully follow some rules. I think the arguments and results need to be pickleable.

quietbritishjim 7 years ago | | |

I think you're thinking of the multiprocessing module, which uses separate processes to bypass the GIL. That's why the arguments and results need to be pickleable: pickle is a serialisation protocol, so it allows you to communicate the contents of objects between different processes. If you use threads within a single process, you don't need to pickle the objects; you just pass the object directly.
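A small sketch of the difference, using a hypothetical `Job` class that holds a `threading.Lock` (locks cannot be pickled). A thread pool receives the very same object, while `pickle.dumps`, which multiprocessing relies on to ship arguments to another process, rejects it.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class Job:
    """Hypothetical work item holding something pickle cannot serialise."""

    def __init__(self):
        self.lock = threading.Lock()
        self.values = [1, 2, 3]


def total(job):
    # Runs in a worker thread and sees the caller's object itself, not a copy.
    with job.lock:
        return sum(job.values), id(job)


job = Job()
with ThreadPoolExecutor(max_workers=1) as pool:
    result, seen_id = pool.submit(total, job).result()

# Sending the same object to another process would need this step to succeed.
try:
    pickle.dumps(job)
    picklable = True
except TypeError:
    picklable = False
```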

walterstucco 7 years ago | |

that's not writing Python though

elcombato 7 years ago |

> (note that you must be using Python 2 for this workshop and not using Python 3. Complete this workshop using Python 2, then read about the small changes if you are interested in using Python 3)

Why using legacy Python for this?

ggm 7 years ago | |

why not re-write the workshop for python3 and require python2 users to wear the pain downgrade brings?

brennebeck 7 years ago | | |

Because python2 is a deprecated language that will EOL?

kjeetgill 7 years ago | |

I'm not sure it's fair to call it legacy just yet. Most linux distributions (minus Arch I think) still use 2.7 as the default.

I get EOL/deprecation is here but let's not jump the gun to legacy just yet. I just see more 2 than 3 @ Day Job.

ilovetux 7 years ago |

I find it strange that nobody ever seems to mention Python's concurrent.futures module [0], which is new in Python 3.2. I think asyncio got a lot of attention when it came out in Python 3.4 and concurrent.futures took a back seat. This article also doesn't mention the module in its Python 2 and 3 differences link.

asyncio is a good library for asynchronous I/O but concurrent.futures gives us some pretty nifty tooling which makes concurrent programming (with ThreadPoolExecutor) and parallel programming (with ProcessPoolExecutor) pretty easy to get right. The Future class is a pretty elegant solution for continuing execution while a background task is being executed.

[0] https://docs.python.org/3/library/concurrent.futures.html
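A minimal sketch of the Future workflow, with a made-up `word_count` task and made-up documents. `submit` returns a Future immediately, the calling thread carries on, and `as_completed` yields each Future as its task finishes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def word_count(text: str) -> int:
    # Stand-in for a slower background task.
    return len(text.split())


# Illustrative inputs only.
documents = {"a": "one two three", "b": "four five", "c": "six"}

counts = {}
with ThreadPoolExecutor(max_workers=3) as pool:
    # Map each Future back to the name of the document it is processing.
    futures = {pool.submit(word_count, text): name
               for name, text in documents.items()}
    # The main thread is free to do other work here while tasks run.
    for future in as_completed(futures):
        counts[futures[future]] = future.result()
```

ProcessPoolExecutor exposes the same submit/result interface, with the extra requirements that the function and its arguments are picklable and that the script's entry point is guarded by `if __name__ == "__main__":`.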

ZeroCool2u 7 years ago | |

ThreadPoolExecutor and ProcessPoolExecutor were exactly what I was waiting for someone to mention. I was doing some Python as a systems architect at my previous position and now as a full time data scientist, my life has pretty much been consumed by Python. Unsurprisingly, a lot of my initial work is retrieving and cleaning very large volumes of data, the former usually being I/O bound and the latter being CPU bound, and frankly a lot of my team and I immediately default to using ThreadPoolExecutor and ProcessPoolExecutor respectively, because of how simple and performant they are. Perhaps asyncio is more familiar terminology to people coming from Web Dev, so that's why they're gravitating towards it, but there are few times when I find myself needing that particular tooling outside of Web Dev anyways.

mpweiher 7 years ago |

"...take advantage of the processing power of multicore processors"

Step 1: stop using Python.

"You can have a second core when you know how to use one"

Now don't get me wrong, Python is a perfectly fine language for lots of things, but not for taking optimal advantage of the CPU.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Relative performance compared to C is somewhere between one and two orders of magnitude slower. Considering how much harder and more error-prone multi-core is, maybe first try a fast sequential solution.

TickleSteve 7 years ago | |

You're absolutely right (but you're probably gonna get some downvotes for saying that).

The ratio between the most-performant parallel framework and the least on Python will be a factor of (guessing) 1.5.

The ratio between a CPU-bound algorithm written in C and one in Python will be of the order of 10000 (again guessing as it's application-dependent).

Where is your time most profitably spent?

devxpy 7 years ago | | |

Wild Guess: You're not really a python programmer, are you?

Just curious...

auggierose 7 years ago | |

Yeah. Recently switched some Blender Python algorithms I wrote to Swift/Metal, and the speedup was somewhere between 1000 and 1000000 depending on the algorithm.

stevesimmons 7 years ago | | |

Speedups of that magnitude suggest the original Python approach was particularly inefficient...

kilon 7 years ago | | |

Who would have guessed that compiled, static, non-dynamic, hardware accelerated code would be a ton more performant than runtime, highly dynamic, garbage collected and very powerful code that is not hardware accelerated.

igouy 7 years ago | |

> Considering how much harder and more error-prone multi-core is, maybe first try a fast sequential solution.

Most of the Python programs referenced on that benchmarks game webpage are in fact using multi-core?

eternauta3k 7 years ago | |

Why? I can run my Python plotting script with multiprocessing in one of the blade servers at work and get the job done quickly. All without translating a big bunch of code to C.

ram_rar 7 years ago |

I love Python. But it's seriously incapable of doing non-trivial concurrent tasks. The multiprocessing module doesn't count. I hope the Python core devs take some inspiration from golang for developing the right abstractions for concurrency.

andbberger 7 years ago |

IMO ray[1] is the greatest thing to happen in python parallelism since the invention of sliced bread.

Also includes best currently available hyperparameter tuning framework!

[1] https://github.com/ray-project/ray

another-cuppa 7 years ago |

I think a lot of this complexity can be avoided by just writing single threaded python and using GNU parallel for running it on multiple cores. You can even trivially distribute the work across a cluster that way.

quiq 7 years ago | |

This is the approach I've taken, albeit at the "top level" of the program. Since I know I don't have to deal with Windows I much prefer simply piping to parallel instead of xargs, or calling make -j8, or similarly letting some shell wrapper handle it over dealing with the overhead inside of python, especially multiprocessing.

However, where I think having this stuff available inside of python is useful is that it's cross platform and consumable from "higher levels" of python. A library can do some mucky stuff internally to speed computation but still present a simple sync interface, all without external dependencies.

jillesvangurp 7 years ago |

Did they ever fix the global interpreter lock? Sort of a show stopper with doing stuff concurrently in Python. I've done a bit of batch processing using the multiprocessing module, which uses processes instead of threads. This works but it is a bit of a kludge if you are used to languages that support concurrency properly.

mwyau 7 years ago |

mpi4py should be included. It's a wrapper for the MPI library, which is the de facto standard for scientific computing: https://mpi4py.readthedocs.io/en/stable/

natvert 7 years ago |

Sweet, a guide! I always end up rolling my own thread pool / manager. I wish something like the parallel gem for Ruby existed in pyland...

guiriduro 7 years ago | |

If your tasks are fairly coarse-grained (take >50ms each), Celery [1] has existed for several years; it takes a bit of setting up but works well, and it's very flexible. If your needs are simple, don't forget that your common or garden webserver can parallelize workloads too (distribute web requests to workers on multiple cores); it depends mostly on your client code for fan-out, and redis has worked well for synchronization for me.

Nowadays you can also use serverless to parallelize coarse-grained workloads in the cloud.

[1] http://www.celeryproject.org/

magwa101 7 years ago |

Concurrency in python always ends up the reason to drop it and reimplement in Go. Also, the code ends up littered with type checks....

wenning 7 years ago |

I think using Python 3 multiprocessing and async is better for production.

gnufx 7 years ago |

Multi-core parallelism isn't so interesting for serious computation. You want to be able to use large distributed HPC systems, but Python doesn't seem to have the equivalent of https://pbdr.org for R.

kilon 7 years ago |

One more epic discussion on Python, where we have the unique opportunity to learn that using C libraries from Python is "cheating".

I could not agree more

It's definitely cheating to use C code, with the exception of most Python libraries that already are to a large extent nothing more than thin wrappers over existing C libraries, or the tiny fact that the most popular implementation of Python by far, CPython, is almost 50% implemented in the C language, including the standard library. The author even dared include "C" in the name of the implementation.

Those cheaters, becoming bolder and bolder every day.

Damn them !!!

goerz 7 years ago |

The GIL has considerable benefits: I don’t have to worry about whether Python functions are thread-safe. Thread-based parallelism is hard to get right, and given the number of workarounds, Python’s GIL is a total non-issue.

jashmatthews 7 years ago | |

> The GIL has considerable benefits: I don’t have to worry about whether Python functions are thread-safe.

Hold on, the GIL doesn't make Python automatically thread-safe!

You can still have classic data races as the VM can pause and resume two threads writing to the same variable.
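To sketch what that looks like in practice: `value += 1` is a separate load, add and store, and the interpreter may switch threads between those steps, so concurrent increments can be lost. The `Counter` class below is a hypothetical example that guards the update with a lock; how often the unguarded version actually loses updates varies by CPython version and timing.

```python
import threading


class Counter:
    """Hypothetical shared counter whose update is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, another thread could run between the read of
        # self.value and the write back, and one increment would be lost.
        with self._lock:
            self.value += 1


def worker(counter, n):
    for _ in range(n):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=worker, args=(counter, 10_000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place the final count is always 4 * 10,000; the GIL alone does not give that guarantee.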

goerz 7 years ago | | |

Can you elaborate on that? Is there a blog post somewhere that illustrates the problem you're talking about? I was under the assumption that Python interpreters run single-threaded.

devxpy 7 years ago | |

Small correction: It makes the _implementation_ thread-safe.

It also simplifies a lot of CPython code, making it a lot easier to maintain.

walterstucco 7 years ago |

> Parallel Programming with Python?

What about no?

Don't get me wrong, I don't like Python as a language, but it's a fine tool and many useful programs have been written with it.

But parallel programming? No, thanks.