Threading in Python

55 points by nry 13 years ago | 29 comments

exDM69 13 years ago |

> Generally, you should only use threads if the following is true: - Sharing memory between threads is not an issue.

Here's the problem. Threads are really useful only if you can share memory between threads. If you can't share memory, you're usually better off using many processes.

Threads in Python (i.e., CPython) can still be useful for I/O multiplexing or for executing native code in background worker threads via FFI, releasing the GIL while doing so. For I/O multiplexing, there are better options than Python threads (the select/poll/kqueue/epoll system calls and frameworks like Twisted that use them).
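As a rough illustration of the select/poll/epoll style of multiplexing mentioned above, here is a minimal single-threaded sketch using the standard library's selectors module (Python 3). The names read_ready and demo are made up for this example, and a local socketpair stands in for real network connections.

```python
import selectors
import socket


def read_ready(socks, timeout=1.0):
    """Return {index: bytes} for each socket that is readable within timeout."""
    sel = selectors.DefaultSelector()  # picks epoll/kqueue/poll/select as available
    for index, sock in enumerate(socks):
        sel.register(sock, selectors.EVENT_READ, data=index)
    received = {}
    try:
        # One call waits on all sockets at once; no threads are involved.
        for key, _events in sel.select(timeout=timeout):
            received[key.data] = key.fileobj.recv(4096)
    finally:
        sel.close()
    return received


def demo():
    """Hypothetical demo: three local socket pairs, two of which get data."""
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        pairs[0][1].sendall(b"hello")
        pairs[2][1].sendall(b"world")
        return read_ready([reader for reader, _writer in pairs])
    finally:
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

Only the sockets that actually have data show up in the result; the idle one costs nothing beyond its registration.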

In most applications, threads probably should not be used in CPython/CRuby code as they provide little performance gain compared to the complexity and overhead they add.

prodigal_erik 13 years ago | |

http://en.wikipedia.org/wiki/Communicating_sequential_proces... works well passing immutable object graphs back and forth. Passing by value (copying everything) has a cost, and serializing everything down to byte streams across a pipe is even more expensive, especially if you don't know which portions of the object graph will and won't be needed for a given call (an optimization which imposes tight coupling on details about the code you're calling). If I'm not calling untrusted code, and not planning to divide the work across many machines, I'd prefer to avoid needless process boundaries.

lifeisstillgood 13 years ago | |

Thank you. I would even go so far as to say that, except in simple cases (downloading 10,000 images goes much faster with 100 worker threads than serially, which is, I think, the origin of "don't share memory"), you should not use Python, or any other similar language, for this.

Got parallel needs at your core? Look at Erlang or Haskell. If parallel or distributed work is mission critical, go with a language that has such things at its very soul. Python is a great language, but it is being enthusiastically bent to do things it is not top of the class for.

Want to handle more concurrent connections per Python web server? If WSGI in Gunicorn is not enough, stop trying and use a load balancer to spread work between more servers.

obviouslygreen 13 years ago | | |

You are technically correct. However, there's at least one invalid assumption at the core of this, which is that people who need to do things that fall into the there's-a-better-language-for-this category always have the opportunity to learn and implement a more appropriate tool.

That opportunity is almost never available on commercial projects. Extremely few companies and clients will be perfectly fine with "yes, I'm a Python expert, but this would be best done in Erlang; I will need an extra week to research, learn, and implement this on top of the month the project would otherwise take." In most situations you either do it the way you know how to do it, eat the extra time (not practical in most cases), or you lose the contract/job.

Of course this is specific to client work, but I think most of us are likely doing that or something similarly limiting for at least half our waking hours, making it fairly relevant when considering ideas like "using tool X for job A is not a good idea when tool Y exists." It's correct but ignores too many practical situations to be very useful advice.

seanp2k2 13 years ago |

Some developers, when confronted with a problem, think "I know, I'll use threads" have two Now problems they.

http://regex.info/blog/2006-09-15/247

rozap 13 years ago | |

Clever. But just like with regexps, the issue arises when a programmer applies them to every situation he/she encounters. Obviously they need not be avoided like the plague, but rather used when the situation calls for them.

ramidarigaz 13 years ago |

Where does the GIL factor into this? I thought using threads in Python gives basically no gains unless all the heavy work is being done outside the interpreter. I've always gone with the multiprocessing module instead.

bvdbijl 13 years ago | |

Python threads are only useful if you use C modules that handle the GIL correctly, or for I/O-bound work with the standard library. They give no speed boost for pure Python code.

DeepDuh 13 years ago | | |

Here's one more use case, which is probably very common: using Python as glue to call other programs that can run asynchronously. As an example, I once used Python to implement a parallelized genetic algorithm where the evaluation function was a MATLAB program. It was quite a breeze to spawn hundreds of such threads over ssh using one PC as the controller - if only I hadn't shocked the local sysadmins ;-).

tantalor 13 years ago | | |

If not for performance, why would anybody use it? Scalability?

andrewguenther 13 years ago | |

This stems back to parallelism vs. concurrency. Yes, it is true that your Python threads will not run in parallel but they will run concurrently. If you're curious about parallelism vs. concurrency, here is a great talk by Rob Pike on the subject: http://vimeo.com/49718712
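A small sketch of what "concurrent but not parallel" buys you in CPython, assuming Python 3. Here time.sleep stands in for a blocking I/O call (it releases the GIL while waiting), and the function names are invented for the example.

```python
import threading
import time


def wait_on_io(seconds):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(seconds)


def run_concurrently(n_threads=5, seconds=0.2):
    """Start n_threads waiting threads and return the total wall-clock time."""
    threads = [
        threading.Thread(target=wait_on_io, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start
```

The waits overlap, so the total should be close to one wait rather than the sum of all of them. Swap the sleep for a pure-Python busy loop and that overlap largely disappears, which is the GIL point made elsewhere in the thread.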

eidorb 13 years ago |

I've implemented something similar using Eli Bendersky's example [1] as a guide. His example adds a stop event. I pass my worker thread lambdas, so that arbitrary tasks can be carried out.
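Roughly, the pattern looks like the sketch below. This is not Eli Bendersky's code, just a minimal Python 3 version of the same idea (a worker thread, a task queue, and a stop event); the Worker and run_tasks names are made up, and error handling for failing tasks is left out.

```python
import queue
import threading


class Worker(threading.Thread):
    """Pulls zero-argument callables off a queue until asked to stop."""

    def __init__(self, tasks, results):
        super().__init__(daemon=True)
        self.tasks = tasks
        self.results = results
        self.stop_event = threading.Event()

    def run(self):
        while not self.stop_event.is_set():
            try:
                # Short timeout so the stop event is re-checked regularly.
                task = self.tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                self.results.put(task())
            finally:
                self.tasks.task_done()

    def stop(self, timeout=None):
        self.stop_event.set()
        self.join(timeout)


def run_tasks(callables):
    """Run the callables on one worker thread and return results in order."""
    tasks, results = queue.Queue(), queue.Queue()
    worker = Worker(tasks, results)
    worker.start()
    for func in callables:
        tasks.put(func)
    tasks.join()  # block until every queued task has been processed
    worker.stop()
    return [results.get() for _ in callables]
```

Usage is something like run_tasks([lambda: 1 + 1, lambda: "a".upper()]); with a single worker, results come back in submission order.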

[1] http://eli.thegreenplace.net/2011/12/27/python-threads-commu...

ctoth 13 years ago |

I find concurrent.futures to be a much nicer way of managing this sort of thing in Python. It's in 3.2 I believe, and there's a backport for 2 at https://pypi.python.org/pypi/futures/2.1.3. You can set up as many executors as you like, each with a given number of threads to use for its thread pool.
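A minimal sketch of what that looks like with ThreadPoolExecutor on Python 3. To keep it self-contained there is no real network access: fake_download is a hypothetical stand-in that just sleeps, and the URLs in the usage note are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.05):
    """Hypothetical stand-in for a blocking download; returns (url, size)."""
    time.sleep(delay)  # simulated network wait
    return url, len(url)


def download_all(urls, max_workers=8):
    """Run fake_download over urls on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input iterable.
        return list(pool.map(fake_download, urls))
```

For example, download_all(["http://example.com/a.png", "http://example.com/b.png"]) returns one (url, size) tuple per input. The with block waits for all workers to finish before returning.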

scott_s 13 years ago |

> You are not looking for the best optimized performance since threads share memory within a process.

That is a non-sequitur to me. The first half I'm on board with: generally, you use threads to improve performance, but because of the GIL in Python, you may not get the parallelism you want. If you're calling into libraries that don't hold the GIL, then great, but that means you have to be very aware of what's going on below you.

The second half does not follow, though. Typically, that threads share the same address space is the entire reason we use threads over processes. And the reason comes from improved performance: if the threads share an address space, you don't need to copy the data. Copying data is expensive. (It also means you're susceptible to a whole host of synchronization bugs.)

pekk 13 years ago | |

Sharing all data between threads means you're susceptible to a whole host of synchronization bugs (in the sense of thread synchronization, not data synchronization). Unless you use synchronization primitives like locks to protect the shared data, which can also easily kill concurrency. It's a trade-off.
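To make the lock side of that trade-off concrete, here is a small Python 3 sketch. Counter and hammer are invented names; the lock makes the read-modify-write on the shared value safe, at the price of serializing every increment.

```python
import threading


class Counter:
    """Shared counter whose increments are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this block.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads=8, n_increments=10_000):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place the final value is always n_threads * n_increments. The cost is that all threads queue up on the same lock, which is exactly the "can also easily kill concurrency" part.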

If avoiding copying is not a top problem, then you may be wasting your time; there's nothing wrong with using abstractions more appropriate to your environment.

If the program scales out, it should be less important to micro-optimize inside each process because it's so much cheaper just to use another core or another node.

It's getting boring to hear all discussions of concurrency reduced to threads, and threads reduced to the GIL in CPython. It's really not that simple.

scott_s 13 years ago | | |

Yes, it's a trade-off, which is why I brought it up.

But my point here is that the statement the author made, as far as I'm able to understand it, makes no sense. That is, I think he tried to discuss these issues, but I don't think he understands them well enough to do so. I think you and I are in agreement, unless you are saying that what the author stated does make sense.

stefantalpalaru 13 years ago |

This is Python so threads don't run concurrently because of the GIL (no, using C/C++ code where you can release it is not in the scope of this article). Save yourself the trouble and use multiprocessing.

pekk 13 years ago | |

It's oversimplifying to say "This is Python so threads don't run concurrently because of the GIL".

This is not a Python-language issue, it is an implementation-specific issue. Not all implementations of Python have the GIL.

In reality, threads do run concurrently. In CPython (itself written in C), with its famous GIL, it is normal and realistic to do I/O and heavy computation in C code that releases the GIL, enabling threads to work concurrently. There's no reason this information shouldn't be part of discussions on threads in Python.
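One standard-library example of such C code, as far as I know: CPython's hashlib releases the GIL while hashing data larger than 2047 bytes, so threads hashing big buffers can overlap. A minimal Python 3 sketch, with made-up function names, and with no claim about how much speedup you will see on a given machine:

```python
import hashlib
import threading


def sha256_hex(data):
    """Hash one buffer; for large inputs the C code runs without the GIL."""
    return hashlib.sha256(data).hexdigest()


def hash_in_threads(buffers):
    """Hash each buffer on its own thread; return digests in input order."""
    digests = [None] * len(buffers)

    def work(index, buf):
        # Each thread writes to its own slot, so no lock is needed here.
        digests[index] = sha256_hex(buf)

    threads = [
        threading.Thread(target=work, args=(i, buf))
        for i, buf in enumerate(buffers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return digests
```

The results are identical to hashing serially; the only difference is that the heavy lifting can proceed in several threads at once while the GIL is released.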

That doesn't mean threads are great for everything, but the severity of the case is easily and frequently overstated.

mctx 13 years ago |

    for i in xrange(len(item_list)):

Could be more clearly written as:

    for _ in item_list:

laurenceputra 13 years ago |

GIL? I've found that a single thread is slower than not running a thread at all.