Attempts to make Python fast

Attempts to make Python fast (sethops1.net)

135 points by Queue29 5 years ago | 162 comments

Python is fundamentally not designed to be faster because it leaks a lot of stuff that’s inherently slow that real world code depends on. That’s mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

The only way to speed it up would be to change the language.

overgard 5 years ago | |

I don't think that's really true. Those things are definite challenges, but PyPy is still significantly faster than CPython while (afaik) allowing that sort of stuff to go on. If you wanted C/Rust level performance then yeah, you need to redesign the language, but if you just want an interpreter that runs 5-10x faster than what they have now? Both doable and has been done.

Rochus 5 years ago | | |

PyPy is about four times faster than CPython (https://speed.pypy.org/) which is not that much compared to the effort. Node.js is about 13 times faster (https://benchmarksgame-team.pages.debian.net/benchmarksgame/...).

the_mitsuhiko 5 years ago | | |

> PyPy is still significantly faster than CPython while (afaik) allowing that sort of stuff to go on

First of all, that's only true when it manages to JIT the code; secondly, only until you try to do any of those slow things. For instance, the C ABI emulation they have both cannot support all of CPython and wrecks performance. The same is true if you try to do fancy things with sys._getframe, which a lot of code does in the wild (eg: all of logging).

In addition PyPy has to do a lot of special casing for all the crazy things CPython does. I recommend looking into the amount of engineering that went into it.
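For anyone who hasn't run into the sys._getframe pattern mentioned above, here is a minimal sketch of the kind of frame introspection that forces an implementation to keep real, inspectable frames around. The function names (describe_caller, handle_request) are made up for illustration and are not taken from logging or any other library.

```python
import sys


def describe_caller():
    """Return the calling function's name and a copy of its local variables."""
    frame = sys._getframe(1)  # 0 is this frame, 1 is the caller's frame
    return frame.f_code.co_name, dict(frame.f_locals)


def handle_request():
    user = "alice"
    # The callee can see our name and our locals without being passed either.
    return describe_caller()


print(handle_request())  # ('handle_request', {'user': 'alice'})
```

Because any callee may do this at any time, a JIT cannot simply keep locals in registers and forget the frame, at least not without a way to rebuild it on demand.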

kissgyorgy 5 years ago | | |

PyPy is faster at the price of higher memory usage, which is not always desirable.

ilyagr 5 years ago | |

I found the following talks by Armin Ronacher very informative on these topics (C API, why Python is more difficult to speed up than JS).

https://youtu.be/qCGofLIzX6g https://youtu.be/IeSu_odkI5I

I wish these were the things Python 3 addressed, rather than Unicode. I guess it's much more obvious in hindsight than back when Python 3 was designed.

mumblemumble 5 years ago | | |

I would guess that, if Python 3 hadn't addressed Unicode, Python would never have come to a place where so many people are worried about its performance.

Python's still a great language for the things it was being designed for back in the 2000s. But adding decent Unicode support is a big part of what helped it become an attractive language for use cases where I wish it performed better or had better support for parallelism. Natural language processing, for example.

arc776 5 years ago | | |

Just want to say thanks for these links, very interesting so far.

A point made in the video that seems to highlight the issue:

> Just adding two numbers requires 400 lines of code.

In compiled languages, this is one instruction! Think about the cache thrashing and memory loading involved in this one operation too. How can this possibly be fixed?

Python is a great language, but I don't know if it can ever be high performance on its own.
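As a rough illustration of why `a + b` is far from a single instruction (a sketch of the documented dispatch protocol, not of CPython's actual C code): the interpreter has to inspect both operand types at run time, try `__add__` on the left operand, and fall back to `__radd__` on the right one. The Meters class below is hypothetical.

```python
class Meters:
    """Toy numeric type used to make the dispatch of `+` visible."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        return NotImplemented  # tell the interpreter to try the other operand

    def __radd__(self, other):
        # Reached for `other + self` after other's __add__ has given up,
        # e.g. the `0 + Meters(1)` that sum() performs first.
        if other == 0:
            return self
        return NotImplemented


total = sum([Meters(1), Meters(2)])
print(total.value)  # 3
```

Every one of those lookups can be overridden per class, which is part of why the fast path is hard to pin down ahead of time.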

baq 5 years ago | | |

unicode absolutely had to be done. it'd be even more insane to leave strings as they were. maybe if you never venture outside of 7 bits it's only pain with negative ROI, but trust me the world has more languages than english and first-class support for unicode strings as just strings is a must. it was a painful transition but a necessary one. all other modern languages simply started there (and they're old enough to have a beer, too).

carabiner 5 years ago | | |

(OP is Armin)

rmrfstar 5 years ago | |

Python is the duct tape of programming languages.

Never the best tool if you have strict performance requirements, but so damn versatile it will never go away.

Cython does need better docs though, the steep learning curve means it is under-utilized.

johnisgood 5 years ago | | |

> Python is the duct tape of programming languages.

For some that glue is Forth. :D

> A guy named Jean-Paul Wippler is considering using Forth as a super glue language to bind Python Perl and Tcl together in a project called Minotaur (http://www.equi4.com/minotaur/minotaur.html).

> Forth is an ideal intermediary language, precisely because it's so agile. Otherwise, it wouldn't have been chosen for OpenFirmware, which when you think about it, is a Forth system that must interface to a potentially wide variety of programming language environments.

bydo 5 years ago | | |

We said the same thing about Perl for a couple decades.

edsac_xyzw 5 years ago | | |

This duct tape is the Python native extension API (not Python ctypes), which allows creating native code modules (aka libraries) in C or C++ and creating wrappers to existing C or C++ libraries. This escape hatch enables offloading CPU-intensive computations to high performance libraries written in C, C++ or Fortran. Another benefit of Python modules written in C or C++ is that they can release the GIL (Global Interpreter Lock) while running native code, thus they can take advantage of multi-core and SIMD instructions and achieve higher performance.

centimeter 5 years ago | | |

And like things made out of duct tape, I’ve never found anything made using python that actually functioned well.

Liquid_Fire 5 years ago | |

You could say the same about JavaScript, but with very heavy investment there are now several implementations that have improved its performance significantly.

Also see PyPy, which manages to squeeze a lot more performance out of Python for many use cases without changing the language.

ihnorton 5 years ago | | |

The principal developer of Pyston commented on the JavaScript comparison recently [1]:

> This is a common view but I've never heard it from someone who has tried to optimize Python. Personally I think that Python is as much more dynamic than JavaScript as JavaScript is than C.

[1] https://news.ycombinator.com/item?id=23247618

awestroke 5 years ago | | |

JS does not have mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

arc776 5 years ago | | |

The thing about Javascript is it's actually a very simple language. You can make a lot of guarantees and this means performance patterns can be implied.

Ultimately JS can be reduced to a very tight engine. This is not possible with Python, it's just too dynamic.

arc776 5 years ago | |

I completely agree. Everything is baked in to be slow. There is no way around it, I don't think you can write super fast interpreters like with Javascript - I might be wrong, but so far it hasn't happened.

For general use cases the performance is fine, but only thanks to the hard work of C/CPython/Cython programmers who give up Python's rich expressibility to gain this performance. It seems like you simply have to use another language to get anything running fast.

Having said all that, Pyc seems interesting as it apparently compiles Python. Has anyone had any experience of this?

chrisseaton 5 years ago | | |

> There is no way around it

What aspects of the language are you convinced cannot be optimised? There's tons of research in this area.

mikepurvis 5 years ago | |

It's true that there are certain non-negotiable costs there, and projects like Mercurial have invested heavily in trying to figure out how to make Python start up faster, and basically hit a brick wall (see: https://www.mercurial-scm.org/wiki/PerformancePlan).

That said, for a lot of other projects which haven't yet looked, there may be some low-hanging fruit. For example, I was doing some looking at this recently on a highly pluggable workspace build tool called colcon [1], and found that of 5+ seconds of startup time, I could save about 1 second with "business logic" changes (adding caching to a recursive operation), another 1 second by switching some filesystem operations to use multiprocessing, and about 1.5 seconds from making some big imports (requests, httpx, sanic) happen lazily on first use.

[1]: https://github.com/colcon/colcon-core/issues/398
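A minimal sketch of two of the techniques described above, deferring an import to first use and caching a recursive operation. This is not colcon's code: the function names are hypothetical, and `json` stands in for the heavier imports mentioned (requests, httpx, sanic).

```python
from functools import lru_cache


def parse_config(text):
    # Deferred import: the cost of loading the module is paid on the first
    # call that needs it, not at program startup.
    import json
    return json.loads(text)


@lru_cache(maxsize=None)
def expensive_recursive(n):
    """Stand-in for a recursive operation that keeps recomputing subresults."""
    if n < 2:
        return n
    return expensive_recursive(n - 1) + expensive_recursive(n - 2)


print(parse_config('{"jobs": 4}'))  # {'jobs': 4}
print(expensive_recursive(30))      # 832040
```

CPython's `-X importtime` option is one way to find out which imports are worth deferring in the first place.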

jnxx 5 years ago | |

That's really surprising if one considers for a moment how many things Python has in common with Common Lisp, a language which can be compiled to run near C speed (albeit with some sacrifices on safety i.e. "unsafe" optimizations). And if anything, Python 3 has become more similar to Lisp, while running at 1 / 20 of its speed.

beagle3 5 years ago | | |

Python does have a lot in common with CL; but the problem with Python is that almost any call you cannot statically inline, which is most of them, can change the semantics of everything else - you've just called math.floor() ; are you sure it wasn't just monkeypatched to assign 7 to all local variables who have an 'x' in their name in the caller's frame?

Most of these uses are very rare, but the tail is incredibly long for Python, and the problem is that you can't even compile a "likely normal" and a "here be dragons" version, and switch only when needed - you need to constantly verify. The same is not true, AFAIK, with Common Lisp - being a Lisp-2 and having a stronger lexical scope than Python does.
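A tame version of the math.floor scenario, as a sketch: the patch below only changes the return value rather than touching the caller's frame, but it shows that the call target is resolved at run time on every call, so a compiler cannot assume it knows what `math.floor` is. The round_down name is made up for illustration.

```python
import math


def round_down(x):
    # `math.floor` is looked up on the module each time this line runs.
    return math.floor(x)


original_floor = math.floor
before = round_down(2.9)

math.floor = lambda x: 7  # monkeypatch: every caller is affected
try:
    during = round_down(2.9)
finally:
    math.floor = original_floor  # always restore the real function

after = round_down(2.9)
print(before, during, after)  # 2 7 2
```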

Shedskin is a Python to C++ compiler that mostly requires the commonly-honoured constraint that a variable is only assigned a single type throughout its lifetime. (And that you don't modify classes after creation, and that you don't need integers longer than machine precision, and ....); While many programs seem to satisfy these requirements on superficial inspection, it turns out that almost all programs violate them in some way (directly or through a library).

The probability that Shedskin will manage to compile a program that was not written with Shedskin in mind is almost zero.
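To make the constraints listed above concrete, here is a sketch of perfectly ordinary Python that breaks each of the three named ones. The code is hypothetical and not taken from Shedskin's documentation or test suite.

```python
def summarize(values):
    result = 0              # starts life as an int
    for v in values:
        result += v
    if result > 100:
        result = "large"    # same name, now a str
    return result


# Integers silently grow past 64 bits instead of overflowing.
big = 2 ** 64 + 1


class Point:
    pass


# The class gains a method after it was created.
Point.norm = lambda self: 0.0

print(summarize([1, 2]), summarize([60, 50]), big, Point().norm())
```

None of this looks exotic in a code review, which is roughly why programs not written with such a compiler in mind rarely compile.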

Nuitka was started with the idea that, unlike shedskin, it will start by compiling the bytecode to an interpreter-equivalent execution (which it does, quite well), to get a minor speed up - and then gradually compile "provably simple enough" things to fast C++; a decade or so later, that's not working out as well as hoped, AFAIK because everything depends on something that violates simplicity.

chrisseaton 5 years ago | |

> That’s mutable interpreter frames, global interpreter locks, shared global state, type slots, the C ABI.

There's research towards solving all of these problems.

> The only way to speed it up would be to change the language.

Maybe we just haven't worked out how yet? Nothing you've mentioned is known to be impossible to make fast.

snicker7 5 years ago | |

Python is ~100x slower than C. There is definitely wiggle room for improvement.

Rochus 5 years ago | |

How would you explain then that LuaJIT is so much faster than CPython? Even the interpreter of LuaJIT is much faster.

> The only way to speed it up would be to change the language.

What specifically? Most of your points are not related to the language. And even current Smalltalk engines are much faster than CPython (see https://github.com/OpenSmalltalk/opensmalltalk-vm).

jashmatthews 5 years ago | | |

Lua doesn’t have assignment as an expression. Lua 5.1 has float as the only numeric type. Lua varargs are easier to implement.

Each VM op for Python or Ruby ends up being bigger and having more branches. For Ruby this is quite painful on the numeric types. Branching, boxing and unboxing is far slower than just testing and adding floats in the LuaJIT VM.

Due to assignment as an expression and things like x = foo(x, x+=1), Ruby, Python and JS all need to copy x into a new VM register when it's used. LuaJIT can assume locals aren't reassigned mid statement and doesn't need copies.
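Python has no `x += 1` expression, so the closest Python analogue of the example above uses the assignment expression operator `:=` (Python 3.8+). A small sketch, with foo as a placeholder function:

```python
def foo(a, b):
    return (a, b)


x = 1
# Arguments are evaluated left to right: the first argument must keep the
# old value of x even though x is rebound before the call happens.
result = foo(x, (x := x + 1))
print(result, x)  # (1, 2) 2
```

The old value has to live somewhere while the second argument is evaluated, which is the extra copy the comment is talking about.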

fanf2 5 years ago | | |

Lua exposes much less of its internals than Python. For example the comment you replied to mentioned stack frames which are not exposed in Lua.

stephc_int13 5 years ago | |

JavaScript and PHP were not designed to be fast either.

Oh, wait...

jerf 5 years ago | | |

Most languages designed in that era were not designed to be fast, and none of them were designed to be fast on 2020-era processors. The former is because this was the era of exponential CPU growth, and the latter because as good as many of these language designers were, none of them were psychic.

I'm pretty sure both Guido for Python and Larry for Perl were explicitly aware of the impossibility of designing for processors that wouldn't exist for 20 years, though digging up quotes to that effect would be quite difficult.

A mantra of that era is "There are no slow languages, only slow implementations." I, for one, consider this mantra to be effectively refuted. Even if there is a hypothetical Python interpreter/compiler/runtime/whatever that can run effectively as fast as C with no significant overhead (excepting perhaps some reasonable increase in compile time), there is no longer any reason to believe that mere mortal humans are capable of producing it, after all the effort that has been poured into trying, as documented by the original link. Whatever may be true for God or superhuman AIs, for human beings, there are slow languages that build intrinsically slow operations into their base semantics.

skohan 5 years ago | |

> C ABI

Why should this make python slow?

chrisseaton 5 years ago | | |

If you have to meet an existing ABI then you're constrained in how you can optimise.

moralsupply 5 years ago | |

That's not correct. Python will never be as fast as hand-optimized assembler, but it certainly can be much (5-10x) faster than what it is right now for most workloads. PyPy is living proof that it can be done.

dralley 5 years ago | | |

You're arguing with mitsuhiko, he's given entire talks on this subject.

https://www.youtube.com/watch?v=qCGofLIzX6g&t=31m44s

PyPy is faster for pure Python code, but that comes at the expense of having a far slower interface with C code. There's an entire ecosystem built around the fact that while Python itself is slow, it can very easily interface with native code (Numpy, Scipy, OpenCV) with very little overhead.

So sure, you can make Python much faster, if you're willing to piss off the very Python users who care the most about performance in the first place (the data science / ML people and anyone else using native extensions).

bastawhiz 5 years ago |

Ultimately, at least IMO, no attempt to speed up python will succeed until the issue of Python's C API is addressed. This is arguably Pypy's only major barrier: if you can't run the software on it, you're not going to use it. Pyston was arguably the most serious attempt at fast python while maintaining compatibility with the API, but DBX clearly didn't see the RoI they were hoping to.

It's looking like HPy is going to (hopefully) solve this. But finishing HPy and getting it adopted is likely to be a pretty massive undertaking.

intrepidhero 5 years ago |

What I really want for python is a knob to improve startup time. I've imagined there must be a way to "statically link" dependencies so that import isn't searching the disk but just loading from a fixed location/file. There don't seem to be many resources on the net. I've found this one: https://pythondev.readthedocs.io/startup_time.html. I tried using virtualenvs to limit my searchable import paths, and messed around with cython in an effort to come up with a statically linked binary. But I've yet to come up with anything that really improves the startup time. Clearly I have no idea what I'm doing.

joncatanio 5 years ago |

Not trying to self-promote, but this might be of interest to you. It's not a fully fleshed out implementation, but my project analyzed specific language features that affect performance: https://github.com/joncatanio/cannoli

hydroxideOH- 5 years ago | |

> Leave the features: Take the cannoli

Now that's how you title a thesis paper.

ramraj07 5 years ago | |

Can't stop laughing at the most germane name that project could ever have.

joncatanio 5 years ago | | |

Needed something to chuckle at during my work ha!

sethgecko 5 years ago |

Yury Selivanov tweeted yesterday that Python 3.10 will be "up to 10% faster" https://twitter.com/1st1/status/1318558048265404420

yxhuvud 5 years ago | |

Wait, python doesn't have any method lookup caching before this? I would have expected that developers looked at what other similar languages are doing, but apparently not enough.

saeranv 5 years ago | |

Am I correct that 3.10 comes after 3.9? How does that make sense, shouldn't it increase to 4.x? Is there an actual 3.1 (coming after 3.0) that this conflicts with?

eznzt 5 years ago | | |

Version numbers are not decimal numbers, they are read like the chapters of a book: 3.10 (chapter 3 section 10) comes after 3.9 (chapter 3 section 9)
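In code, the same idea: compare versions as tuples of integers, not as decimal numbers. A minimal sketch with a hypothetical parse_version helper that only handles plain dotted numbers:

```python
import sys


def parse_version(text):
    """Turn '3.10' into (3, 10) so components compare numerically."""
    return tuple(int(part) for part in text.split("."))


print(parse_version("3.10") > parse_version("3.9"))  # True
print(float("3.10") > float("3.9"))                  # False: 3.1 < 3.9

# The interpreter exposes its own version in a tuple-like form too.
print(sys.version_info >= (3, 0))
```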

theandrewbailey 5 years ago | | |

10 comes after 9, so 3.10 comes after 3.9. There's no major changes that would warrant 3.x to 4.0. It's just the 10th big release after 3.0.

Yes, there was a Python 3.1: https://www.python.org/download/releases/3.1/

stuaxo 5 years ago | |

That's pretty good, in optimisation, 5% at a time is a good win.

centimeter 5 years ago | |

That seems pretty small compared to the huge gap between python and basically any compiled language.

willseth 5 years ago |

The list should probably also include mypyc: https://github.com/python/mypy/tree/master/mypyc

Twirrim 5 years ago |

Another one missing from that list is Graalpython, https://github.com/graalvm/graalpython. It's in early stages of implementation, aimed at being python3 on top of GraalVM.

1wd 5 years ago |

One more: https://github.com/microsoft/Pyjion

forgotpwd16 5 years ago | |

On the repo there's also a comparison[1] with some of the other implementations.

[1]: https://github.com/microsoft/Pyjion#how-do-this-compare-to-

sitkack 5 years ago | | |

I find it really interesting, that not only did they do the work of creating a JIT using the CoreCLR for CPython, they created a JIT API so that their system is augmenting CPython and not taking it over. Solid engineering.

This also means that one could implement an alternative JIT using Rust or OCaml.

https://github.com/microsoft/Pyjion#what-are-the-goals-of-th...

Naac 5 years ago |

This article appears to be a list of python interpreters.

Not all of these were designed for speed. For example, jython was also intended for Java/python interoperability.

Some of the interpreters on the list haven't seen updates in a while, or don't support python 3.x

nknealk 5 years ago | |

Numba is actually a pretty interesting project. It allows you to JIT compile a single function with a decorator. Static typing required, and it plays nice with numpy. They’ve also got some interesting stuff going on that lets you interface with nvidia GPUs as well.

Highly recommend it for anyone doing scientific computing
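A minimal sketch of the decorator usage described above. Numba is a third-party package and is assumed to be installed for any speedup; the fallback decorator is only there so the snippet still runs, uncompiled, without it. As far as I understand, explicit type signatures are optional because Numba infers the argument types on the first call, but the function body has to stay within the subset of Python it can type. The function name is made up.

```python
try:
    from numba import njit  # third-party; compiles the function on first call
except ImportError:
    def njit(func):
        # Fallback: leave the function as plain Python if Numba is missing.
        return func


@njit
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))  # 285
```

The first call pays the compilation cost, so the benefit shows up on hot loops that run many times or over large inputs.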

sitkack 5 years ago | | |

I agree, Numba is awesome for lots of reasons. The biggest advantage for everyone, the Numba team as well as its users, is that it is opt-in and done with intent. The programmer is saying, "I am willing to constrain my code to get perf". And that you can do that inside an existing runtime is pretty damn cool.

I think for Python to get decent speedups the semantics for the code being optimized needs to be highly constrained.

Optimizing full in the wild Python code is a huge huge task. Optimizing for operations over constant type arrays is much much easier.

Yes this doesn't speed up the call or the allocation rate, but start with some easy stuff or nothing will improve.

michelpp 5 years ago | | |

Numba is indeed a great library for speeding up Python and also doing other useful things when interacting with external libraries.

For example the "jit compile a single function" feature is gold when you need to pass a function callback pointer into a C library. This is how pygraphblas compiles Python functions into semiring operators that are then passed to the library, which has no idea that the pointer is to a jit compiled python function:

https://github.com/michelp/pygraphblas/blob/master/tests/tes...

weakfish 5 years ago | | |

Forgive my ignorance, I'm not super knowledgeable on the subject, but does this mean you just add decorators to existing functions with typing and it enhances the speed?

acomjean 5 years ago |

I tend to use Python for batch jobs and things where its speed isn't that important to me. Am I alone in this?

When I reach for python it's not for speed. It's because it's fairly easy to write and has some good libraries.

Either it's done in a few seconds, or I can wait a few hours as it runs as a background slurm task.

I feel like there is a group that wants python to be the ideal language for all things, maybe because I'm not in love with the syntax, but I'm ok having multiple languages.

nemothekid 5 years ago | |

Many people don't start with Python for speed. They are exactly like you - they write a script that is done in a few seconds. Then the data scales, and it takes a few minutes. Then you need it to be faster, and now you need to either rewrite the script or live with the slowness. It would be helpful if you didn't need to make this choice.

ufo 5 years ago |

IIRC Psyco was a precursor to PyPy. Armin Rigo was involved in both.

beervirus 5 years ago | |

Psyco was great. Add two lines of code, suddenly everything is (at least a little, often a lot) faster.

xioxox 5 years ago | | |

It was great. It showed that it is actually possible to run fast Python from within the standard interpreter with excellent compatibility. The only downside I remember was the memory usage.

chrisseaton 5 years ago | | |

Why isn't everyone using it then?

arc776 5 years ago |

I gave up trying to make Python fast since to do so you give up what makes Python good and end up writing C/Cython. On top of this, distributing Python is just... gross, at least for my use cases.

Eventually I found Nim and never looked back. Python is simply not built for speed but for productivity. Nim is built for both from the start. It's certainly lacking the ecosystem of Python, but for my use cases that doesn't matter.

zanellia 5 years ago |

In my opinion there is some potential there. Especially exploiting the increasing integration of typing-oriented features (i.e. type annotations) and the interest in using those to carry out static analysis (e.g. in mypy, but also Facebook's Pyre and Microsoft's Pyright and many others), it might be possible to speed up execution times a bit. This is especially true if we restrict attention to a limited subset of Python as, e.g., within domain specific languages. It might not make sense to entirely reverse engineer a language that was designed to be duck-typed into a statically typed one. However, for some domain specific applications I find performance oriented static analysis an interesting tool.

To make it more concrete, here is an experimental DSL for embedded high-performance computing that uses static analysis and source-to-source (Python-to-C, actually) code transformation: https://github.com/zanellia/prometeo.
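On the type-annotation point above, a small sketch of what the annotations give a tool to work with. It uses only the standard library and is not taken from prometeo; the scale function is hypothetical. The annotations are available for inspection, but the interpreter itself does not enforce them, which is why a compiler has to restrict the language subset before it can rely on them.

```python
from typing import get_type_hints


def scale(vector: list, factor: float) -> list:
    return [factor * v for v in vector]


# A static analyzer or source-to-source compiler can read these hints...
hints = get_type_hints(scale)
print(hints)

# ...but at run time nothing stops a caller from passing an int factor.
print(scale([1, 2], 2))  # [2, 4]
```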

overgard 5 years ago |

I don't know much about the other ones, but I think you'd have to say PyPy has been a success. Although to be honest, I don't know why it would be better to modify CPython vs. just using PyPy -- the JIT speedup does come with some tradeoffs (memory usage, warmup times), so it seems better just to leave that decision up to the user?

thelazydogsback 5 years ago |

It amazes me that the stack-entwined implementation with the GIL remained the canonical version this whole time -- I would think that the Stackless version (or similar) would have been the default long-ago. This really should have made it worth it from a 2.x to 3.x version perspective, even if many people had to rewrite their extensions, and even if some monkey-patching were removed from the language in favor of more disciplined meta-programming.

beagle3 5 years ago | |

Stackless still uses the GIL; But it avoids using the C stack most of the time, which opens the door to green threads (of which you can have a lot more than OS threads), suspending processes (dump/undump style, except portably), coroutines and more.

There were a couple of GIL-less variations, but they were either incredibly slow, or suffered serious compatibility problems (and often both).

zellyn 5 years ago |

Forgot one : https://github.com/google/grumpy

Boxxed 5 years ago |

Whatever happened to psyco? I remember it pretty much just working without any hassle and actually providing a noticeable speedup. All the mindshare is now on PyPy -- it's received enormous amounts of engineering and still seems very rough around the edges.

thelazydogsback 5 years ago | |

psyco worked well for me at the time as well -- I remember doing something with it and pyGame, FWIR.

est 5 years ago |

The HotPy listed by OP is by Mark Shannon, the same person behind today's proposed 5x speedup.

Also, some relevant old post:

https://news.ycombinator.com/item?id=17107047