Attempts to make Python fast (sethops1.net)
The only way to speed it up would be to change the language.
First of all that's only true when it managed to jit the code, secondly only until you try to do any of those slow things. For instance the C ABI emulation they have both cannot support all of CPython and wrecks performance. The same is true if you try to do fancy things with sys._getframe which a lot of code does in the wild (eg: all of logging).
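As a rough illustration of the `sys._getframe` point, the sketch below shows the kind of caller lookup a logging helper performs. `caller_name` and `handler` are hypothetical names, and this is not the actual `logging` implementation, only the same general pattern.

```python
import sys


def caller_name(depth=1):
    """Return the function name of a frame further up the call stack.

    sys._getframe is a CPython implementation detail; a JIT has to be able
    to produce a real frame object whenever code like this asks for one.
    """
    frame = sys._getframe(depth)
    return frame.f_code.co_name


def handler():
    # depth=1 means "whoever called caller_name", which is handler itself.
    return caller_name()


print(handler())
```

Any function anywhere in a program may do this, so an optimiser cannot simply assume frames are never inspected.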
In addition PyPy has to do a lot of special casing for all the crazy things CPython does. I recommend looking into the amount of engineering that went into it.
https://youtu.be/qCGofLIzX6g https://youtu.be/IeSu_odkI5I
I wish these were the things Python 3 addressed, rather than Unicode. I guess it's much more obvious in hindsight than back when Python 3 was designed.
Python's still a great language for the things it was being designed for back in the 2000s. But adding decent Unicode support is a big part of what helped it become an attractive language for use cases where I wish it performed better or had better support for parallelism. Natural language processing, for example.
A point made in the video that seems to highlight the issue:
> Just adding two numbers requires 400 lines of code.
In compiled languages, this is one instruction! Think about the cache thrashing and memory loading involved in this one operation too. How can this possibly be fixed?
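To give a feel for where that work goes, here is a simplified model, written in Python itself, of the decisions `left + right` involves. `dynamic_add` is a hypothetical helper and leaves out a lot of what CPython really does (C-level slots, reference counting, error details); it only sketches the `__add__` / `__radd__` protocol.

```python
def dynamic_add(left, right):
    """Simplified model of what `left + right` has to decide at run time."""
    left_type, right_type = type(left), type(right)

    # Methods are looked up on the types, not on the instances.
    add = getattr(left_type, "__add__", None)
    radd = getattr(right_type, "__radd__", None)

    # A subclass on the right gets the first chance if it overrides __radd__.
    if (radd is not None and right_type is not left_type
            and issubclass(right_type, left_type)
            and radd is not getattr(left_type, "__radd__", None)):
        result = radd(right, left)
        if result is not NotImplemented:
            return result
        radd = None

    if add is not None:
        result = add(left, right)
        if result is not NotImplemented:
            return result

    if radd is not None and right_type is not left_type:
        result = radd(right, left)
        if result is not NotImplemented:
            return result

    raise TypeError(
        f"unsupported operand types: {left_type.__name__} and {right_type.__name__}"
    )


print(dynamic_add(1, 2), dynamic_add(1, 2.5), dynamic_add("a", "b"))
```

None of these branches can be skipped unless the implementation already knows both operand types.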
Python is a great language, but I don't know if it can ever be high performance on its own.
Never the best tool if you have strict performance requirements, but so damn versatile it will never go away.
Cython does need better docs though; the steep learning curve means it is under-utilized.
For some that glue is Forth. :D
> A guy named Jean-Paul Wippler is considering using Forth as a super glue language to bind Python Perl and Tcl together in a project called Minotaur (http://www.equi4.com/minotaur/minotaur.html).
> Forth is an ideal intermediary language, precisely because it's so agile. Otherwise, it wouldn't have been chosen for OpenFirmware, which when you think about it, is a Forth system that must interface to a potentially wide variety of programming language environments.
Also see PyPy, which manages to squeeze a lot more performance out of Python for many use cases without changing the language.
> This is a common view but I've never heard it from someone who has tried to optimize Python. Personally I think that Python is as much more dynamic than JavaScript as JavaScript is than C.
Ultimately JS can be reduced to a very tight engine. This is not possible with Python; it's just too dynamic.
For general use cases the performance is fine, but only thanks to the hard work of C/CPython/Cython programmers who give up Python's rich expressibility to gain this performance. It seems like you simply have to use another language to get anything running fast.
Having said all that, Pyc seems interesting as it apparently compiles Python. Has anyone had any experience of this?
What aspects of the language are you convinced cannot be optimised? There's tons of research in this area.
That said, for a lot of other projects which haven't yet looked, there may be some low-hanging fruit. For example, I was doing some looking at this recently on a highly pluggable workspace build tool called colcon [1], and found that of 5+ seconds of startup time, I could save about 1 second with "business logic" changes (adding caching to a recursive operation), another 1 second by switching some filesystem operations to use multiprocessing, and about 1.5 seconds from making some big imports (requests, httpx, sanic) happen lazily on first use.
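Two of those changes can be sketched with nothing but the standard library. This is not colcon's code: `count_paths` is a hypothetical recursive operation, and `json` merely stands in for a heavy dependency such as requests or httpx.

```python
import functools


@functools.lru_cache(maxsize=None)
def count_paths(depth):
    """Hypothetical recursive operation; the cache avoids recomputing subproblems."""
    if depth <= 1:
        return 1
    return count_paths(depth - 1) + count_paths(depth - 2)


def to_json(payload):
    # The import runs on the first call rather than at program start-up,
    # so commands that never serialise anything never pay for it.
    import json
    return json.dumps(payload, sort_keys=True)


print(count_paths(30), to_json({"b": 1, "a": 2}))
```

After the first call the function-level import is just a lookup in `sys.modules`, so repeated calls stay cheap.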
Most of these uses are very rare, but the tail is incredibly long for Python, and the problem is that you can't even compile a "likely normal" version and a "here be dragons" version and switch only when needed; you need to constantly verify. The same is not true, AFAIK, with Common Lisp, being a lisp1 and having a stronger lexical scope than Python does.
Shedskin is a Python to C++ compiler that mostly requires the commonly honoured constraint that a variable is only assigned a single type throughout its lifetime (and that you don't modify classes after creation, and that you don't need integers longer than machine precision, and so on). While many programs seem to satisfy these requirements on superficial inspection, it turns out that almost all programs violate them in some way (directly or through a library).
The probability that Shedskin will manage to compile a program that was not written with Shedskin in mind is almost zero.
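For illustration, each of the three constraints named above is broken by perfectly ordinary Python. The names here (`value`, `Config`, `big`) are made up for the example and say nothing about how Shedskin reports such cases.

```python
# A variable bound to two different types over its lifetime.
value = 1
value = "one"


class Config:
    pass


# A class modified after creation.
Config.debug = True

# An integer wider than a 64-bit machine word.
big = 2 ** 64

print(type(value).__name__, Config.debug, big.bit_length())
```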
Nuitka was started with the idea that, unlike shedskin, it will start by compiling the bytecode to an interpreter-equivalent execution (which it does, quite well), to get a minor speed up - and then gradually compile "provably simple enough" things to fast C++; a decade or so later, that's not working out as well as hoped, AFAIK because everything depends on something that violates simplicity.
There's research towards solving all of these problems.
> The only way to speed it up would be to change the language.
Maybe we just haven't worked out how yet? Nothing you've mentioned is known to be impossible to make fast.
> The only way to speed it up would be to change the language.
What specifically? Most of your points are not related to the language. And even current Smalltalk engines are much faster than CPython (see https://github.com/OpenSmalltalk/opensmalltalk-vm).
Each VM op for Python or Ruby ends up being bigger and having more branches. For Ruby this is quite painful on the numeric types. Branching, boxing and unboxing is far slower than just testing and adding floats in the LuaJIT VM.
Due to assignment being an expression, and things like x = foo(x, x+=1), Ruby, Python and JS all need to copy x into a new VM register when it's used. LuaJIT can assume locals aren't reassigned mid-statement and doesn't need copies.
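Python does not accept `x += 1` inside a call, but the assignment expression `:=` (Python 3.8+) shows the same effect. A minimal sketch, with `foo` as a hypothetical function:

```python
def foo(first, second):
    return first, second


x = 1
# Arguments are evaluated left to right, so the first argument must capture
# the old value of x before the assignment expression rebinds it.
result = foo(x, (x := x + 1))
print(result, x)
```

The first argument sees 1 and the second sees 2, which is why the VM cannot treat both reads of `x` as the same register.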
Oh, wait...
I'm pretty sure both Guido for Python and Larry for Perl were explicitly aware of the impossibility of designing for processors that wouldn't exist for 20 years, though digging up quotes to that effect would be quite difficult.
A mantra of that era is "There are no slow languages, only slow implementations." I, for one, consider this mantra to be effectively refuted. Even if there is a hypothetical Python interpreter/compiler/runtime/whatever that can run effectively as fast as C with no significant overhead (excepting perhaps some reasonable increase in compile time), there is no longer any reason to believe that mere mortal humans are capable of producing it, after all the effort that has been poured into trying, as documented by the original link. Whatever may be true for God or superhuman AIs, for human beings, there are slow languages that build intrinsically slow operations into their base semantics.
Why should this make python slow?
https://www.youtube.com/watch?v=qCGofLIzX6g&t=31m44s
PyPy is faster for pure Python code, but that comes at the expense of having a far slower interface with C code. There's an entire ecosystem built around the fact that while Python itself is slow, it can very easily interface with native code (Numpy, Scipy, OpenCV) with very little overhead.
So sure, you can make Python much faster, if you're willing to piss off the very Python users who care the most about performance in the first place (the data science / ML people and anyone else using native extensions).
It's looking like HPy is going to (hopefully) solve this. But finishing HPy and getting it adopted is likely to be a pretty massive undertaking.
Now that's how you title a thesis paper.
Yes, there was a Python 3.1: https://www.python.org/download/releases/3.1/
[1]: https://github.com/microsoft/Pyjion#how-do-this-compare-to-
This also means that one could implement an alternative JIT using Rust or OCaml.
https://github.com/microsoft/Pyjion#what-are-the-goals-of-th...
Not all of these were designed for speed. For example, Jython was also intended for Java/Python interoperability.
Some of the interpreters on the list haven't seen updates in a while, or don't support Python 3.x.
Highly recommend it for anyone doing scientific computing
I think for Python to get decent speedups the semantics for the code being optimized needs to be highly constrained.
Optimizing full in the wild Python code is a huge huge task. Optimizing for operations over constant type arrays is much much easier.
Yes this doesn't speed up the call or the allocation rate, but start with some easy stuff or nothing will improve.
For example the "jit compile a single function" feature is gold when you need to pass a function callback pointer into a C library. This is how pygraphblas compiles Python functions into semiring operators that are then passed to the library which has no idea that the pointer is to jit compiled python function:
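A non-JIT analogue of the callback idea, assuming a POSIX system where the C library exposes `qsort`: the standard library's `ctypes` can wrap a Python function as a C function pointer. This is not how pygraphblas does it (there the pointer refers to compiled code); here every comparison re-enters the interpreter. `compare_ints` is a hypothetical name.

```python
import ctypes
import ctypes.util

# Load the C runtime; on POSIX systems CDLL(None) also exposes libc symbols.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# int (*compar)(const void *, const void *) expressed as a ctypes prototype.
CompareFn = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int)
)


def compare_ints(a, b):
    # Runs in the Python interpreter each time qsort needs a comparison.
    return (a[0] > b[0]) - (a[0] < b[0])


libc.qsort.restype = None
libc.qsort.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_size_t, CompareFn]

values = (ctypes.c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), CompareFn(compare_ints))
print(list(values))
```

The C library only ever sees a function pointer, which is the property the JIT approach relies on as well.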
https://github.com/michelp/pygraphblas/blob/master/tests/tes...
When I reach for Python it's not for speed. It's because it's fairly easy to write and has some good libraries.
Either it's done in a few seconds, or I can wait a few hours as it runs as a background slurm task.
I feel like there is a group that wants python to be the ideal language for all things, maybe because I'm not in love with the syntax, but I'm ok having multiple languages.
Eventually I found Nim and never looked back. Python is simply not built for speed but for productivity. Nim is built for both from the start. It's certainly lacking the ecosystem of Python, but for my use cases that doesn't matter.
To make it more concrete, here is an experimental DSL for embedded high-performance computing that uses static analysis and source-to-source (Python-to-C, actually) code transformation: https://github.com/zanellia/prometeo.
There were a couple of GIL-less variations, but they were either incredibly slow, or suffered serious compatibility problems (and often both).
Also, some relevant old post:
I love the idea of typed base language to implement a higher level more flexible language while still being able to drop down for correctness and speed. Gradually dynamically typed, ;)
Another thing to look at is https://chocopy.org/ a typed subset of Python for teaching compilers courses. Might be worthwhile pinging Chocopy students and enticing them towards epython.
What is the semantic union and intersection between EPython and Chocopy?
I think the approach where a typed subset of Python is used to compile a fast extension module is the way forward for Python. This would leave us with a slow but dynamic high-level-variant (CPython) and typed lower-level-variant (EPython, mypyc & co) to compile performant extension modules, which you can easily import into your CPython code.
The most prominent of such projects I know of is mypyc [0], which is already used to improve performance for mypy itself and the black [1] code formatter. I think it would be interesting to see how EPython compares to mypyc.
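To make the "typed subset" idea concrete, here is a hypothetical function (`dot` is my own example, not taken from mypyc or EPython) written with ordinary type annotations. It runs unchanged under CPython; the assumption is that a compiler working on such a subset could use the annotations to emit specialised native code.

```python
from typing import List


def dot(xs: List[float], ys: List[float]) -> float:
    """Plain annotated Python: valid for the interpreter and a typed compiler alike."""
    total: float = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


print(dot([1.0, 2.0], [3.0, 4.0]))
```

The calling code stays dynamic Python and simply imports the module, compiled or not.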
The C API is what prevents PyPy or other Python runtimes from being able to compete and interop. The community could do this, rebase Python modules with native code to cffi so that they can run in all Pythons. The C API is neither good, nor necessary and only serves to gate keep CPython's access to the rest of the Python user community.
It's a bit like python3 from python2, it's been so slow and painful to transition because you cannot just "drop all your code" (I'm simplifying the issue).
Doesn't look like that from over here.
Many times the difference between failure and the magic spell working is one more late-night iteration. In this specific case you are working against some difficult constraints that are deep in the language. That said, there is almost always a way to side-step a problem altogether. You may find that one workaround is to amortize the startup concern over time, i.e. reorient the problem domain so you only have to start the Python process once a day. Or, find a way to defer loading of required components until the runtime actually needs them.
However, idiomatic Python shortcuts to expose everything at the top level (star imports or imports of everything in the top-level __init__.py) cause everything to be imported everywhere. __all__ is all but forgotten, so importing things like flask, sqlalchemy, requests and similar will take anywhere from 100-500ms each, even if you just need a single function from a submodule.
Worst offenders are things which embed their own copy of requests (likely for reproducible builds) taking upwards of 800ms just to import even if your project already imported requests directly.
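If you want to check numbers like these on your own machine, a small timing helper is enough. `timed_import` is a hypothetical helper, and `decimal` is only a stand-in for flask, sqlalchemy or requests; CPython's `-X importtime` option gives a per-module breakdown if you need more detail.

```python
import importlib
import time


def timed_import(module_name):
    """Import a module and return (module, seconds spent importing it)."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


# A module already in sys.modules returns almost instantly, which is why
# only the first import of a package shows up as a start-up cost.
module, first = timed_import("decimal")
_, repeat = timed_import("decimal")
print(f"first: {first * 1000:.2f} ms, repeat: {repeat * 1000:.2f} ms")
```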
I don't think it has anything to do with search paths, but simply with loading and executing hundreds of files. If you need those modules, Python will read them. Perhaps moving your venv to a "ramdisk" might help?
python -s [-S]
And yeah the C ABI is slow, but that's true of practically every language. Again, it's a choice of if you use those things or not. That doesn't devalue making other parts of the language faster.
>12 March 2012
>Psyco is unmaintained and dead. Please look at PyPy for the state-of-the-art in JIT compilers for Python.
That's quite easy to achieve if you directly generate bytecode. See e.g. https://github.com/rochus-keller/som.
> Lua 5.1 has float as the only numeric type
It internally distinguishes between int and float.
> Ruby, Python and JS all need to copy x into a new VM Register when it’s used
Even the OpenSmalltalk VM is much faster than CPython, as well as V8.
This was part of a class project so not available online unfortunately. It’s good practice to implement it yourself though! There are lots of resources online for implementing fast allocators.
In my direct experience, the only people who waited until the bitter end (and beyond) were ops folks who never had to stray much outside of 7-bit ASCII, and companies with large existing codebases that didn't want to allocate the resources to migrating. Neither of those really have much to do with my assertion that Python 3 attracted new people doing new things.
Note that you can only look up variables by their bytecode register number, not by name.
    from numba import jit

    @jit
    def jitted_fn():
        ...  # function body not shown in the original comment

https://numba.readthedocs.io/en/stable/user/jit.html
Common Lisp is a Lisp2.
It makes it easier to compile than a lisp1 (to which Python is closer), because the standard call form s-expression can be bound early.
FWIW, here's the relevant dispatch code in Python's ceval.c where you see it uses a very generic dispatching at that level, which eventually, deeper down, gets down to the "oh, it's an integer!"
    case TARGET(BINARY_ADD): {
        PyObject *right = POP();
        PyObject *left = TOP();
        PyObject *sum;
        /* NOTE(haypo): Please don't try to micro-optimize int+int on
           CPython using bytecode, it is simply worthless.
           See http://bugs.python.org/issue21955 and
           http://bugs.python.org/issue10044 for the discussion. In short,
           no patch shown any impact on a realistic benchmark, only a minor
           speedup on microbenchmarks. */
        if (PyUnicode_CheckExact(left) &&
                 PyUnicode_CheckExact(right)) {
            sum = unicode_concatenate(tstate, left, right, f, next_instr);
            /* unicode_concatenate consumed the ref to left */
        }
        else {
            sum = PyNumber_Add(left, right);
            Py_DECREF(left);
        }
        Py_DECREF(right);
        SET_TOP(sum);
        if (sum == NULL)
            goto error;
        DISPATCH();
    }
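You can see the same generic opcode from the Python side with the standard `dis` module. A minimal sketch (`add` is just an example function); the opcode name depends on the CPython version, `BINARY_ADD` before 3.11 and `BINARY_OP` from 3.11 on.

```python
import dis


def add(a, b):
    return a + b


# One generic "binary operation" opcode is emitted regardless of operand
# types; the type-specific work happens inside the interpreter at run time.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```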
Python code can be made higher performance if there's some way to tell the implementation the types, either explicitly or by inference or tracing. That's how several of those listed projects get their performance.
In your example, text processing in `unicode_concatenate` is going to be very, very much slower than a bulk load of the native numerical data directly from memory and processing it. For each character, Python needs to check a number is still a number at run time then convert the result to a native numeric. I can only assume this string processing is at worst performed once and cached(?), because otherwise it doesn't seem like it would run well at all and surely Python's bigint performance is pretty important.
> Python code can be made more high performance if there's some way to tell the implementation the types, either explicitly or by inference or tracing.
At that stage, I would just use Nim and get better performance and a decent static type system included and either call it from Python, or call Python from Nim.
Guess I could also have used 5j + 3 as a counter-example.
If this is an issue then at this stage, many Python people switch to use one of the alternatives mentioned here, like Cython, which is a Python-like language which includes a static type system (including support for C++ templates) and can easily generate C extensions that can call and be called from Python.
And well, most JS implementations do not have GIL because they are not multithreaded at all.
In fact I think that there are many relatively simple modifications that would make CPython significantly faster, but many such things conflict with each other in ways that make the resulting complexity not worth it.
A lot of people seem to have the mistaken impression that v8 makes Javascript "fast". It's "fast" for a dynamic language. But on general code... it's still slow. It seems to plateau around 10x slower than C, as with the other JIT efforts to speed up dynamic languages, with a roughly 5-10x memory penalty in the process.
Microbenchmarks like the benchmark game tend to miss this because a lot of microbenchmarks focus on numeric speed. But numeric code is easy mode for a JIT. Now, that's cool, and there's nothing wrong with that. If it's the sort of code you have, great! You win. But that performance doesn't translate to general code. These are not value judgments, these are just facts about the implementation.
I expect v8 is roughly as fast as JS is going to get, and it's now news if they can eke out a .5% improvement on general code.
You can also do much better with v8 if you program in a highly restricted subset of JS that it happens to be able to JIT very well. However, this is not really the same as writing in JS. It's an undocumented subset, it's a constantly changing subset, and there's not a lot of compiler support for it (I'm not aware of anything like a "use JITableOnly" or anything).
Because I don't need it to be hermetically-sealed perfection, I just need my Python code to spit out a good result when I throw it at a problem; never mind that it took a few seconds to spin up or needs more memory than a perfectly crafted C program.
Is this wishful thinking?
Perl is extinct in comparison. It's not been used for any projects anywhere for a long long time.
Which is a good example that the decrease in use can go a lot faster than you think. Perl was widely used in 2000, and thought to be on par with Python. Similarly Visual Basic which nobody seems to remember any more.
Also, COBOL is simply used because it is uneconomical to rewrite those old programs, not because it is a good language to write new stuff in. But the heavy dependency of Python programs on libraries hosted across the web means that obsolescence can happen a lot faster today; a COBOL program is almost totally self-contained in comparison.
Also saying the effort of PyPy is purely around speed is misleading. After all, another huge goal of the project was to implement a python interpreter in python, which they succeeded at.
It's in geometric mean, not average (see http://ece.uprm.edu/~nayda/Courses/Icom5047F06/Papers/paper4...). The same principle is applied to all testees. It's normal that certain benchmarks run faster than others. That's why we compare geometric means.
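A small worked example of why the geometric mean is the usual choice for ratios, using made-up numbers: one benchmark runs in half the baseline time and another takes twice as long. It uses `statistics.geometric_mean`, available from Python 3.8.

```python
import statistics

# Hypothetical time ratios (implementation time / baseline time).
ratios = [0.5, 2.0]

arithmetic = statistics.mean(ratios)
geometric = statistics.geometric_mean(ratios)
print(arithmetic, geometric)
```

The arithmetic mean suggests a 25% slowdown, while the geometric mean reports no net change, which matches the intuition that a 2x win and a 2x loss cancel out.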
> Also comparing a volunteer project to an interpreter that has the resources of google behind it is IMO pretty unfair.
Didn't the project run for nearly twenty years with several rounds of EU funding? I think it's rather the approach than the team size or corporate support. See e.g. LuaJIT, which was implemented by a single person in a shorter time frame and achieves performance similar to Node.js.
> Also saying the effort of PyPy is purely around speed is misleading
Didn't say that. But unfortunately the other RPython-based implementations don't seem to be faster either.
https://doc.pypy.org/en/release-1.9/index-report.html
https://ieeexplore.ieee.org/document/1667583
https://mail.python.org/pipermail/pypy-dev/2004-December/001...
> mutable interpreter frames, global interpreter locks, shared global state, type slots
On top of this, Python is extremely dynamic and nothing can be assured without running through the code. So this leads to needing JITs to improve performance, which then give a slow start-up time and increased complexity. Even with a JIT, Python is just not fast thanks to the above issues and its overall dynamism.
It can be optimised and for sure there's some impressive attempts at doing so. However I don't think pure Python will ever be considered "fast" as these things necessarily get in the way.
I highly recommend the two videos posted here that go into more detail as to why there are limits to how far optimisation can go: https://youtu.be/qCGofLIzX6g https://youtu.be/IeSu_odkI5I
I'd challenge the idea that there really are known 'limits'. As I say there's research towards this, these videos are old, and Armin and Seth may not be up to date with all of the literature (in fact I'm sure Seth is not, as he's missing at least one major current Python implementation research project from his blog post.)
There are good reasons why these limits cannot be overcome in that the complexity and dynamism of the language precludes it.
Being interpreted is one cost that sets a significant barrier to performance, and the dynamic complexity further compounds it. For example whereas JS is basically only functions, in Python you have a huge range of ways you can do incredibly complex things with slot wrappers, descriptors, and metaprogramming.
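As one small instance of that dynamism, a descriptor lets a plain-looking attribute read run arbitrary code. `Traced` and `Point` are hypothetical names for this sketch.

```python
class Traced:
    """Descriptor: every read of the attribute runs arbitrary Python code."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        instance.reads += 1  # a side effect hidden behind `p.x`
        return instance.__dict__["_" + self.name]


class Point:
    x = Traced()

    def __init__(self, x):
        self.reads = 0
        self._x = x


p = Point(3)
total = p.x + p.x
print(total, p.reads)
```

An optimiser looking at `p.x + p.x` cannot assume the two reads are identical or free of side effects without first knowing what `type(p)` defines.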
Ultimately, Python will get faster, but diminishing returns are inevitable. Python can never be as fast as the equivalent code in a compiled language. It simply has too much extra work to do.
Having the option to be slow startup/fast execution is a good option to have. Maybe not for some, but definitely needed by others.
These are only simple benchmarks, but do indicate a rough ballpark for TruffleRuby: https://github.com/kostya/benchmarks
As I understand it, Crystal would be a good Ruby alternative if you want performance. This is of course a whole new language designed with performance in mind from the beginning and here is a repeating theme: you need to consider performance at the start, not 20 years later.
I worked at the largest US bank and had the unfortunate task of decommissioning the last Perl software. After a lot of archaeology, it turned out there was never much Perl really, just some short scripts here and there. There were one or two flagship applications in the early 2000s, but they were rewritten long ago.
Looking at our Python codebase, however, that's tens of millions of lines of code covering all types of applications and all aspects of the business. It will still be there 30 years later.
The dependency to the interpreter and external libraries is a problem indeed. They're constantly shifting or getting abandoned under your feet. I wonder how this will be managed eventually.
And if Google, Amazon, NASA, etc (all places where Python is used heavily), are not high enough for you, what exactly are your standards?
Not dog shit. Just because you can make money slinging dog shit doesn’t make it good
Crystal has wildly different semantics to Ruby, so it’s not a good alternative at all.
Can you give specific examples and prove that they cannot be overcome?
How much of the literature have you read?
I'll give you a concrete example of how I see these claims - people said monkey-patching in Python and Ruby was a hard overhead to peak temporal performance and fundamentally added a cost that could not be removed... turns out no that cost can be completely eliminated. I could give you a list of similar examples as long as you want.
Is it eliminated in any production interpreter/VM used by Python, Ruby or any other mainstream language?
I mean, it's nice if it's research, but if I'm a boring programmer churning out Enterprise Middleware using these languages, do I get to use it?
Or is it just a pre-alpha branch of PyPy that might be out in 2025, if we're lucky? :-)
But come on... we were arguing 'impossible' a second ago and now we've watered that down to 'not production ready'. We're making progress.
It's hard to prove a theoretical negative, but perhaps by comparison with the run time performance of static AOT (SAOT) compiled languages I can show what I mean.
Dynamic typing:
- Python requires dynamic type checking before any program work is done. SAOT doesn't need to do any run time work.
- Adding two numbers in Python requires handling run time overloads and a host of other complexities. SAOT is a single machine code instruction that requires no extra work.
Boxing values:
- Python value boxing requires jumping about in heap memory. An SAOT language can not only remove this cost but reduce it to raw registers loaded from the stack and prefetched in chunks. This improves cache performance by orders of magnitude.
Determinism:
- In Python, program operation can only be determined by running the code. In SAOT, since all the information is known at compile time, programs can be further folded down, loops unrolled, and/or SIMD applied.
Run time overheads:
- Python requires an interpretive run time. SAOT does not.
In summary: Python necessarily requires extra work at run time due to dynamic behaviour. SAOT languages can eliminate this extra work.
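The boxing point can be seen from inside Python, assuming CPython on a 64-bit platform (`sys.getsizeof` is implementation-specific). The sketch compares one boxed `int` object with one element of a contiguous `array` of 64-bit integers.

```python
import sys
from array import array

boxed = [1, 2, 3, 4]
unboxed = array("q", boxed)  # contiguous signed 64-bit integers

# Each list slot is a pointer to a full int object living elsewhere on the heap.
per_boxed_int = sys.getsizeof(boxed[0])
# Each array element is the raw value itself, stored inline.
per_unboxed_int = unboxed.itemsize
print(per_boxed_int, per_unboxed_int)
```

The size difference is only part of the cost; the pointer chasing per element is what hurts cache behaviour.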
I do understand though that with JIT a lot of these costs can be reduced massively if not eliminated once the JIT has run through the code once. For example here they go through the painful process of optimising Python code to find what is actually slowing things down, to the point of rewriting in C: http://blog.kevmod.com/2020/05/python-performance-its-not-ju...
At the end they point out that PyPy gives a very impressive result that is actually faster than their C code. Of course, this benchmark is largely testing unicode string libraries rather than the language itself and I'd argue this is an outlier.
> How much of the literature have you read?
Literature on speeding up Python or high performance computing? The former, very little, the latter, quite a lot. My background is in performance computing and embedded software.
I'm definitely interested in the subject though if you've got some good reading material?
> people said monkey-patching in Python and Ruby was a hard overhead to peak temporal performance and fundamentally added a cost that could not be removed... turns out no that cost can be completely eliminated.
This really surprised me. Completely eliminated? I'm really curious how this is possible. Do you have any links explaining this?
My PhD's a good starting point on this subject https://chrisseaton.com/phd/, or I maintain https://rubybib.org/.
> This really surprised me. Completely eliminated? I'm really curious how this is possible. Do you have any links explaining this?
Through dynamic deoptimisation. Instead of checking if a method has been redefined... turn it on its head. Assume it has not (so machine code is literally exactly the same as if monkey patching was not possible), and get threads that want to monkey patch to stop other threads and tell them to start checking for redefined methods.
This is a great example because people said 'surely... surely... there will always be some overhead to check for monkey patching - no possible way to solve this can't be done' until people found the result already in the literature that solves it.
As long as you are not redefining methods in your fast path... it's literally exactly the same machine code as if monkey patching was not possible.
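A toy model of that inversion, in plain Python. `CallSite` and `MethodTable` are hypothetical names, and a real VM does this with compiled machine code and thread safepoints rather than objects; the sketch only shows who pays. The call path contains no "was this redefined?" check, and the code doing the redefinition re-points every linked call site.

```python
class CallSite:
    """A linked call site: calling it performs no redefinition check."""

    __slots__ = ("target",)

    def __init__(self, target):
        self.target = target

    def __call__(self, *args):
        return self.target(*args)


class MethodTable:
    def __init__(self):
        self._methods = {}
        self._sites = {}

    def link(self, name):
        # Bind a call site directly to the current definition.
        site = CallSite(self._methods[name])
        self._sites.setdefault(name, []).append(site)
        return site

    def define(self, name, function):
        # The writer pays: every call site linked to this name is re-pointed.
        self._methods[name] = function
        for site in self._sites.get(name, ()):
            site.target = function


table = MethodTable()
table.define("greet", lambda: "hello")
greet = table.link("greet")
before = greet()
table.define("greet", lambda: "patched")  # the "monkey patch"
after = greet()
print(before, after)
```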
> get threads that want to monkey patch to stop other threads and tell them to start checking for redefined methods
As an aside, this sort of reminds me of branch prediction at a higher level. A very neat way to speed up for the general case of no patching.
> This is a great example because people said 'surely... surely... there will always be some overhead to check for monkey patching - no possible way to solve this can't be done' until people found the result already in the literature that solves it.
There is still overhead when patching is used though. If you don't use the feature, you don't pay the cost, however when monkey patching is used there is a very definite cost to rewriting the JIT code and thread synchronisation that compiled languages would simply not have.
I can see where you're coming from here. If we can reduce all dynamic costs that aren't used to nothing then we will have the same performance as, say, C. At least, in theory.
It would certainly be interesting to see a dynamic language that can deoptimise all its functionality to a static version to match a statically compiled language. Still, any dynamic features would nevertheless incur an increased cost over static languages.
It's the dynamism itself of Python that incurs the performance ceiling over static compilation, plus the issues I mentioned in my previous reply about boxing and cache pressures. However you've definitely given me some food for thought over how close the gap could potentially be.
Turns out a dictionary lookup in raw Python is two orders of magnitude slower than an equivalent hashmap lookup in Java.
Once the numbers came in, we realized we had to choose the right language for the job.
Python is great but it's not the end-all, be-all.
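For anyone wanting to reproduce the Python half of such a comparison, a minimal `timeit` sketch follows. The table size and key are arbitrary assumptions, and the result says nothing about Java; a comparable JMH-style measurement would be needed on that side.

```python
import timeit

table = {f"key{i}": i for i in range(1000)}

# Time one million lookups of a key that is known to be present.
lookups = 1_000_000
seconds = timeit.timeit("table['key500']", globals={"table": table}, number=lookups)
print(f"{seconds / lookups * 1e9:.1f} ns per lookup")
```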
Yes I agree often it's better to rewrite in a different language if you can.
But if people tell me they want to program in Python or Ruby and they tell me it's worth it for them... then let's make it as fast as we can for them.