Python extensions should be lazy(gauge.sh) |
Python extensions should be lazy(gauge.sh) |
However, the conclusion is debatable. Not everyone has this problem. Not everyone would benefit from the same solution.
Sure, if your data can be loaded, manipulated, and summarized outside of Python land, then lazy object creation is a good way to go. But then you're giving up all of the Python tooling that likely drove you to Python in the first place.
Most of the Python ecosystem from sets and dicts to the standard library is focused on manipulating native Python objects. While the syntax supports method calls to data encapsulated elsewhere, it can be costly to constantly "box and unbox" data to move back and forth between the two worlds.
I completely take your point that there are many places where this approach won't fit. It was a surprise for me to trace the performance issue to allocations and GC, specifically because it is rare.
WRT boxing and unboxing, I'd imagine it depends on access patterns primarily - given I was extracting a small portion of data from the AST only once each, it was a good fit. But I can imagine that the boxing and unboxing could be a net loss for more read-heavy use cases.
The analogy with numpy doesn’t seem quite right, as Raymond observes, because numpy depends on lots of builtin operations that operate on the underlying data representation. We don’t have any such code for the AST. You’ll still want to write Python code to traverse, inspect, and modify the AST.
Everyone would benefit from developers being more performance minded and not doing uneccesarry work though! Especially Python who has long suffered with performance issues.
Love your work btw!
Python is python because people cared about other things for many years.
When linking to code on GitHub in an article like this, for posterity, it’s a good idea to link based on a specific commit instead of a branch.
It might be a good idea to change your link to the `Py_CompileStringObject()` function in CPython’s `Python/pythonrun.c` [0] to a commit-based link [1].
[0]: https://github.com/python/cpython/blob/main/Python/pythonrun...
[1]: https://github.com/python/cpython/blob/967a4f1d180d4cd669d5c...
Side note: Your tool, Tach, seems interesting…you might want to ask @dang [0] via email [1] if he’d be willing to add your submission [2] to the second-chance pool [3] (maybe also provide a clearer and more technical explanation of the tool and its key features).
[0]: https://news.ycombinator.com/user?id=dang
[1]: mailto:hn@ycombinator.com
You could make the API transparently lazy, i.e. ast.parse creates only one AstNode object or whatever and when you ask that object for e.g. its children those are created lazily from the underlying C struct. To preserve identity (which I assume is something users of ast are more likely to rely on than usual) you'd have to add some extra book-keeping to make it not generate new objects for each access, but memoize them.
That is to say, ast.NodeVisitor living in Python is part of the problem for use cases like mine. I need the extension to own the traversal as well so that I can avoid building objects except for the result set (which is typically a very small subset). That was what led me to imagine a query-like interface instead, so that Python can give concise traversal instructions.
In the first iteration of the Rust extension, I actually used the parser from RustPython. Although I can't find it at the moment, I think the RustPython parser was actually benchmarked as worse than the builtin ast parse (when both returned Python objects).
Even with this parser, IIRC the relevant code was around 8-11x faster when it avoided the Python objects. Apart from just the 35% spent in GC itself, the memory pressure appeared to be causing CPU cache thrashing (`perf` showed much poorer cache hit rates). I'll admit though that I am far from a Valgrind expert, and there may have been another consequence of the allocations that I missed!
The key for optimizing a Python extension is to minimize the number of times you have to interact with Python.
A couple of other tips in addition to what this article provides:
1. Object pooling is quite useful as it can significantly cut down on the number of allocations.
2. Be very careful about tools like pybind11 that make it easier to write extensions for Python. They come with a significant amount of overhead. For critical hotspots, always use the raw Python C extension API.
3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.
jemalloc also gave good results with NodeJS and Ruby projects i did.
But I couldn't help but notice that when `_PyCompile_AstOptimize` fails (<0), then `arena` is never freed. I think this is bug :thinking:.
I have always loved how the trick to making Python better eventually comes down to not writing Python.
The problem time immortal is language complexity vs the ability to hint to your compiler that it can make much stronger assumptions about your code than it has to assume naturally which is where we got __slots__. And there's lots of easy wins you could get in Python that eliminate a significant amount of dynamism-- you could tell your compiler that you'll never shadow names for example, that this list is fixed size, that you don't want number promotion but they all require adding stuff to the language to indicate that.
When you're looking from the bottom up you end up making different trade-offs. Because while you get nice primitives that generate very tight assembly when you need that dynamism you end up having this object model that exists in the abstract that you orchestrate and poke at but don't really see, like gobject. Ironically, HN's love-to-hate language C++ gives you both simultaneously but at the cost of a very complicated language.
Another strategy is to actually serve your users
I hadn't considered object pooling in this context, it might be more involved since each node has distinct data but for my use case it might still be a performance win.
Have you ever used pyo3 for rust bindings? I haven't measured the overhead but I have been assuming that it's worth the tradeoff vs. rolling my own.
(I'm the author)
I wouldn't take away from that observation that pyo3 is slow (it was just a poor fit; FFI for miniscule amounts of work), but the fact that the binding costs were higher than vanilla Python computations suggests that the overhead is (was?) meaningful. I don't know how it compares to a hand-written extension.
Agrees on the broader point (and I don't like pybind11 that much anyway), but the raw Python C extension API is often hard to use correctly. I would suggest that you should at least have a rough idea about how higher-level libraries like pybind11 would translate to the C API, so that you can recognize performance pitfalls in advance.
> 3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.
Or use the `array` module in the standard library if that matters. numpy is not a small library and has quite a large impact on the initial startup time. (Libraries like PyTorch are even much worse to be fair, though.)
I think the buffer interface is too complex to provide directly to users. I think an API that returns numpy arrays is simpler and easier to understand.
When I referenced numpy, I was thinking of a query layer which could push traversal into the extension as well. Something that could have given me “.select(ast.Import).all()”, which in my head is kind of like doing a filtered sum in numpy.
Very cool to get your thoughts on this, thanks for making an account :)
I disagree with the inclination to jump to numpy. I much prefer minimizing 3rd party libraries, especially if the performance is equivalent or nearly so.
The downside is they’re on the more esoteric side of Python, so people may not be as familiar with them as other structures.
Arrays implement the buffer interface so they can be used efficiently with tools like numpy.
Wrong. With strong types you do have ability to tell the compiler that most of the dynamic checks and hooks can be omitted, and values can stay unboxed. Python, ruby, perl choose to ignore types so far. And Javascript, PHP did at least dynamic optimizations with their type hints.
> Wrong
Can you give examples of languages that achieve both? Or is this all just on a spectrum? Like if we say C# is performant, it’s still the case that (eventually) the way to make it faster is “stop writing C#”.
That has stopped being true a few years ago and in some cases was never true.
The way for a faster C# codebase is writing faster C#. .NET CoreLib including all performance-sensitive paths like memmove is written in pure C#, and the VM (by that I mean all reflection bits, TypeLoader, etc.) itself, excluding GC, is also pure C# when you are using NativeAOT.
The optimization techniques to achieve this are but not limited to using monomorphized struct generics, stack buffers, arenas and memory pooling, using SIMD API (which has the same performance characteristics as intrinsics in C/C++), not allocating by using structs or making object lifetime GC friendly if allocations cannot be avoided, making otherwise safe code bounds check elision friendly, reducing indirection, etc. Many of these are exact same as what you would do in languages like C, C++ or Rust.
As a result, the use of FFI to call into C/C++/Rust/ObjC/Swift(upcoming native Swift ABI support)/etc. today is predominantly relegated to accessing necessary OS APIs and libraries.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Of course most of these optimizations are at odds with "dynamism" and yield the speed-up by making the dispatch static and inlining-friendly, and giving the compiler more information it can prove and make use of. Not to mention C# (together with Swift) sits the closest to the metal among otherwise high-level languages by virtue of what .NET is and what its ILC and JIT compile IL to.
It's been a few years, so I really can't tell you exactly what the problem was (other than the general observation that you should try to do nontrivial amounts of work in your python extensions rather than trivial amounts), but PyO3 agrees with the general sentiment [0] [1], or at least did at roughly the same time I was working there.
"Providing a quality implementation of matrix multiplication is highly non-trivial. Naive nested loop implementations are very slow and shipping such an implementation in CPython would just create a trap for users. But the alternative – providing a modern, competitive matrix multiply – would require that CPython link to a BLAS library, which brings a set of new complications. In particular, several popular BLAS libraries (including the one that ships by default on OS X) currently break the use of multiprocessing."
That's the second benefit of focus on perfs.
We could do with less, better tested, and faster, features in most apps I know of.
You can, but each added focus degrades the quality of the others.
The key principle is thinking with a mindset of cost.
Even if it’s low hanging fruit, there’s a world of difference between assuming we can work it in, and saying, “this is what it will cost, and this is what we won’t work on result”.
And similarly, saying ”that’s impossible” is not in the same universe as “the cost is extremely high and it’s not important enough to us to pay that cost”.
People seem to have taken the quote, “premature optimization is the root of all evil” to mean “don’t focus on performance until you absolutely have to,” but when you push them to prove that they don’t have to, they often haven’t even profiled their code! Cloud compute – and especially K8s – has made it such that it’s expected that you simply over-provision and scale as necessary to get the throughput you need. I don’t personally see running through a profiler (ideally as part of CI) as being particularly difficult or onerous.
Only if the same people do both.
Or if dev that would rather focus on perf find it equaly motivating and fun to work on something else instead.
That’s my gut feeling, that writing fast C# basically ends up looking like C++/Rust/etc where the niceties of C# are no longer present.
Which is the same as rewriting Python in cython. It’s way faster, and it doesn’t look like Python, or have the benefits of Python, and now just looks like weird C.
Is there something specific you would like to see an example for?