Show HN: Cannoli – A compiler for a subset of Python written in Rust (github.com)

292 points by joncatanio 8 years ago | 94 comments

joncatanio 8 years ago |

I recently finished the code for my thesis and wanted to share it with you all :). The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues. Quantifying the cost of these features could be valuable to language designers moving forward. Some interesting results were observed when implementing compiler optimizations for Python. An average speedup of 51% was achieved across a number of benchmarks. The thesis paper is linked on the GitHub repo; I encourage you to read it!

This was also my first experience with Rust. The Rust community is absolutely fantastic and the documentation is great. I had very little trouble with the "learning curve hell" that I hear associated with the language. It was definitely a great choice for this work.

I also included PyPy in my validation section and "WOW". It blew both Cannoli and CPython out of the water in performance. The work they're doing is very interesting and it definitely showed on the benchmarks I worked with.

sandGorgon 8 years ago | |

I still don't understand why PyPy hasn't been adopted by Google or Dropbox (the standard bearers of the Python ecosystem) as a forward-looking investment. It is constantly underfunded (https://pypy.org/py3donate.html), and given the potential of the work that's happening, I don't understand why these guys don't write cheques for a few hundred k.

joncatanio 8 years ago | | |

After I ran the experimental evaluation, I had similar thoughts. If PyPy ever matches the current version of CPython, I'm not sure why one wouldn't use PyPy over CPython. The biggest hurdle is matching support for popular libraries like NumPy, TensorFlow, Pandas, SciPy, etc. I know they're working on supporting these; it's definitely a lot of work, easier said than done.

sanxiyn 8 years ago | | |

I am hoping Facebook will fund PyPy, given that Instagram runs on Python.

It seems Google and Dropbox are not interested. Google is working on Grumpy, Dropbox worked on Pyston.

pjmlp 8 years ago | | |

In Google's case, it appears they find it more worthwhile to migrate their Python code to Go and Swift than to improve Python runtimes.

Remember Unladen Swallow?

rplnt 8 years ago | | |

Wouldn't the biggest issue still be that C modules either don't work or are slow? I'd imagine it's much better to be able to solve performance bottlenecks by using Cython/C than to have an overall faster runtime but no option to go further.

nudpiedo 8 years ago | | |

Engineers usually have more fun rewriting everything in “that new shiny tool that people are talking about”.

Managers enjoy avoiding conflicts.

Very rarely will someone in a position of power point to this kind of solution, which is going to be against the wishes of many employees anyway.

sametmax 8 years ago | | |

The Python ecosystem in general is severely underfunded despite all big players using it extensively.

I think one reason is that the community is doing too good a job. The language is pretty sane, it solves most problems right, the libs and docs are good, and the general direction things take is reasonable. And it's free not only as in beer and as in freedom, but also free from business influences. The PSF is really giving away pretty much everything.

Everybody contributes a little (we have Brett Cannon's team at MS, Guido's team at Dropbox, Alex Martelli's team at Google, Mozilla even donated to PyPy, etc.). But it's nothing massive. Nobody said "ok, here are 10 million euros, solve the packaging problem".

Compare to JS: the language started out slow, with a terrible design and no consensus on the direction to take. So eventually people (Google first) poured a load of money into it until it became usable, and it got cleaner leadership. They had a huge problem to solve in the ever-expanding market that is the web platform. Of course, JS has the unfair advantage of a captive audience and a total monopoly in its field.

Remember Unladen Swallow? "Google's" attempt to JIT Python? It was just one guy during his internship (http://qinsb.blogspot.fr/2011/03/unladen-swallow-retrospecti...).

And look at the budget the PSF had in 2011 to help the community: http://pyfound.blogspot.fr/2012/01/psf-grants-over-37000-to-... I mean, even today they have to go through so many shenanigans for barely 20k (https://www.python.org/psf/donations/2018-q2-drive/).

But at the same time you hear people complaining that they can't migrate to Python 3 yet because they have millions of lines of Python. You hear from them when they want support extended for free, but never when it comes to supporting the community.

It's ridiculous.

Also compare to PHP: the creators made a business out of it, plain and simple.

Compare to Java/C#/Go: they're owned by huge players that have a lot of money invested.

Python really needs a sugar daddy so that we can tackle the few items remaining on the list:

- integrated steps to make an exe/rpm/deb/.app

- JIT that works everywhere

- mobile dev

- multi-core with fast and safe memory sharing

There are projects for that (Nuitka, Pyjion, Kivy, etc.), but they all lack manpower and money, and hence integration, performance, features, etc.

You need a simple way to code some GUI, make it work on mobile or desktop, turn it into an exe and distribute it.

You need a simple way to say "this is a long running process, JIT the hell out of it".

gergo_barany 8 years ago | |

Interesting work! I have a bunch of comments and questions.

> The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues.

In another life I did something similar using a similar compiler simulation technique, looking at other Python features like redundant reference count operations, boxing of numbers, dynamic type checks etc. See G. Barany, Python Interpreter Performance Deconstructed. Dyla'14. http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf

After obtaining the numbers in that paper, the work didn't really go anywhere; there were no really obvious optimizations to try based on the data. But it was fun!

Anyway, questions:

1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

I ask this in particular because if I understand your thesis correctly, you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value. That's a lot faster. And you would get it automatically if you started from bytecode. (Plus, it would be easier to parse, but if you have fun parsing stuff, that's reasonable too.)

2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?

ptx 8 years ago | | |

Regarding the bytecode - it was always considered an internal implementation detail subject to change (unlike the JVM bytecode) and in 3.6 they have in fact made a fairly major change[1]:

"The Python interpreter now uses a 16-bit wordcode instead of bytecode which made a number of opcode optimizations possible."

They haven't been shy about changing it in the past either, since there's no guarantee of stability, so it's likely to continue to change in incompatible ways.

[1] https://docs.python.org/3/whatsnew/3.6.html#optimizations

pddubs 8 years ago | | |

> you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value.

This is only true for function arguments right? Module level bindings and class and object attributes are looked up in dictionaries. I think the same for variables used in closures too?
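For what it's worth, CPython's code objects make this split visible. Below is a minimal sketch (the functions `outer` and `inner` are made up for illustration): arguments and plain locals both end up in `co_varnames` and are accessed by index, closure variables are listed in `co_freevars` and go through cells, and global or builtin names are listed in `co_names` and resolved through dictionaries at run time.

```python
def outer(scale):
    def inner(x):
        # x and total are locals, scale is a closure variable, abs is a builtin.
        total = abs(x) * scale
        return total
    return inner


inner = outer(3)
code = inner.__code__

print(code.co_varnames)  # arguments first, then other locals; all slot-indexed
print(code.co_freevars)  # names captured from the enclosing function (cells)
print(code.co_names)     # names resolved through the globals/builtins dicts
```

So it is not only function arguments: any name assigned inside the function body gets an index, while module-level names and attributes still go through dictionary lookups.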

mpweiher 8 years ago | | |

> the work didn't really go anywhere;

That's really too bad.

> there were no really obvious optimizations to try based on the data.

Is that because Python already is the way it is? In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?

Asking for a friend ;-)

joncatanio 8 years ago | | |

That work is great!

> We have presented the first limit study that tries to quantify the costs of various dynamic language features in Python.

This is spot on what we were doing as well, that's great to have this as a reference.

> 1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

We did not consider this actually. This would be a very interesting concept to explore. For the unoptimized version of Cannoli we do look up variables in a list of hash tables (which represent the current levels of scope). We did perform a scope optimization that then uses indices to access scope elements and this was much faster. However, it meant that the use of features like `exec` and `del` was no longer permitted, since we would not be able to statically determine all scope elements at compile time (consider `exec(input())`: this could introduce anything into scope and we can't track that).
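To make the two lookup strategies concrete, here is a rough sketch in Python rather than in Cannoli's generated Rust. All names (`lookup_dynamic`, `resolve_static`, `lookup_indexed`) are hypothetical and are not taken from the Cannoli source; the sketch only assumes what is described above, a stack of hash tables versus precomputed indices.

```python
def lookup_dynamic(scopes, name):
    """Unoptimized: search a stack of hash tables, innermost scope first."""
    for table in reversed(scopes):
        if name in table:
            return table[name]
    raise NameError(name)


def resolve_static(scope_names, name):
    """'Compile time': turn a name into a (depth, slot) pair once."""
    for depth in range(len(scope_names) - 1, -1, -1):
        if name in scope_names[depth]:
            return depth, scope_names[depth].index(name)
    raise NameError(name)


def lookup_indexed(frames, address):
    """Optimized: two list indexings, no hashing at run time."""
    depth, slot = address
    return frames[depth][slot]


# Dynamic representation: one dict per scope level.
scopes = [{"x": 1, "y": 2}, {"x": 10}]
# Static representation of the same program: names are known ahead of time
# and values are stored positionally.
scope_names = [["x", "y"], ["x"]]
frames = [[1, 2], [10]]

addr_x = resolve_static(scope_names, "x")  # the innermost x shadows the outer one
addr_y = resolve_static(scope_names, "y")
print(lookup_dynamic(scopes, "x"), lookup_indexed(frames, addr_x))
print(lookup_dynamic(scopes, "y"), lookup_indexed(frames, addr_y))
```

The indexed version only works if `scope_names` is complete before the program runs, which is exactly what a dynamically executed string can break.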

If you know, how does CPython resolve scope if it maps variable names to indices? In the case of `exec(input())` and say the input string is `x = 1`, how would it compile bytecode to allocate space for x and index into the value? I don't have much experience with the CPython source, so please excuse me if the question seems naive :)!
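A small experiment hints at the answer, at least for CPython 3 inside a function: as far as I can tell, no slot is allocated at all. The name written by `exec` lands in a dictionary, and the function's own reference to it is compiled as a global lookup because the compiler never saw an assignment. The variable name `cannoli_x` and both function names are made up for this sketch.

```python
def run_exec_in_function():
    exec("cannoli_x = 1")      # writes into a dict of locals, not a local slot
    try:
        return cannoli_x       # compiled as a global lookup; no slot exists
    except NameError:
        return "NameError"


def run_exec_with_namespace():
    namespace = {}
    exec("cannoli_x = 1", namespace)   # the caller supplies the dict explicitly
    return namespace["cannoli_x"]


print(run_exec_in_function())
print(run_exec_with_namespace())
```

At module level the picture is different, because module scope is already a dictionary and `exec` can simply add a key to it.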

> 2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

Elements of the Value enum (that encapsulates all types) relied on `Rc` and `RefCell` to defer borrow checking to run time. Consider a function that has a local variable that instantiates some object. Once that function call has finished, Cannoli will pop that local scope table and all mappings will be dropped when it goes out of scope. The object encapsulated in an `Rc` will have its reference count decremented to 0 and be freed.

This is how I've interpreted the Rust borrow checker, I will say that this was the first time I had ever used Rust so it's possible that I am not completely right on this. But once that table goes out of scope, all elements should be dropped by the borrow checker and any Rc should be decremented/dropped.
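The CPython analogue of this behavior can be observed directly. The following is only a sketch of the Python side, not of Cannoli's generated Rust, and it relies on CPython's reference counting (other Python implementations may free the object later). The class `Thing` and the function `make_local_object` are hypothetical.

```python
import weakref

events = []


class Thing:
    pass


def make_local_object():
    obj = Thing()
    # Record the moment the object is actually destroyed.
    weakref.finalize(obj, events.append, "freed")
    events.append("returning")


make_local_object()
events.append("after call")
print(events)
```

On CPython the "freed" event shows up between the other two, i.e. the object dies as soon as the local scope that held the only reference goes away.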

> 3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

I did defer a lot of borrow checking to run time with Rc, but I tried to use this as little as possible to maximize optimizations that may result from static borrow checking.

> 4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?

If you remove the 3 outlier benchmarks (that are slow because of Rust printing and a suboptimal implementation of slices), Cannoli isn't too far off from CPython. And in fact, with the ray casting benchmark, Cannoli began to outperform CPython at scale. This leads me to believe that the computations in Cannoli are faster than CPython. However, there is still a lot of work to do to create a more performant version of Cannoli. The compiler itself was only developed for ~4 months, and I have no doubt that more development time would yield better results.

That being said, I think the biggest slowdown comes from features of Rust that might not have been utilized. This is just speculation, but I think the use of lifetimes could benefit the compiled code a lot. I also think there may be more elegant solutions to some of the translations (e.g. slices) that could provide speedup. But I can't say that there is one thing causing the slowdown, and profiling the benchmarks (excluding the outliers) supports that.

metalliqaz 8 years ago | |

I am aware of PyPy but have not used it myself. My understanding of PyPy is that it gains performance improvements mainly through a hotspot JIT compiler. If Cannoli compiles the entire Python program down to machine code (via rust) then how does PyPy "blow it away"?

joncatanio 8 years ago | | |

As others have commented, AOT compilation is limited to the information available at compile time. Various features of Python like dynamic typing and object/class mutation (via del) preclude many static analysis techniques. In Cannoli, this meant that the compiler also had to generate code that manages scope at run time. Whenever an identifier was encountered in the compiled code, a hashmap would be searched to find the bound value. This overhead becomes expensive, and the thesis covers optimizations that avoid this. PyPy's JIT operates on the PyPy interpreter itself, finding linear lists of operations that are frequently used. It can then compile these operations to machine code so the next time that trace is encountered it can execute the compiled code. The self-analysis at run time provides information that an AOT compiler just doesn't have.

That being said, I did leave a few suggestions in the "future work" section that talk about writing an AOT compiler for RPython (the version of Python that PyPy's interpreter is written in). This would provide more information at compile time and would be an interesting comparison between a Python interpreter compiled AOT versus a Python interpreter with a JIT (PyPy).

tathougies 8 years ago | | |

Compiling to machine code is not a panacea for optimization. An optimized JIT compiler is going to blow an AOT compiler out of the water. Being smart about the machine code generated is significantly more important than generating machine code. In particular, PyPy makes several optimizations over Python code that a more direct implementation of CPython at the machine level probably wouldn't. For example, PyPy erases dictionary lookups for object member access if the object shape is statically known. Given how prevalent this kind of lookup is in Python code, it's possible that even an interpreter that made this optimization would be faster than a machine code version that used an actual hash table.
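As a rough illustration of what "erasing" the dictionary lookup means, here is a toy sketch of the shape idea in plain Python. It is not PyPy's actual implementation; the classes `Shape` and `Obj` and the helper `make_fast_getter` are invented for this example.

```python
class Shape:
    """Maps attribute names to slot indices; shared by objects with the same layout."""

    def __init__(self, names):
        self.index = {name: i for i, name in enumerate(names)}


class Obj:
    def __init__(self, shape, values):
        self.shape = shape
        self.slots = list(values)


def get_attr_generic(obj, name):
    # One hash lookup on every access.
    return obj.slots[obj.shape.index[name]]


def make_fast_getter(shape, name):
    # Resolve the name once; later accesses compare shapes and index a list.
    slot = shape.index[name]

    def getter(obj):
        if obj.shape is shape:                # guard: the layout we specialized for
            return obj.slots[slot]
        return get_attr_generic(obj, name)    # fall back for any other layout

    return getter


point_shape = Shape(["x", "y"])
other_shape = Shape(["y", "x"])
p = Obj(point_shape, [3, 4])
q = Obj(other_shape, [7, 8])

get_y = make_fast_getter(point_shape, "y")
print(get_y(p), get_y(q))
```

In a real JIT the guard and the indexed load would be emitted as a couple of machine instructions; the sketch only shows where the hash lookup disappears.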

I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.

ori_b 8 years ago | | |

For most dynamic languages, the available speedups aren't in simple compilation, but in removing the runtime type checks, method lookups, and other slow operations. This needs the ability to guess at what the code is going to do based on past behavior, and generate specialized versions that get thrown away if the guesses are invalidated.

So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.
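A toy version of that guess-and-guard loop, written as an ordinary Python wrapper: the helper `specialize_for_ints`, its `threshold` parameter and the two `*_add` functions are all hypothetical, and a real JIT would generate the fast variant itself rather than receive it as an argument.

```python
def specialize_for_ints(generic, fast_int, threshold=100):
    """After `threshold` consecutive all-int calls, try `fast_int` first."""
    state = {"int_calls": 0, "specialized": False}

    def wrapper(*args):
        all_ints = all(type(a) is int for a in args)
        if state["specialized"]:
            if all_ints:                     # guard holds: take the fast path
                return fast_int(*args)
            state["specialized"] = False     # guess invalidated: throw it away
            state["int_calls"] = 0
        elif all_ints:
            state["int_calls"] += 1
            if state["int_calls"] >= threshold:
                state["specialized"] = True
        else:
            state["int_calls"] = 0
        return generic(*args)

    return wrapper


calls = []


def generic_add(a, b):
    calls.append("generic")
    return a + b


def int_add(a, b):
    calls.append("fast")
    return a + b


add = specialize_for_ints(generic_add, int_add, threshold=3)
results = [add(1, 2) for _ in range(4)] + [add("a", "b"), add(1, 2)]
print(results)
print(calls)
```

The fourth call takes the fast path, the string call fails the guard and discards the specialization, and the counter starts over from there.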

Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).

xapata 8 years ago | | |

JITs get to analyze both code and data and optimize for each machine deployed to. A static compiler can only analyze code and the machine used for compilation. If dependencies were pre-compiled, the static compiler won't be able to optimize their relationship with the project. The same goes if the machine is changed for deployment.

More information means better optimizations. JITs FTW.

jimnotgym 8 years ago | |

IIRC Nuitka, the other Python compiler, claims better performance than PyPy. Is this just years of optimisation?

http://nuitka.net/

joncatanio 8 years ago | | |

Does it claim better performance than CPython or PyPy? I can't quite find the reference to PyPy (after a quick scan of the page/GitHub repo). It looks like a cool project! They seem to be doing a lot of optimizations, which they list on their GitHub page https://github.com/kayhayen/Nuitka#optimization. It looks like the git repo was created ~2013 (I dunno if it was hosted/worked on elsewhere prior to that) so they've had a few years to optimize. Cool project though!

sametmax 8 years ago | | |

It doesn't claim that at all.

collyw 8 years ago | |

Question, is Rust inherently faster than C?

I thought the main benefit was safer code. Is it just the fact that you looked at what needed to be optimized and put some effort in, or did the language choice help?

steveklabnik 8 years ago | | |

It can be, but "inherently" is a bit strong. There's also the question of "the best Rust programmer vs. the best C programmer" versus "the average Rust programmer vs. the average C programmer" here too.

welder 8 years ago | |

This could be used to ditch the Python interpreter and distribute Python binaries for Win/Linux/Mac. When will standard library, exceptions, and inheritance support be added?

halflings 8 years ago | | |

Probably never. This is a subset of Python that would break most libraries.

The author says that this is a research project, and adding the standard library (if possible at all) would be a humongous task by itself.

noobermin 8 years ago | |

Now something that would be interesting: writing Python extensions in Rust.

steveklabnik 8 years ago | | |

This is already quite possible! There are even multiple libraries to help you get started. Extending languages like this is a huge use case for Rust; one of the first production Rust uses was extending Ruby like this.

xyproto 8 years ago | |

Nuitka is also an alternative implementation of Python.

jcelerier 8 years ago | |

> I had very little trouble with the "learning curve hell" that I hear associated with the language

Dude, you're doing a thesis in computer science. Of course it's easy for you.

TomMarius 8 years ago | | |

Well, a professional software developer should be at least on the same level - and these comments are made by professionals.

joncatanio 8 years ago | | |

Well, yes I see your point. But I wouldn't say that it was easy, just not as bad as it had been hyped up to be. Although I think most of that sentiment comes from older Rust versions, when lifetimes were defined in more places and you had to learn some of the more advanced concepts.

taoistextremist 8 years ago |

This isn't a really important question, but why the name Cannoli? I feel like you missed an opportunity here to call it "PycRust". (c standing for "compiled to" of course)

joncatanio 8 years ago | |

My heritage is Italian and I happened to be eating cannoli when I started writing the parser around December, figured it'd be a fun name :)

taoistextremist 8 years ago | | |

Well, it is a fun name.

Do you plan to support this down the line? I feel like compilers that compile down to Rust could become really good tools depending on how the popularity of the language goes.

leshow 8 years ago | |

Having been around the Rust community for a few years, I'm a bit tired of -rs and rust names.

I feel like the postfix is something languages lose the bigger they get; I hope this happens to Rust too.

jrs95 8 years ago | |

I think I read this as "Pike Rust" about 5 times before I got the joke. I guess it's time for that 2PM coffee.

tothrowaway 8 years ago | | |

Took me a minute too. "pie crust".

tomjakubowski 8 years ago | | |

Fortunately, that means the "pie crust" name is still available for a Pike[1]†-Rust project :-)

[1]: https://en.wikipedia.org/wiki/Pike_(programming_language)

†: I have a particular fondness for Pike, because it's derived from the first programming language I ever used: LPC. Anyone else an LPC hacker?

harrisreynolds 8 years ago |

Nice work Jon! The cannoli logo is great!

Spun up a quick dashboard of the project here: https://chart.ly/github-dashboard/joncatanio/cannoli

Not tons of revelations there, but cool to see your longest streak was 7 days straight committing to the repo. Also cool to know this is part of your thesis.

What are your plans after Cal Poly?

joncatanio 8 years ago | |

This is very cool! Thanks for doing that :)!

I'm actually moving out to NYC this July to work for Major League Baseball. The Advanced Media division (MLBAM). I'll be doing some software engineering there, mainly API work for various apps, I'm very excited about it!

I'll have to work on compilers in my free time haha, I really enjoyed the work I did on this thesis.

alex_g 8 years ago |

This is awesome! I definitely recommend reading through Jon's thesis (link on GitHub). It's well written and very readable even if you know nothing of Rust or compilers.

ufo 8 years ago |

How was your experience using Rust as a target language (instead of C)? I understand that Rust has lots of features for when you want to write code by hand but do those also help when you are working with generated code? Or does the borrow checker get in the way all the time?

joncatanio 8 years ago | |

Great question! The "Compiling Python" section of my thesis is pretty much an explanation of how I had to translate elements of Python into Rust because of the borrow checker. There were a couple of tricks (like using closures for functions) to get around compile-time borrow checking. Some situations required the use of Rc & RefCell to provide multiple references to mutable data, which defers borrow checking to run time. So yes, the borrow checker got in the way. But I didn't have to write a garbage collector because the automatic memory management was handled via Rust's ownership rules (the caveat here is with cyclical references, which would need to be tracked; this work was omitted for time).

It does complicate the generated code, I don't know if Rust is the greatest intermediate representation. But I do think it was a better choice than C. Debugging the generated code was so great because of the detail that the Rust compiler displays for warnings/errors.

I'd be interested in seeing how a Python interpreter written in Rust would compare to CPython, this would probably make use of more Rust optimizations (than trying to generate code).

ufo 8 years ago | | |

Ah, I hadn't realized that Cannoli is also using Rust-style memory management. In that case compiling to Rust would certainly help a lot.

emmelaich 8 years ago | |

Not to answer your question but if you want C / C++, have a look at Shedskin

https://github.com/shedskin/shedskin

tathougies 8 years ago |

Interesting project... Why Python, out of curiosity?

joncatanio 8 years ago | |

I've used Python quite a bit for various projects. For a compilers class I wrote a compiler in Python and had a blast. So I spoke with that advisor and decided I wanted to get a Master's and he had suggested a project that analyses Python. The main question concerned which dynamic features of a language cause performance issues. Python just happened to have a lot of the features that we hypothesized caused slowdowns so we chose it. Plus we were both familiar with the language so that was a draw.

The same analysis could be done on JS or Ruby; it would be cool to see if a similar compiler would yield the same performance results for restricting features in JS/Ruby. It would also validate this work nicely.

gabcoh 8 years ago |

The name of the thesis this repo is a part of

> Leave the Features: Take the Cannoli - Jonathan Catanio

That's pretty good

Beltiras 8 years ago |

Leave the features - take the cannoli.

Made me laugh out loud.

alexnewman 8 years ago |

Skimmed it. They spent a lot of time flushing printing to the screen