Python at Scale: Strict Modules

Python at Scale: Strict Modules (instagram-engineering.com)

388 points by apadmarao 6 years ago | 251 comments

More and more I want someone to create a new language that amounts to a strict subset of Python, with mypy built-in, and is compilable into machine code. Python has by far my favorite syntax and community, and in my experience leads to the greatest productivity. There just happen to be a lot of overly dynamic features that aren't even used by most, but are used just enough to hold back optimization and structural improvement.

staticassertion 6 years ago | |

https://github.com/python/mypy/tree/master/mypyc

This may interest you. This is probably going to be the 'official' way to get what you're talking about.

timothycrosley 6 years ago | | |

Thank you for sharing! This looks really promising, I'll try to think of ways I can contribute to the project.

ledauphin 6 years ago | | |

yeah, I'm excited about this approach. but it's a long way from being a realistic approach for folks outside Dropbox, it seems.

cpeterso 6 years ago | |

I like Python, but I often wonder how many developers use Python because they actually use dynamic language features versus just liking the language's clean syntax and library ecosystem. I'm surprised languages that offer both a REPL (for development) and AOT native compilation (for production), like OCaml, are not more popular. Evidence that syntax matters, I guess. :)

mypy and mypyc are interesting but their compile-time checks and optimizations are still hampered by Python's dynamic language semantics.

clintonb 6 years ago | | |

Don’t underestimate inertia. I’ve worked with Python and Django for seven years. I know the libraries in the ecosystem. I know the framework. It’s far easier for me to start a project with Django than to learn another framework or language.

peteradio 6 years ago | | |

Names matter and OCaml is a crappy name.

pjmlp 6 years ago | | |

The dynamic language semantics of Lisp, Scheme, Smalltalk, JavaScript have not hampered the existence of good JIT/AOT compilers.

In Smalltalk, for example, you can completely change the structure of a class by sending a become: message.

What I think is missing is a bit of more PyPy love, and the Truffle and OpenJ9 Python support efforts.

Sophistifunk 6 years ago | | |

I think a great deal of this sort of thing could be done by just doing some eval in a dynamic state before you stop the vm and compile its stable state, rather than the actual source code.

scrollaway 6 years ago | |

Python has some of my favourite syntax as well but I absolutely hate its annotations. TypeScript got typings right.

I think the killer language will be typescript with access to both the python and JavaScript ecosystems. We'll see what that looks like.

And of course if something changes the syntax, better anonymous functions will be the absolute first thing I would look for...

timothycrosley 6 years ago | | |

> TypeScript got typings right.

I have not used TypeScript, but looking at its documentation, the syntax for type annotations looks identical. Would you be willing to expand on why you think its approach is better / how it's different?

breatheoften 6 years ago | | |

> I think the killer language will be typescript with access to both the python and JavaScript ecosystems. We'll see what that looks like.

I think this is an extremely good idea. Python is horrible but forced on a huge number of developers because of its ecosystem ... I think a bridging layer from typescript to python could be built in a way similar to swift’s Python Interop — and I don’t think it would require any special language support ...

I think one could actually make a better, easier to use, and more robust design than Swift's by requiring all interactions with the Python interpreter from node to be async.

TylerE 6 years ago | | |

I don't think there is any sort of long-term future in anything "Python". I think a successful modern language has to have the potential for efficient concurrency baked in, which isn't really possible without breaking compatibility, and the Python community would never survive another round like the 2->3 transition. (And I'm not convinced the community really survived that one either, given the amount of ongoing bitterness about the whole situation).

moksly 6 years ago | |

Aside from having a two-decade history of using C#, types are the only thing preventing us from going full Python. Even so, the dynamic types in Python are more often a benefit than a disadvantage because Python is so great at handling them automatically.

We build our employee database, and from there our IDM, from a single XML file in a really shitty format + three txt files in even worse formats (they are single line output files from an old mainframe system predating SAP). We used to do it in a rather complicated Microsoft SSIS workflow with a lot of C# services. All in all it was a 30 minute nightly runtime. I recently replaced it with around 500 lines of Python and a 1-5 minute runtime (sometimes at the beginning of a school year we'll see changes to around 1000 positions).

Python eats the XML like it wasn't shit. It takes things like terrible date formats, we're talking the output of a SAP free-text box shitty, and ports them seamlessly into a SQL date field. This alone was a nightmare in C# and Python just does it.

Still, after two decades of strict types it feels dangerous.

Rotareti 6 years ago | |

I can imagine the next big programming language will be one that is split into two language-variants: the "low-level-variant" and the "high-level-variant".

The high-level-variant is a dynamic language with optional typing, which is good for scripting, fast prototyping, fast time-to-market, etc.

The low-level-variant is similar to the high-level-variant (same syntax, same features mostly, same documentation), but it has no garbage collector, typing is mandatory and it runs fast like C/C++/Rust. Compiled packages that are written in the low-level-variant can be used from the high-level-variant without additional effort at all. The tooling to achieve this comes with the language.

A language like this would be insane, IMHO.

nikki93 6 years ago | | |

A key consideration here would probably be the expression and passing around of managed instances spawned in the high-level variant through low-level code. Would you explicitly retain and release them? -- etc. I think it should be an ergonomic solution for this language to provide an edge over just using C / C++ / etc. with Lua / Python / etc.

alex7o 6 years ago | | |

You can say that this is typescript and assemblyscript, they have the same syntax but one of them is compiled natively (wasm).

jsmeaton 6 years ago | |

For the sites I typically work on it’s very hard to give up the Django admin and all of the features it provides.

At the same time, I’d love a stronger type system to avoid a bunch of the pitfalls that the dynamism of python has.

So count me in.

thelastbender12 6 years ago | |

Very much this! For numerical computing, Numba + llvmlite attempts to do it.

I don't know however if this approach could be extended to other domains - say making a web framework. Given, python classes let you do so much tinkering, any attempts to port existing code will probably need a lot of rewriting?

totalperspectiv 6 years ago | |

I'm hoping someone with more experience will chime in, but what about rpython / pypy?

chenzhekl 6 years ago | |

How about Nim? It has Python-like syntax and is as fast as C. https://nim-lang.org/

timothycrosley 6 years ago | | |

Answered this below:

> I've been tracking nim, and would agree it's the most promising so far! I feel though that it's trying to be too flexible in many ways. Examples of this include allowing multiple different garbage collectors and encouraging heavy ast manipulation. I'm also afraid it is different enough to keep it from attracting a significant amount of developers from the Python community. Nonetheless, it's something I plan on using and contributing to, since it's the best option so far.

Though, now that another commenter pointed out mypyc: https://github.com/mypyc/mypyc I believe I'll invest my limited free-time in that project instead, as it will allow me to stay within the Python community and eco-system that I love so much.

Jefro118 6 years ago | | |

Just in case it's of interest to anyone reading this, I interviewed the designer of Nim, Andreas, about his design choices and what he learned from Python and the C family here: https://sourcesort.com/interview/andreas-rumpf-on-creating-a...

Gives some good insight into where Nim is going in the future too.

strokirk 6 years ago | | |

It's certainly interesting to use! However, its type checker still has a long way to go, since you can easily segfault due to using a nil reference.

bratao 6 years ago | |

There is https://github.com/python/mypy/tree/master/mypyc that I think is a great idea and approach

paulie_a 6 years ago | |

I completely agree. With Python I need ten packages. With the shit show of JavaScript I need 100 conflicting packages. Why bother with JS for a backend framework? It's a worthless language for backend development.

Nimitz14 6 years ago | |

Yeah I was hoping Nim would be it but I don't like the syntax they use.

carapace 6 years ago | |

Cython? Nuitka?

timothycrosley 6 years ago | | |

I use Cython a lot! But mostly to speed up existing Python code, and build C-extensions faster. I don't see it as a strict subset of Python or a new language to build a community around. Nuitka I just started experimenting with to build standalone Python executables, and I really like the direction and roadmap they are following. In the end though, both of these technologies seem like ways to somewhat speed up existing Python code and not attempts to introduce a strict language subset that would allow the greatest amount of optimization, and finally fix long-running issues, like the inability to have multiple versions of a package installed.

TylerE 6 years ago | |

nim

timothycrosley 6 years ago | | |

I've been tracking nim, and would agree it's the most promising so far! I feel though that it's trying to be too flexible in many ways. Examples of this include allowing multiple different garbage collectors and encouraging heavy ast manipulation. I'm also afraid it is different enough to keep it from attracting a significant amount of developers from the Python community. Nonetheless, it's something I plan on using and contributing to, since it's the best option so far.

weberc2 6 years ago | |

Sounds like Go. ;) This is a cheeky remark, but I use Python and Go, and Go very much feels like an improved Python in most ways. Especially when it comes to static analysis, build tooling, distribution, performance, etc. In particular, I love that there are no venvs, pipenvs, virtualenvs, pyenvs, wheels, eggs, setuptools, easy_installs, etc.

nine_k 6 years ago | | |

What Go adds in tooling and performance, it takes away in expressivity.

What takes 3 lines in Python takes 10-30 in Go.

timothycrosley 6 years ago | | |

I hate the fact that you may be right, because I really don't like Go in many ways:

- I hate its module system and package eco-system story.

- I don't like its syntax.

- I don't like its error handling.

- I'd much prefer gradual typing.

- I want to maintain the ability to use interactive interpreters.

- I don't like the fact that instead of being community driven it is Google driven.

But, anecdotally, I see go being used as a second language to Python more than anything else and at an ever accelerating rate.

ledauphin 6 years ago | | |

Go may "feel" like Python, but it's almost nothing like Python in actual practice. It's not dynamic (and doesn't even have generics), and its error handling is dramatically different.

allan_s 6 years ago |

> This means that just by importing this module, we're mutating global state somewhere else.

Yes, this !

That's what I hate Django and some Flask apps the most for: the fact that by importing a module, you're implicitly creating a database connection and doing a lot of other magic stuff, which means that I can't import a constant defined in said module outside of `python manage.py`

Also, as said below in the article, it suddenly becomes much harder to smoothly handle "the database is momentarily unavailable" (because someone has put the line starting the database connection in the global space of a module somewhere)

I much prefer frameworks/modules for which code is executed only once you invoke their "setup" function

ledauphin 6 years ago |

I love the idea, but it feels like just an idea at this point. I'd rather read about them releasing their 'compile-time' analyzer and revealing their measurements for how much startup time it saves.

In our codebase, we have pretty strict developer-enforced rules about not doing I/O at the module level, usually through the use of simple "Lazy" wrappers for module-level objects. I'd be curious to know what other approaches people have taken with Python here.

rectangletangle 6 years ago | |

It is an interesting approach, though I feel like this could introduce some nasty unintended consequences given how dynamic and introspective Python can be (admittedly I haven't studied this particular implementation).

I always treated this a bit like single underscore private functions/methods, i.e., follow a convention that produces code that's easy to reason about, even if it's not strictly enforced by the language/compiler. So in practice this equates to separating out modules that mutate global state, and placing the majority of logic in "strict" modules that only declare a bunch of "pure" classes/routines. So the "non strict" code is really just a thin layer of wiring gluing everything together. For instance my Celery task files tend to be very thin.

ledauphin 6 years ago | | |

well, we also heavily use static typing, so you end up with something like

my_db_conn: Lazy[DbConn] = Lazy(lambda: make_db_conn(...))

and MyPy will tell you if you're doing something silly when you try to use it.

EDIT: After typing up this response and submitting I realize you were talking about their strict approach rather than ours. whoops :)
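
The comment does not show how `Lazy` is implemented, so the following is only a guess at what such a wrapper might look like. The `get()` method name and the `make_db_conn` stand-in are assumptions, and this version is not thread-safe.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Defer building a value until the first call to get()."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._built = False

    def get(self) -> T:
        # Run the factory once, then keep returning the cached result.
        if not self._built:
            self._value = self._factory()
            self._built = True
        return self._value  # type: ignore[return-value]


calls: List[str] = []


def make_db_conn() -> dict:
    # Hypothetical factory; a real one would perform I/O here.
    calls.append("connect")
    return {"connected": True}


# Defining this at module level performs no I/O; only get() does.
my_db_conn: Lazy[dict] = Lazy(make_db_conn)
```

Because `Lazy` is generic, a type checker can still see that `my_db_conn.get()` yields the wrapped type, which matches the point about MyPy catching misuse.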

jedberg 6 years ago |

It's interesting to me that they are going down this path instead of the microservices path. This seems like something ripe for slowly breaking down into microservices.

Someone made a change that took down production because of non-deterministic outcomes? How about breaking out whatever they were changing into its own service? With proper fallbacks, breaking that part shouldn't take down all of production again.

To be clear, I'm not saying microservices will solve all their problems or be less work. I'm just saying that with an equal level of effort, they would probably get more overall reliability by having multiple services, they'd be able to use multiple languages, whatever is suited to the task at hand, be able to deploy even more often with less risk, and be able to isolate these types of "change on import" behavior to a much smaller surface on any given deployment.

ben509 6 years ago |

> How do we know that the log_to_network or route functions are not safe to call at module level? We assume that anything imported from a non-strict module is unsafe, except for certain standard library functions that are known safe.

It's hard to know anything about the stdlib as it can be monkey patched, e.g. [1]

That said, you could solve this with diagnostics; calculate signatures of stdlib functions and classes to find any known safe ones that were patched. Run that check in your test suite to find problematic imports.
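
One rough way to sketch that diagnostic, assuming a hand-written allow-list: record the objects behind a few stdlib names before application code is imported, then check later whether any name points somewhere else. `KNOWN_SAFE`, `snapshot` and `find_patched` are hypothetical names, and the check uses object identity rather than a computed signature.

```python
import importlib

# Hypothetical allow-list of "known safe" stdlib callables, as (module, attribute) pairs.
KNOWN_SAFE = [("json", "dumps"), ("json", "loads"), ("os.path", "join")]


def snapshot(names):
    """Record the current object behind each (module, attribute) pair."""
    return {
        (mod, attr): getattr(importlib.import_module(mod), attr)
        for mod, attr in names
    }


def find_patched(baseline):
    """Return the pairs whose attribute no longer points at the baseline object."""
    return [
        (mod, attr)
        for (mod, attr), original in baseline.items()
        if getattr(importlib.import_module(mod), attr) is not original
    ]


# Take the baseline before importing any application code.
baseline = snapshot(KNOWN_SAFE)
```

In a test suite, `find_patched(baseline)` could be asserted empty after importing each application module, which would point at the import that did the patching.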

> If the utils module is strict, then we’d rely on the analysis of that module to tell us in turn whether log_to_network is safe.

I like this. It seems far more usable than proposals like adding const decorators.[2]

[1]: https://github.com/gevent/gevent/blob/master/src/gevent/monk...

[2]: https://github.com/python/typing/issues/242

miki123211 6 years ago |

This is yet another example of the divide between wizarding and engineering[1]. When you're a small startup, what matters is the expressiveness of your language, and the ability to do a lot of things very very quickly. Type safety, performance, readability, those things don't matter. You're just a bunch of engineers who know the whole codebase inside out, you're pretty certain of what you're doing. In short, you're wizarding. If you grow big enough, this approach slows you down greatly, and you need to switch to engineering. You sacrifice some speed for making the codebase more understandable to a larger group of people, you can no longer assume everyone knows all the code, you write unit tests, need types and dislike metaprogramming because of the confusion it creates. This is why languages like Python, Ruby, Lisp or Smalltalk are amazing for small startups, but Java is what enterprises use. They're different ends of the wizarding/engineering spectrum. I wish there was a language that let you move gradually from one end to the other, exactly when you need to.

[1] https://www.tedinski.com/2018/03/20/wizarding-vs-engineering...

k_sze 6 years ago |

Another thing that I would like to see in some kind of strict mode is the ability to mark explicit exports like in JavaScript modules. I often want to import multiple things globally at the top of a module because they are shared by multiple class or function definitions that I am writing. However, such imports end up being exposed to and usable by the consumers of my module, even though the consumers should really have imported those things at their source instead of via my module.

There are currently maybe two ways to tackle this “problem”, without a strict mode:

1. Don’t import at the global module scope; but that’s a bit tedious.

2. Import with rename, like `import os as _os`, and then leave it to the principle of “we’re all consenting adults”. I.e. if anybody imports and uses things that start with an underscore, it’s clearly their fault, not mine.
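
For reference, a small sketch of what Python offers today. The module is built in memory so the example is self-contained; `mymodule` and `cwd_name` are made-up names. `__all__` controls what `from mymodule import *` exports, but it does not stop anyone from reaching `mymodule.json` directly, which is the gap described above.

```python
import types

# Source of a hypothetical module, built in memory so the example is self-contained.
SOURCE = '''
import os as _os          # underscore alias: signals "not part of my API"
import json               # plain import: still reachable as mymodule.json

__all__ = ["cwd_name"]    # what `from mymodule import *` exports

def cwd_name():
    return _os.path.basename(_os.getcwd())
'''

mymodule = types.ModuleType("mymodule")
exec(SOURCE, mymodule.__dict__)

# Emulate `from mymodule import *`, which honours __all__ when it is defined.
star_exports = {name: getattr(mymodule, name) for name in mymodule.__all__}
```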

andreareina 6 years ago | |

3. Import as normal, and leave it to the principle of "we're all consenting adults"; unless something is explicitly called out as being part of the public API I consider Law of Demeter[1] "violation" the same as accessing _var.

[1] https://en.wikipedia.org/wiki/Law_of_Demeter

alexchamberlain 6 years ago |

I think this is an interesting idea, which appears to embed a stricter subset of Python within Python itself. Have the Instagram engineers tried floating this with the wider community via established channels like Python-Ideas or discuss.python.org?

jbmsf 6 years ago |

I like the idea, but it feels a bit heavy handed outside of a very large team.

I think the first step here is to get away from the assumption that importing a module will have "interesting" side effects. This is not only a problem with Python...

I tend to create mini "dependency injection" frameworks that create a pattern for loading module code at some point well after import. This pattern tends to reduce to wrapping whatever code you have in the module in a function/closure instead of just running it whenever.

Again, I like the idea of enforcing constraints with code, but I don't think it's a substitute for educating developers to avoid certain patterns and giving them infrastructure that makes the alternative easy.

marcoseliziario 6 years ago |

https://docs.python.org/3/library/importlib.html#importlib.u...

time4tea 6 years ago |

Wow. Talk about solving the wrong problem!

Millions of lines of code in a monolith. 20s start up time. Meta monkey patching. One unit test per process... Yikes!

Software architecture, anyone?

Maybe Instagram should get a copy of Michael Feathers' book...

rurban 6 years ago |

I like that idea, it's just not that easy. How do you define module versions and inheritance when you are not allowed to do global assignments in the module? Declarations only, with no I/O or global side effects, is fine, but declaring versions and inheritance needs to be allowed in global scope.

I added these ideas here: https://github.com/perl11/cperl/issues/406

tahdig 6 years ago |

> ... many of whom are new to Python.

well, if you ask me to write language X, I would definitely make mistakes for the first couple of weeks/months/years, that is why you need code review, mentoring and education plans for your hires.

> Here’s another thing we often find developers doing at import time: fetching configuration from a network configuration source.

  MY_CONFIG = get_config_from_network_service()

I am pretty sure this is an anti-pattern. If this code passed code review, you should make your review process more strict.

  def myview(request):
    SomeClass.id = request.GET.get("id")

> Likely you’ve already spotted the problem

Well, yes, why would you do this? Why would this pass code review? Why do we have linters and other checks for dynamic languages?

> It works great for smaller teams on smaller codebases that can maintain good discipline around how to use it, and we should switch to a less dynamic language.

It seems we are here blaming python for shortcomings of a monolith also, instead of chunking out specific businesses modules to separate services/micro-services.

To be honest, the strict mode seems interesting, but I believe the problems they seem to be facing can be solved by a couple of changes to their process and code:

- everyone gets a mentor if they are not experienced in python or django

- code review by at least two experienced Python developers (it does not count if you have coded in Java for 20 years)

- teams should try to move their logic outside the monolith(it sounds like they have a monolith)

- write CI tests to measure how much time it takes to import a file, if it takes more than T(line count * LINE_PROCESSING_THRESHOLD) you have to fix your code.

- prepare config and load it before running the actual server, no network call for getting config
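
The import-time check in the list above could be sketched roughly like this. `measure_import_seconds` and `check_import_budget` are hypothetical helpers, and the budget is a plain number of seconds rather than the line-count formula suggested. Note that only the named module is evicted from `sys.modules`, so dependencies that are already loaded are not re-timed.

```python
import importlib
import sys
import time


def measure_import_seconds(module_name: str) -> float:
    """Time an import of module_name in the current process."""
    sys.modules.pop(module_name, None)  # force the module body to run again
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def check_import_budget(module_name: str, budget_seconds: float) -> bool:
    """Return True if the import finishes within the (hypothetical) budget."""
    return measure_import_seconds(module_name) <= budget_seconds
```

For a per-module breakdown that includes dependencies, CPython also has the `-X importtime` command-line option, which may be a better starting point than timing by hand.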

All in all, Python is suitable for big companies too. The thing is, if you don't care about best practices, you will have problems even as a small startup, but in a big company it becomes impossible to move forward. The trick is to follow best practices and have code review, independent of company size.

scrollaway 6 years ago | |

That's a long post to say "do more code review instead of investing into technical solutions to technical problems".

Clearly, Instagram's solution saves them time. That means faster code reviews which incidentally makes them more accurate. Your post doesn't really make sense.

avip 6 years ago |

It's very important to think about objects lifecycle management.

It's also important to... use pytest fixtures instead of arbitrarily patching around in tests.

konschubert 6 years ago |

I have a question about a detail in the article:

> But if we moved the log_to_network call out into the outer log_calls function, [...] this would no longer compile as a strict module.

My current understanding is that the log_calls method would NOT get executed during module load time!?!

Why would having a side effect in this function violate the intention of __strict__ ?

scrollaway 6 years ago | |

> My current understanding is that the log_calls method would NOT get executed during module load time!?!

That's incorrect. log_calls gets executed on import because it's a decorator, so equivalent to `hello_world = log_calls(hello_world)` at the top-level (which does also get executed).

A log_to_network call inside the _wrapped() definition doesn't get executed until hello_world gets called, but one placed outside the definition of _wrapped (directly in the body of log_calls) does get executed at import.
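
A small sketch of that behaviour. The names `log_calls`, `log_to_network`, `_wrapped` and `hello_world` come from the discussion, but the bodies here are assumed, and `log_to_network` just appends to a list instead of touching the network.

```python
events = []


def log_to_network(message):
    # Stand-in for a real network call; it only records the message.
    events.append(message)


def log_calls(func):
    # This line runs when the decorator is applied, i.e. while the module is imported.
    log_to_network(f"decorating {func.__name__}")

    def _wrapped(*args, **kwargs):
        # This line runs only when the decorated function is actually called.
        log_to_network(f"calling {func.__name__}")
        return func(*args, **kwargs)

    return _wrapped


@log_calls
def hello_world():
    return "hello"


# State right after the definitions above have executed (the "import").
events_after_import = list(events)
result = hello_world()
```

At the point where `events_after_import` is captured, the list already holds the "decorating" entry even though `hello_world` has not been called yet, which is the import-time side effect a strict module would reject.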

konschubert 6 years ago | | |

Right! I missed the fact that log_calls is used as a decorator further down.

tln 6 years ago |

Avoiding module side effects and making classes and modules immutable seem like two separate concerns

bjoli 6 years ago | |

Not really. Mutation in general, and in modules in particular, inhibit a lot of reasoning about the code, and thus stops a whole lot of optimizations from being possible. Guile (a scheme dialect) recently got declarative modules for that reason, where a top level binding cannot change (i.e. you cannot set! a binding, but you can wrap in it a mutable container and change the contents of that container). This makes procedure calls and variable lookup a lot faster. Andy Wingo wrote about it here: https://wingolog.org/archives/2019/06/26/fibs-lies-and-bench... .

Those optimizations won't mean much for CPython, since CPython doesn't try to run things fast, but for something like PyPy this could be a big deal.

bjoli 6 years ago | | |

To quote the article (from memory): "adding static modules is probably the single most important optimization guile can do in the near future".

The quote is probably wrong, but it is right in spirit.

ianamartin 6 years ago |

Try Zope.Interface and Pyramid for a framework. You'll be really happy.

accidentaldev 6 years ago |

who would have thought Instagram is a python monolith. ?

carapace 6 years ago |

> Instagram Server is a several-million-line Python monolith

That's bananas.

Nothing Instagram does requires that much code.

Also, that much Python code means you're doing it wrong.

carapace 6 years ago | |

No, I'm seriously you guys.

Python is too expressive to require mega-LoC for that site.

You could implement an OS, relational DB, spreadsheet, and optimizing compiler all in less than that.

orf 6 years ago | | |

You have no idea about their codebase, the implementation details of their features nor how they counted the lines (comments included?). So stating that it’s dumb is beyond ridiculous.

You are right in that it’s certainly a high LoC count for Python, but still...

scrollaway 6 years ago | | |

As orf said you have no idea about their codebase. And you have no idea what's included in that statement -- given that they talk about startup time, they most likely are taking into account the whole framework, a plethora of admin and analytics tools, lots of debugging / debug-only infrastructure, migrations, lots of tooling whose sole purpose is making it easier to work in large teams, etc…

(And for the record, Linux is ~37 million lines of actual code, Postgres ~2 million, and gcc ~8 million)

There's nothing absurd about one of the most visited websites on earth to be a couple million LOC.

zestyping 6 years ago |

I like this a lot.

zallarak 6 years ago |

This article is among the best arguments for using a typed language I’ve yet seen.

kbd 6 years ago | |

This has nothing to do with types. It's more about static guarantees the language gives about module import behavior.

nothrabannosir 6 years ago | | |

In OP's defence:

> So that's a third pain point for us. Mutable global state is not merely available in Python, it's underfoot everywhere you look: every module, every class, every list or dictionary or set attached to a module or class, every singleton object created at module level. It requires discipline and some Python expertise to avoid accidentally polluting global state at runtime of your program.

> One reasonable take might be that we’re stretching Python beyond what it was intended for. It works great for smaller teams on smaller codebases that can maintain good discipline around how to use it, and we should switch to a less dynamic language.

> But we’re past the point of codebase size where a rewrite is even feasible. And more importantly, despite these pain points, there’s a lot more that we like about Python, and overall our developers enjoy working in Python. So it’s up to us to figure out how we can make Python work at this scale, and continue to work as we grow.

Those are literal quotes from the article. That is quite damning. How did they get to this point? By starting when Python was appropriate, and taking it day by day.

iso-8859-1 6 years ago | | |

Depends what you mean by "type". A type in e.g. Haskell specifies whether there are side effects.

ken 6 years ago | |

Surely this is a typo and you mean “functional language”, as mutability and state is the main issue here, not dynamic typing.

brenden2 6 years ago |

It still blows my mind that people don't use strongly typed languages in the first place and spare themselves from all this future pain.

My guess (based on my experiences) is that companies wind up in this position from having inexperienced people building early versions of products instead of hiring experienced engineers (who are usually more expensive).

b3orn 6 years ago | |

Python is strongly typed, just not static.

brenden2 6 years ago | | |

Python uses duck typing: https://en.wikipedia.org/wiki/Duck_typing

I would categorize it as a subset of dynamic typing, and that's what Wikipedia says too.

ainar-g 6 years ago | |

It's a constant struggle against the current. Dynamically-typed languages are often “good enough for the time being”. I have the same issue explaining to our C/C++/Obj-C team why they should use static (Clang-Tidy, Infer, PVS-Studio) and dynamic (ASan, MSan, UBSan) analysis tools. They just keep giving me basically the same response of “I am a good programmer, and my code is good, and shame on you for even daring to think that a mere machine could find bugs in my code!”. I don't know what kind of status anxiety causes it. It also makes me think about what kind of other things I am missing because of the way I keep thinking that I do that thing well enough myself.

ken 6 years ago | | |

I'm confused. It should be easy to demonstrate the benefit, if there is one. Just show them the bugs!

For me, it's not "status anxiety". It's simply not worth the effort.

The last couple static analysis tools I ran on my programs, I spent a while getting the tool to not-crash (because even though the authors obviously had a static analysis tool themselves, they either didn't bother to run it on their own code, or it wasn't good enough to find actual issues). These tools flagged only a couple issues, and almost all of them were places where it couldn't really cause any problems, but the type system was not strong enough for me to prove why it couldn't go bad. So I spent a while sorting through false-positives.

I'm not going to spend hours with a tool to find only a couple (real) bugs, which no user has ever reported seeing, and which I've gotten no automated crash reports about. I have much better uses for my time.