Making 1M requests with Python-aiohttp (pawelmhm.github.io)
If you're connecting to 127.0.0.1:8080, then each connection from 127.0.0.1 is going to be assigned an ephemeral TCP source port. There are only a finite number of such ports available, on the order of ~30-50k, which limits the number of connections from a single address to a specific endpoint.
If you're doing 100k TCP connections with 1k concurrent connections, it's feasible that you'll run into those limits, with TCP connections hanging around in the TIME_WAIT state after close().
Not that this would be a documented errno for connect(), but it's the interpretation that makes sense.
http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not... http://lxr.free-electrons.com/source/net/ipv4/inet_hashtable...
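To put rough numbers on that limit, here is a back-of-the-envelope sketch. It assumes an ephemeral range of 32768-60999 (a common Linux default, check net.ipv4.ip_local_port_range on your box) and a 60-second TIME_WAIT, and it assumes the client is the side that closes every connection. The function name is made up for illustration.

```python
def max_sustained_conn_rate(port_low=32768, port_high=60999, time_wait_s=60):
    """Rough upper bound on new connections per second from one source IP
    to one destination (IP, port) when every closed connection keeps its
    source port in TIME_WAIT for time_wait_s seconds."""
    ports = port_high - port_low + 1
    return ports / time_wait_s


# Assumed defaults; real values depend on the kernel configuration.
ports = 60999 - 32768 + 1
rate = max_sustained_conn_rate()
print(ports, "ports,", round(rate), "new connections/second at most")
```

With those assumed defaults, a benchmark that opens a fresh connection per request can exhaust the range within a minute or so, which would be consistent with the error described above.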
A hacky way to get around that is to enable tcp_tw_reuse, which lets you reuse ports, but it can be risky if you get a SYN from the previous connection that happens to line up with the sequence number of the current connection (which will close your connection). It shouldn't happen often, and if you can tolerate a small amount of failure it is an easy way to get around this limit.
[0] http://blog.davidvassallo.me/2010/07/13/time_wait-and-port-r...
I'm a little surprised some simple googling didn't turn up any examples of this - I'm sure someone has tried it out in order to do some benchmarking of high-performance network servers/services?
Apparently ipv6 changes this to a single (loopback) address, but then again, with ipv6 you can use entire subnets per network card.
Actually Linux will fall back to using TCP timestamps to distinguish between different connections. Ironically, people will disable timestamps too to "fix" other issues[1], which also breaks PAWS[2] and may cause the issue you are describing.
[1] It can break with some NAT and some load balancers. Actually the way I learned about tcp_tw_reuse was when we plugged in a new load balancer. We tested that everything worked fine, but as soon as we sent production traffic many connections took a few seconds to complete. It took 2 weeks of looking at packet dumps to find the cause. It turns out that the load balancer was set up in an active-active configuration, so different connections had different timestamps. This caused Linux to get confused and ignore some packets. It turned out one of the managers wanted to make everything performant and copied some sysctls (that included tcp_tw_reuse and tcp_tw_recycle) from the Internet without much thought. After restoring the setting everything worked flawlessly.
[2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...
[1]: https://idea.popcount.org/2014-04-03-bind-before-connect/
https://github.com/mayfield/cellulario
And an example of using it to manage a multi-tiered scheme where a first layer of IO requests seeds another layer and then you finally reduce all the responses: https://github.com/mayfield/ecmcli/blob/master/ecmcli/api.py#L456
What I suspect though is that asyncio is not all that much better than gevent. Can someone correct me on this?
Overall, very little reason to consider Py3 at all. Performance would have been one, if there were a comparison between gevent and asyncio.
Here, mioco handling 10M http requests per second(1) on my desktop:
https://github.com/dpc/mioco/blob/master/BENCHMARKS.md
1) with a bit of cheating http server.
With actual proper http parsing it goes down to 368K req/s, but that's still a lot.
https://www.techempower.com/benchmarks/#section=data-r12&hw=...
We wrote aiohttp for our production system. We build everything on aiohttp. In our production systems we constantly run more requests than in the benchmark, with business logic on each request.
The main reason we like aiohttp a lot is that we can write asynchronous code that reads like synchronous code and does not have callbacks.
This will provide two benefits:
1. You won't need to use a semaphore. To limit connections, create a TCPConnector() object with limit set to the value you used in the semaphore and pass it to the ClientSession(); aiohttp will then not make more connections than that limit (default behavior is to have an unlimited number of connections).
2. With a single ClientSession(), aiohttp will make use of keep-alive (i.e. it will reuse the same connections for subsequent requests, but it will keep at most the number of connections you set in the TCPConnector() object).
This should improve performance further, and (given a sane limit) it'll also solve the issue with the "Cannot assign requested address" error.
BTW: Even without a limit set, aiohttp will try to reduce the number of open connections, so it might still fix the connection error as long as individual requests don't take long. It's still a good idea to set a limit, just to be nice to the remote server.
Some pointers:
The event loop is still a single thread and therefore subject to the GIL. That means that at any given time, only one coroutine is running in the loop. This is important for several reasons, but probably the most relevant are that
1. within any given coroutine, execution flow will always be consistent between yield/await statements.
2. synchronous calls within coroutines will block the entire event loop.
3. most of asyncio was not written with thread safety in mind
That second one is really important. When you're doing file access, eg where you're doing "with open('frank.html', 'rb')", that's something you may want to consider moving into a run_in_executor call. That will block the coroutine, but it will return control to the event loop, allowing other connections to proceed.
Also, more likely than not, the too many open files error is a result of you opening frank.html, not of sockets. I haven't run your code with asyncio in debug mode[1] to verify that, but that would be my intuition. You would probably handle more requests if you changed that -- I would do the file access in a run_in_executor with a max executor workers of 1000. If you want to surpass that, use a process pool instead of a threadpool, and you should be ready to go, though it's worth mentioning that disk IO is hardly ever cpu-bound, so I wouldn't expect you to get much performance boost otherwise.
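A minimal sketch of the run_in_executor idea, using only the standard library. The temp file stands in for frank.html, the handler names are hypothetical, and the pool is capped at 8 workers purely for illustration (the suggestion above was 1000). Since only a worker thread ever opens the file, the pool size also bounds how many copies of it are open at once.

```python
import asyncio
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    # Plain blocking read; runs in a worker thread, not on the event loop.
    with open(path, "rb") as f:
        return f.read()


async def handle_request(loop, executor, path):
    # Only this coroutine waits; the loop keeps serving other tasks.
    return await loop.run_in_executor(executor, read_file, path)


async def main(path, n_requests=50):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as executor:
        tasks = [handle_request(loop, executor, path) for _ in range(n_requests)]
        return await asyncio.gather(*tasks)


# Stand-in for frank.html so the sketch is self-contained.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<html>frank</html>")

bodies = asyncio.run(main(tmp.name))
os.unlink(tmp.name)
print(len(bodies), "responses read without blocking the loop")
```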
Also, the placement of your semaphore acquisition doesn't make any sense to me. I would create a dedicated coroutine like this:
    async def bounded_fetch(sem, url):
        # acquire and release the semaphore inside the same coroutine
        async with sem:
            return await fetch(url)
and modify the parent function like this:
    for i in range(r):
        task = asyncio.ensure_future(bounded_fetch(sem, url.format(i)))
        tasks.append(task)
That being said, it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.
[1] https://docs.python.org/3/library/asyncio-dev.html#debug-mod...
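For anyone who wants to see the dedicated-coroutine pattern run end to end, here is a self-contained sketch. The fetch() below is a stand-in that only sleeps and counts how many calls are in flight, the URL template and LIMIT are made-up values, and none of this is the article's actual client code.

```python
import asyncio

LIMIT = 10      # hypothetical cap on concurrent requests
in_flight = 0
peak = 0


async def fetch(url):
    # Stand-in for the real HTTP call; it only tracks concurrency.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return url


async def bounded_fetch(sem, url):
    # The semaphore is acquired and released by the same coroutine.
    async with sem:
        return await fetch(url)


async def run(r):
    sem = asyncio.Semaphore(LIMIT)
    url = "http://localhost:8080/{}"
    tasks = [asyncio.ensure_future(bounded_fetch(sem, url.format(i)))
             for i in range(r)]
    return await asyncio.gather(*tasks)


results = asyncio.run(run(100))
print(len(results), "requests done, peak concurrency", peak)
```

All 100 tasks are created up front, but only LIMIT of them are ever inside fetch() at the same time; the rest wait on the semaphore.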
Ummm that seems a bit far reaching.
For high-concurrency purposes, asynchronous programming is far more scalable (see: epoll/kqueue + state machines).
For high-throughput, low-concurrency operations, it doesn't matter as much.
Am I missing something? What's so amazing about this?
I just deployed some production feed that serves at 1955 requests/second on a cheap VPS in freaking PHP, one of the slowest languages out there.
The article is not about testing performance of a web server, but showcasing performance differences between synchronous and asynchronous code using asyncio. So, not about serving requests, but consuming.
I'm genuinely curious.
I would be interested in anything doing 10,000+ req/sec on a cheap VPS. 320 is nothing.
People achieve 2 million requests/second with C++ on EC2:
https://medium.com/swlh/starting-a-tech-startup-with-c-6b5d5...
I've switched to Python3.5 and aiohttp for all new web service applications. The coding style is clean, enjoyable to write, and easy to debug.
Plus, I've never once been stymied for speed. I know there's applications out there where people expect to to be handling zillions of connections -- but the bulk of my use cases think 100 transactions per second is a huge through-put, and aiohttp handles that with ease.
Supposedly you can do this in javascript by running node with a particular flag, then connecting to a port on localhost, and opening the chrome debugger. However, the multiple times I've tried throughout 2014-2016 have shown that to be incredibly finicky. It is especially frustrating when trying to insert a debugger into an automated test.
JavaScript has a concurrency model based on an "event loop". This model is quite different than the model in other languages like C or Java.
...
A very interesting property of the event loop model is that JavaScript, unlike a lot of other languages, never blocks. Handling I/O is typically performed via events and callbacks, so when the application is waiting for an IndexedDB query to return or an XHR request to return, it can still process other things like user input.
In what ways are languages that were supposedly designed for async programming different from Python?
Python is definitely lacking an elegant interface for async programming.
The downsides have nothing to do with the design of the language. The problem is that introducing a new concurrency model late in a language's life splits the ecosystem. Most existing packages are synchronous, so if you want to build asynchronous systems you must avoid packages like requests, django, or sqlalchemy and find (or develop) asynchronous equivalents for the functionality you need.
Javascript has an advantage here not because the design of the language is especially well-suited for asynchronous programming, but because it never went through a synchronous/multi-threading phase. Every javascript package is designed for asynchronous use.
Obviously there is Twisted and Tornado - but gevent or asyncio are actually the paradigms that people are using now. If there were a Flask-like framework that was built from the ground up to leverage async (rather than bolting it on) and included all the batteries for web development, then python would have a serious edge over node.
> You would probably handle more requests if you changed that -- I would do the file access in a run_in_executor with a max executor workers of 1000.
This is a really good point. I'm going to check this and edit the post to add this information.
> Also, the placement of your semaphore acquisition doesn't make any sense to me. I would create a dedicated coroutine like this:
Looking at my semaphore code the day after writing it, I do wonder if I'm using it correctly. I assumed it works correctly because it fixed my "too many open files" exception, which seems to mean that I'm no longer exceeding the 1024 open files limit. Can you clarify why you think my use of semaphore does not make sense and why your suggestion is better? What is the benefit of a dedicated coroutine?
> That being said, it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.
I admit that I focused more on my client than server. One thing that worries me about my test server is that it does not print any exceptions. Either it does not fail at all, which seems unlikely, or it fails silently, which is more likely and is bad. So I need to check my server code to see what exactly happens there.
> it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.
The main reason for the semaphore in client code is that it should stop the client from making over 1k connections at a time. My logic here is that if the client won't make 1k connections at a time, the server won't receive 1k connections at a time, and thus there will be no problem of too many open files on the server (it won't have to send more than 1k responses). However, I see that this logic may not be totally correct; another comment points out that it's possible for sockets to "hang around" after closing: https://news.ycombinator.com/item?id=11557672 so I need to review that and edit the post.
> https://docs.python.org/3/library/asyncio-dev.html#debug-mod...
this looks really great, will look into this thanks.
> I assumed it works correctly because it fixed my "too many open files" exception
It works, so at the end of the day that's what matters. The client vs server question, from my perspective, ultimately comes down to a question of test realism; in a real-world deployment you couldn't limit connections with client-side code because there are multiple clients. That's what I mean by "it doesn't make sense given that the error is server-side".
> Can you clarify why you think my use of semaphore does not make sense and why your suggestion is better? What is the benefit of dedicated coroutine?
I'm saying that mostly, but not exclusively, from a division of concerns standpoint. You're acquiring the semaphore in a completely different context than you're releasing it. On the one hand, that's partly a programming style issue. On the other hand, it can also have some really important consequences: for example, it's actually the event loop itself that is releasing the semaphore for you when the task is done. Because of the way the event loop works, it's hard to say exactly when the semaphore will be released. You want to hold it for the absolute minimum time possible, since it's holding up execution of other connections in the loop. Putting it into a dedicated coroutine makes it clearer what's going on, makes it such that the acquirer and releaser of the semaphore are the same, and means you are definitely holding the semaphore for the minimum amount of time possible (since, again, execution flow will not leave any particular coroutine until you yield/await another). In general I would say that releasing the semaphore in a callback is significantly more fragile, and mildly to moderately less performant, than creating a dedicated coroutine to hold the semaphore and handle the request.
Does that all make sense?
> Either it does not fail at all, which seems unlikely, or it fails silently, which is more likely and is bad.
That's a fair statement, I think. As an aside, the print statement is slow, so keep that in mind. It might actually be faster to have a single memory-mapped file for the whole thing, and then just append the error and traceback to the file. The built-in traceback library can be very useful for that. That's also a bit more realistic, since obviously IRL you wouldn't be using a print statement to keep track of errors. On a similar note, because file access is so slow, you'd be best off figuring out some way to remove the part where the server accesses the disk once per connection entirely. On a real-world system you'd possibly use some kind of memory caching system to do that, especially if you're just reading files and not writing them. That allows you to use a little more memory (potentially as little as enough to have a single copy of the file in memory) to drastically improve performance.
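A rough sketch of both suggestions together: keep the file contents in memory after the first read, and append tracebacks to a log file instead of printing them. It uses functools.lru_cache as the "memory caching system" and a plain append-mode file rather than the memory-mapped file mentioned above, so treat it as one simple variant. The handler name and file names are hypothetical.

```python
import os
import tempfile
import traceback
from functools import lru_cache


@lru_cache(maxsize=32)
def load_page(path):
    # First call hits the disk; later calls return the cached bytes.
    with open(path, "rb") as f:
        return f.read()


def handle(path, error_log):
    try:
        return 200, load_page(path)
    except Exception:
        # Append the full traceback to a log file instead of printing it.
        with open(error_log, "a") as log:
            log.write(traceback.format_exc())
        return 500, b""


# Self-contained demo with a throwaway directory.
workdir = tempfile.mkdtemp()
page = os.path.join(workdir, "frank.html")
error_log = os.path.join(workdir, "errors.log")
with open(page, "wb") as f:
    f.write(b"<html>frank</html>")

first = handle(page, error_log)
second = handle(page, error_log)    # served from memory, no disk access
missing = handle(os.path.join(workdir, "nope.html"), error_log)
print(first[0], second[0], missing[0], load_page.cache_info().hits)
```

Note that a cache like this never notices when the file changes on disk, which is fine for a benchmark but would need an invalidation strategy in a real service.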
Really you could say that about a number of features/changes in python 3, not just the new async syntax. Python 3 itself was an ecosystem-splitting instrument.
C++/Proxygen =1,990,130 requests per second
Python/Tornado = 41,329 requests per second
Thanks for sharing btw
Well, for this particular start-up, I think c++ was an excellent choice (especially as they already knew c++!) -- but what if you could still run the service on a single server with python? You would still need 2-3 servers with the c++ version (failover, test etc).
Or it might turn out that for production load, you'd need 10 python server-instances. Sure, one server could handle this with c++ -- but you're not actually saving 39/40, you're only saving 9/10, because you didn't need "the full 40".
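To make that arithmetic concrete, here is a tiny sketch using the two benchmark figures quoted above. The production load of 400,000 requests per second is a made-up number chosen only to illustrate the point.

```python
import math


def servers_needed(load_rps, capacity_rps):
    # Smallest whole number of servers that can absorb the load.
    return math.ceil(load_rps / capacity_rps)


load = 400_000  # hypothetical production load, requests/second
python_servers = servers_needed(load, 41_329)    # Python/Tornado figure
cpp_servers = servers_needed(load, 1_990_130)    # C++/Proxygen figure
print(python_servers, cpp_servers)
```

Under that assumed load you would run 10 Python instances against 1 C++ instance, so the saving is 9 servers out of 10, before counting the extra machines needed for failover and testing either way.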
Not to mention, that it seems unlikely that the http request stuff is the limiting factor in a distributed OLAP system. So you might do 100 "OLAPS/s", and have them easily served by a 40Kreq/s python service.
"I guess we could have written it in Python to start off with but, economically, it would be a wastage of labor cost and time because, at some stage, we would have to scrap it for a C++ version to get the performance we need. The Python code will have no economic value once scrapped."
As mentioned, I think c++ was an excellent choice, but the above assumes they'd have to rewrite the entire system in c++, and scrap all the python. Granted, if you don't know python, but do know c++ very well -- it's doubtful that it would be faster to prototype in python than to just use c++.
But if (wild guesstimate) you could get 90% of the features, at 10% the loc/dependencies in python -- who's to say that wouldn't make sense? Maybe parts could be done in c/c++, or the project could move to pypy for enough of a speedup that no complete rewrite would be needed... etc.
Overall though, I really think this is a great illustration that high-performance compiled languages offer great performance, and can be rather pleasant to work with.
yeah it does make sense.
> in a real-world deployment you couldn't limit connections with client-side code
yeah that's a very good point. But in a real world scenario handling this would not be that easy. Limiting the number of available connections on the server side is not a trivial task to implement. Setting your server to avoid failures and simply return either 503 service unavailable to some clients or 429 (too many requests) to others would probably require quite a lot of coding. It's also not very clear to me how this would be implemented; how do people implement things like this? Just putting some check for the number of open files before the line that opens the file and setting the response code to 500 or 429 before opening the file? This would only stop the server from opening too many "html" files, but would not stop the server from getting flooded with connections. Is my aiohttp app even the right place to add checks like this? Wouldn't it be better to use haproxy or nginx or some other load balancing service in front of the aiohttp app and let it handle too much traffic?
Another thing that comes to my mind (need to check this later) is that perhaps some partial "handling" of cases like this could/should be implemented in the aiohttp library. I'm not sure how it behaves now, but maybe it should simply fail to open the file, return 500 to the client, and print a noisy traceback about open files to my logs? I didn't see this behavior when doing my tests, so either it didn't occur, is not implemented in aiohttp, or it occurred and I somehow missed it. From my experience with Twisted I know that this is how Twisted resources behave: if you have some unhandled exception, twisted just returns 500 to the client and shows the traceback in the logs.
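One possible shape for such a check, sketched with plain asyncio rather than aiohttp: a semaphore tracks in-flight work, and a request that finds no free slot is answered with 503 immediately instead of queueing. The handler, the capacity of 3 and the sleep standing in for file access are all assumptions for illustration; a proxy such as nginx or haproxy in front of the app may well be the better place for this.

```python
import asyncio

MAX_IN_FLIGHT = 3   # hypothetical capacity


async def handle(sem, work_seconds=0.01):
    if sem.locked():
        # No free slot: fail fast instead of queueing behind the others.
        return 503
    async with sem:
        # Stand-in for opening and reading the file.
        await asyncio.sleep(work_seconds)
        return 200


async def main(n):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(handle(sem) for _ in range(n)))


statuses = asyncio.run(main(10))
print(statuses.count(200), "served,", statuses.count(503), "rejected")
```

Dropping the sem.locked() check turns this into the "wait for your turn" variant, where nobody is rejected but responses get slower under load.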
> Just putting some check for number of open files before line that opens file and setting response code to 500 and 429 before opening file?
So actually this is one of the big benefits of putting the semaphore limiting file access within its own dedicated coroutine (except on the server side instead of the client). It allows you to handle the connection without having to deal with immediate responses. What that means in practice is that your server will be slower to respond under high load, but until it hits the client's (browser) timeout limit, you'll still be able to respond. It actually doesn't require any extra code to do that. Note that this isn't the only way to achieve this result, but it's probably the most direct, and simplest, especially given the approach you've taken with the code thus far.
A load balancer sits on top of that, ideally monitoring metrics like server CPU usage, memory load, or (most directly) request response time, and then shifts around requests between servers accordingly, to minimize the delay incurred in the aforementioned "wait for semaphore (or other synchronization primitive)" part.
At the end of the day, until you start hitting the limit of concurrent connections that others have mentioned, you don't really actually need to worry very much about how many connections you have open at once. You just want to focus on handling every connection you have as quickly as possible.
I think that's a little strong. It's more like "The group controlling Javascript has mostly tried to discourage introduction of things that block".
You can, for example, do a blocking XMLHttpRequest. It's deprecated, but possible. https://jsfiddle.net/923d5sda/
A for loop with 10,000 repetitions will block the whole interpreter for its duration.
Any JSON parsing does the same.
Processing strings.
Doing math work.
...
The other languages don't have a builtin concurrency model. For C, Java and others, event loop libraries and applications built on top of them can be found (nginx, netty, ...). There are also libraries that build on top of synchronous IO.
The eventloop was also not tied to Javascript in the older standards. Only the introduction of Promises and other stuff required the existence of an eventloop in order to define when continuations should run.
"The event loop model never blocks" is also only true as long as you (and all the libraries that you use) do not block it. There is no automatic "does not block" guarantee.
JavaScript actually always blocks. It's only external function calls that somebody took care to write in an evented style that don't block -- but anything written in pure Javascript (from for loops to text manipulation) blocks.
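The same property is easy to demonstrate with Python's asyncio, which is also a single-threaded event loop. In this sketch (all names are made up) a heartbeat coroutine tries to tick every 10 ms while another coroutine makes one synchronous 200 ms call; the heartbeat records the gap between its ticks.

```python
import asyncio
import time


async def heartbeat(gaps, ticks=20, interval=0.01):
    # Records how long each tick actually took to come around.
    last = time.monotonic()
    for _ in range(ticks):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking_handler():
    await asyncio.sleep(0.03)
    # Synchronous call: nothing else on the loop runs for these 200 ms.
    time.sleep(0.2)


async def main():
    gaps = []
    await asyncio.gather(heartbeat(gaps), blocking_handler())
    return gaps


gaps = asyncio.run(main())
print("largest heartbeat gap: {:.3f}s".format(max(gaps)))
```

One of the gaps ends up at roughly the length of the blocking call, because the loop cannot resume the heartbeat until the synchronous code returns. Any pure-CPU loop or string processing would have the same effect as the time.sleep() here.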
JS single-thread async without preemptiveness is not something to write home about...
Well, Erlang (and Elixir) does just that -- it's preemptive, and implicitly yields under the covers even in loops.
>Node.js people don't really think that sort of thing is "non blocking", do they?
Judging from forum threads and blog posts, a lot of them do, especially web programmers not familiar with blocking and non-blocking that only know that "Node is webscale".
Unicode? Not having to deal with encoding all over the place has been well worth the switch to Python 3. If performance is a huge issue, I honestly don't know why you would stay on Python (regardless of version).
I wouldn't want to switch back to Python 2.7 if I can avoid it. There's honestly no reason not to go with 3.4 or 3.5 at this point, unless you happen to have a large Python 2 code base.
> do_something(*some_args, *some_more_args)
is rocking my world right at the moment. That's a massive time-saving feature I've been waiting for and worth the price of a 3.5 upgrade.
I see this a lot. That one doesn't care about performance enough to switch to, say, C++ or Go doesn't mean one is OK with regressions relative to what they currently use.
But by and large I agree. 3 hasn't provided enough of a carrot and 2 hasn't been enough of a pain for many people to want to switch, especially when it comes to existing stable code bases. For new development, yes, many can and should pick Python 3. But if, say, Python 3 had brought even a 20% speed improvement overall, the move would have been a lot faster.
I find people are ok with accepting some breakages due to re-writes if either existing stuff is very broken, or new stuff is so much better.
I have zero problems with Python 3 per se - but when the vast majority of the ecosystem is on Py2 and there is no difference in performance... then I see no reason to consider any breakages.
As a result, guys like you who are thinking rationally become the problem for being 'lazy' (acting in your own best interests, which is exactly what everyone is doing) and are the enemy. People, especially newcomers, get tired of waiting for Python2 to 'die', and instead of putting the onus on those who made the mistakes with Python3 (they've stuck it out, refusing to correct their own mistakes because that's more work), it becomes twisted and you are now the problem in their mind. Even though you were probably a part of what made Python successful to begin with. Amazing how that pans out, right?
Usually these illusions go away once a full time job is found, there are some but the vast majority are companies with big Python2 codebases that have features to deliver. If they move services anywhere from CPython it's to PyPy for the performance gain.
Very helpful for porting: http://python-future.org
- Python 2 will be EOL in 2020, that's four years
- vastly improved Unicode support
- a number new libraries are Python 3 only
- asyncio and the new async syntax
- exception chaining (!)
- type annotations
- lots of improvements all over the place
Running out the clock on waiting for library support to become mostly python 3 is inevitable, but so far has been slower than molasses.
I won't dispute you on any other aspects, except two. Have you tried using gevent versus asyncio? gevent is running in production at several of the largest API services in the world. Asyncio is not yet deployed at this scale.
Second, about new libraries being python 3 only - I really dispute that. In fact it's the other way around. For example, the brand new Tensorflow library (which google uses in production for its own AI) was released on Python 2 only, and Python 3 support was later patched in. This is the case with every new library of consequence that I'm seeing.
At some point, the opportunity cost of staying with Python 2.7 is higher than the one-time effort of porting everything to Python 3, so companies will move. Especially the likes of Google and Dropbox.
gevent vs. asyncio - sure, gevent is more mature. But asyncio is undoubtedly the better/nicer API. asyncio's explicit await syntax is much nicer than gevent's implicit monkey-patching, which makes it harder to reason about the code.
Tensorflow isn't really a new library, it was only recently open sourced.
As for Python 3-only libraries:
- https://pypi.python.org/pypi?:action=browse&show=all&c=595
Also note that Django will drop Python 2 support in time with the Python 2 EOL: https://www.djangoproject.com/weblog/2015/jun/25/roadmap/
Also, many new Python-based open source projects are Python 3 only.
There's a cost to maintain, and that's why PSF is moving to 3. I'm sure if Google and/or Dropbox would pay, PSF might continue maintenance[1].
Anyway the issue here is that you won't get any new features in python 2.7, and that already happened last year, currently python 2 only gets security fixes.
> have you tried using gevent versus asyncio ? gevent is running in production at several of the largest API services in the world. Asyncio is not yet deployed at this scale.
Both of them are available on python 3. On python 2 you can't choose. AsyncIO's strength is that it is integrated into the language. Especially in 3.5 you have a new syntax to use it, no need to install new libraries, no need to compile anything. No monkey patching. Also AsyncIO is more generalized and could be adapted to other tasks, not necessarily for asynchronous calls.
[1] In fact surely Red Hat will continue to maintain it until 2027 since they decided to ship it with Red Hat 7, but question is whether they will share it with general users, or will they only provide it to paying customers.
Management is fine with development time spent on migrating to Python 3, since it's an investment in the future (Python 2 will be EOL in 2020!).
I use Python3 sometimes as well, but doesn't mean what I'm saying isn't true.