Making 1M requests with Python-aiohttp (pawelmhm.github.io)

123 points by dante9999 10 years ago | 81 comments

terom 10 years ago |

Re the EADDRNOTAVAIL from socket.connect(),

If you're connecting to 127.0.0.1:8080, then each connection from 127.0.0.1 is going to be assigned an ephemeral TCP source port. There are only a finite number of such ports available, on the order of ~30-50k, which limits the number of connections from a single address to a specific endpoint.

If you're doing 100k TCP connections with 1k concurrent connections, it's feasible that you'll run into those limits, with TCP connections hanging around in the TIME_WAIT state after close().

Not that this would be a documented errno for connect(), but it's the interpretation that makes sense.

http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not...
http://lxr.free-electrons.com/source/net/ipv4/inet_hashtable...

ahuang 10 years ago | |

Generally it's the upper 32k ports that are ephemeral, and if you churn through more than that per minute in connections, you'll run into that TIME_WAIT issue.
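A rough back-of-the-envelope for that limit, as a sketch only. The port range (32768 to 60999) and the 60-second TIME_WAIT are assumed typical Linux defaults, not values taken from the article; check `net.ipv4.ip_local_port_range` on your own machine.

```python
def ephemeral_port_count(low: int, high: int) -> int:
    """Number of source ports in an inclusive ephemeral range."""
    return high - low + 1


def max_sustained_connect_rate(low: int, high: int, time_wait_s: float) -> float:
    """Upper bound on new connections per second from one source IP to one
    (destination IP, destination port), assuming every closed connection
    keeps its source port busy for time_wait_s seconds."""
    return ephemeral_port_count(low, high) / time_wait_s


# Assumed values (typical Linux defaults), not measurements.
LOW, HIGH, TIME_WAIT_S = 32768, 60999, 60.0

ports = ephemeral_port_count(LOW, HIGH)
rate = max_sustained_connect_rate(LOW, HIGH, TIME_WAIT_S)
print(f"{ports} ports, roughly {rate:.0f} new connections/sec before exhaustion")
```

Under those assumptions the ceiling is a few hundred fresh connections per second, which is in the same ballpark as the rates discussed in this thread.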

A hacky way to get around that is to enable tcp_tw_reuse, which will let you reuse ports, but it can be risky if you get a SYN from the previous connection that happens to line up with the sequence number of the current connection (which will close your connection). It shouldn't happen often, and if you can tolerate a small amount of failure it's an easy way to get around this limit.

[0] http://blog.davidvassallo.me/2010/07/13/time_wait-and-port-r...

e12e 10 years ago | | |

For benchmarking loopback connections, addressing really shouldn't be an issue, as you have an entire /8 subnet to split between your client(s) and server(s) (127.0.0.0/8). You would need some logic to set up e.g. 10,000 listening servers and 1,000,000 clients to get it working, and at some point you'd probably run into memory or other limits.

I'm a little surprised some simple googling didn't turn up any examples of this - I'm sure someone has tried it out in order to do some benchmarking of high-performance network servers/services?

Apparently ipv6 changes this to a single (loopback) address, but then again, with ipv6 you can use entire subnets per network card.

takeda 10 years ago | | |

> A hacky way to get around that is to enable tcp_tw_reuse, which will let you reuse ports, but it can be risky if you get a SYN from the previous connection that happens to line up with the sequence number of the current connection (which will close your connection)

Actually, Linux will fall back to using TCP timestamps to distinguish between different connections. Ironically, people will disable timestamps too to "fix" other issues[1], which also breaks PAWS[2] and may cause the issue you're describing.

[1] It can break with some NAT and some load balancers. Actually, the way I learned about tcp_tw_reuse was when we plugged in a new load balancer. We tested it and everything worked fine, but as soon as we sent production traffic many connections took a few seconds to complete. It took 2 weeks of looking at packet dumps to find the cause. It turned out that the load balancer was set up in an active-active configuration, so different connections had different timestamps. This caused Linux to get confused and ignore some packets. It also turned out that one of the managers wanted to make everything performant and copied some sysctls (including tcp_tw_reuse and tcp_tw_recycle) from the Internet without much thought. After restoring the settings everything worked flawlessly.

[2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

oxplot 10 years ago | |

Localhost goes from 127.0.0.1 through 127.255.255.254. By binding each connection to a random IP in that range [1], one could get a better mileage.

[1]: https://idea.popcount.org/2014-04-03-bind-before-connect/
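A minimal sketch of that bind-before-connect idea using only the standard library. The helper names are made up for illustration. It assumes a system (such as a default Linux setup) where any 127.x.x.x address is usable as a local source; other systems may need loopback aliases configured first.

```python
import ipaddress
import random
import socket


def random_loopback_ip() -> str:
    """Pick a random host address between 127.0.0.1 and 127.255.255.254."""
    first = int(ipaddress.IPv4Address("127.0.0.1"))
    last = int(ipaddress.IPv4Address("127.255.255.254"))
    return str(ipaddress.IPv4Address(random.randint(first, last)))


def connect_from(source_ip, dest, timeout=5.0):
    """Open a TCP connection to dest=(host, port), bound to source_ip.

    Port 0 in source_address lets the kernel choose the source port.
    """
    return socket.create_connection(
        dest, timeout=timeout, source_address=(source_ip, 0)
    )
```

A client would then call something like `connect_from(random_loopback_ip(), ("127.0.0.1", 8080))` for each new connection, so the ephemeral ports are spread over many source addresses instead of one.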

tooker 10 years ago |

I have a library for doing coordinated async IO in python that addresses some of the scheduling and resource contention issues hinted at in the later part of this post. It's called cellulario, in reference to containing async IO mechanics inside a cell wall.

    https://github.com/mayfield/cellulario

And an example of using it to manage a multi-tiered scheme where a first layer of IO requests seeds another layer and then you finally reduce all the responses..

    https://github.com/mayfield/ecmcli/blob/master/ecmcli/api.py#L456

simonw 10 years ago | |

This looks really promising. I've often wanted to be able to do exactly this: run a bunch of async code in the middle of an otherwise synchronous block (classic example: writing a Django view which fires off a bunch of parallel HTTP API requests and continues once all of them have either returned or timed out).
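For reference, the same shape can be sketched with plain asyncio (this is not cellulario's API). `call_api` is a stand-in for a real HTTP call, `view` is a hypothetical synchronous entry point, and `asyncio.run` assumes Python 3.7+ with no event loop already running in the calling thread.

```python
import asyncio


async def call_api(name: str, delay: float) -> str:
    """Stand-in for an HTTP request; delay simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def gather_with_timeout(calls, timeout: float):
    """Run all calls concurrently; a call that exceeds timeout yields None."""

    async def guarded(coro):
        try:
            return await asyncio.wait_for(coro, timeout)
        except asyncio.TimeoutError:
            return None

    return await asyncio.gather(*(guarded(c) for c in calls))


def view():
    """Synchronous code that fans out, waits, and then continues."""
    calls = [call_api("fast", 0.01), call_api("slow", 5.0)]
    return asyncio.run(gather_with_timeout(calls, timeout=0.5))
```

Here `view()` blocks for about half a second and returns one result plus `None` for the call that timed out.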

tooker 10 years ago | | |

That's almost exactly the use case I began with. It unapologetically requires python 3.5+ but if you're already there I'd be happy to see and support some of your use cases. Hit me up on github if you want to try it and need some guidance (the docs are nonexistent).

ulyssesv 10 years ago | | |

Could you please elaborate on why async would be preferred over a task queue solution (would it)?

sandGorgon 10 years ago |

I really keep wishing that there would be benchmark comparisons of asyncio/aiohttp with gevent/python2 . Performance would be a killer reason to migrate immediately to Py3.

What I suspect though is that asyncio is not all that better than gevent. Can someone correct me on this?

riyadparvez 10 years ago | |

Is there anything inherent to Python3 that is slower than Python2? Or is it just some of the performant packages still have not been ported to Python3?

sandGorgon 10 years ago | | |

I keep looking for a reason to switch to Python 3 and can't find one. Plus, if I want to use the cool stuff in PyPy, then I'd better not!

Overall, very little reason to consider Py3 at all. Performance would have been one, if there were a comparison between gevent and asyncio.

poooogles 10 years ago | | |

Ironically enough unicode takes its toll on performance. This has mostly been made up though from 3.4 onwards.

velox_io 10 years ago |

The 1 million in the title is misleading (1M per hour is nothing to write home about, only 278/sec). There are frameworks that are able to hit 1M per minute plus (16,666/sec).
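The conversion behind those figures, as a trivial sketch (the function name is just for illustration):

```python
def per_second(requests: int, seconds: float) -> float:
    """Average request rate over a time window."""
    return requests / seconds


HOUR, MINUTE = 3600, 60
print(f"1M per hour   = {per_second(1_000_000, HOUR):,.0f} req/s")
print(f"1M per minute = {per_second(1_000_000, MINUTE):,.0f} req/s")
```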

dpc_pw 10 years ago | |

1 million per hour is nothing...

Here, mioco handling 10M http requests per second(1) on my desktop:

https://github.com/dpc/mioco/blob/master/BENCHMARKS.md

1) with a bit of cheating http server.

With actual proper http parsing it goes down to 368K req/s, but that's still a lot.

jorge_leria 10 years ago | |

1M per minute is something. Could you name those frameworks?

jc4p 10 years ago | | |

Elixir is the name I see thrown around the most when it comes to stuff like this: http://www.phoenixframework.org/blog/the-road-to-2-million-w...

pbz 10 years ago | |

Even 1M per minute is rather pathetic. The game is around multiple millions per second:

https://www.techempower.com/benchmarks/#section=data-r12&hw=...

ben_jones 10 years ago |

Does anyone enjoy doing async work in python? I've done a few hobby projects and honestly I was yearning for javascript + async lib after a while. As great as python is, maybe we should yield async programming to the languages designed for it?

philippb 10 years ago |

I'm the CTO at KeepSafe. We open sourced aiohttp.

We wrote aiohttp for our production system. We build everything on aiohttp. In our production systems we constantly run more requests than in the benchmark, with business logic on each request.

The main reason we like aiohttp a lot is that we can write asynchronous code that reads like synchronous code and does not have callbacks.

takeda 10 years ago |

IMO you should place all requests within a single ClientSession().

This will provide two benefits:

1. You won't need to use a semaphore. To limit connections, create a TCPConnector() object with limit set to the limit you used in the semaphore and pass it to the ClientSession(); aiohttp will then not make more connections than that limit (the default behavior is to have an unlimited number of connections).

2. With a single ClientSession(), aiohttp will make use of keep-alive (i.e. it will reuse the same connections for subsequent requests, but it will keep at most the number of connections you set in the TCPConnector() object).

This should improve performance further, and (given a sane limit) it'll also solve the issue with the "Cannot assign requested address" error.

BTW: Even without a limit set, aiohttp will try to reduce the number of open connections, so it might still fix the connection error as long as individual requests don't take long. It's still a good idea to set a limit, just to be nice to the remote server.
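A minimal sketch of that suggestion, assuming aiohttp is installed. The connector class in aiohttp is `aiohttp.TCPConnector`; `fetch`, `fetch_all`, and the default of 100 connections are illustrative choices, not the article's code.

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    """GET one URL and return the body, reusing the session's connection pool."""
    async with session.get(url) as response:
        return await response.read()


async def fetch_all(urls, max_connections: int = 100):
    """Fetch every URL through one shared session capped at max_connections."""
    # The connector enforces the cap, so no explicit semaphore is needed.
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

Usage would look like `asyncio.run(fetch_all(urls))` with whatever list of URLs is being tested. All requests are scheduled at once, but only `max_connections` sockets are open at a time, and keep-alive lets them be reused.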

nbadg 10 years ago |

First off, awesome to see more benchmarks (even if it's just personal experimentation) for synchronous vs asyncio performance. I think the real argument for asyncio right now is that it makes it very easy for you to write extremely efficient code, even for hobbyist projects. Even though your experiment is only handling 320 req/s, that you were able to do that so quickly and with very, very little optimization is, I think, a testament to the potential for asyncio.

Some pointers:

The event loop is still a single thread and therefore subject to the GIL. That means that at any given time, only one coroutine is running in the loop. This is important for several reasons, but probably the most relevant are that

1. within any given coroutine, execution flow will always be consistent between yield/await statements.

2. synchronous calls within coroutines will block the entire event loop.

3. most of asyncio was not written with thread safety in mind
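Point 2 is easy to see with a small timing sketch. The worker names and the 0.2-second delay are arbitrary; `time.sleep` stands in for any synchronous call made inside a coroutine.

```python
import asyncio
import time


async def blocking_worker(delay: float) -> None:
    time.sleep(delay)  # synchronous: the whole event loop stalls here


async def cooperative_worker(delay: float) -> None:
    await asyncio.sleep(delay)  # yields, so other coroutines can run


async def timed(worker, n: int, delay: float) -> float:
    """Run n copies of worker concurrently and return elapsed wall time."""
    start = time.monotonic()
    await asyncio.gather(*(worker(delay) for _ in range(n)))
    return time.monotonic() - start


def compare(n: int = 3, delay: float = 0.2):
    """Return (blocking_elapsed, cooperative_elapsed) in seconds."""
    blocking = asyncio.run(timed(blocking_worker, n, delay))
    cooperative = asyncio.run(timed(cooperative_worker, n, delay))
    return blocking, cooperative
```

The blocking variant takes roughly `n * delay` because the workers run one after another, while the cooperative variant finishes in about one `delay`.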

That second one is really important. When you're doing file access, eg where you're doing "with open('frank.html', 'rb')", that's something you may want to consider moving into a run_in_executor call. That will block the coroutine, but it will return control to the event loop, allowing other connections to proceed.

Also, more likely than not, the too many open files error is a result of you opening frank.html, not of sockets. I haven't run your code with asyncio in debug mode[1] to verify that, but that would be my intuition. You would probably handle more requests if you changed that -- I would do the file access in a run_in_executor with a max executor workers of 1000. If you want to surpass that, use a process pool instead of a threadpool, and you should be ready to go, though it's worth mentioning that disk IO is hardly ever cpu-bound, so I wouldn't expect you to get much performance boost otherwise.
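A sketch of moving the file read into an executor, under the assumption that the handler only needs the file's bytes. The function names and the pool size of 8 are illustrative (the comment above suggests a much larger pool); `asyncio.get_running_loop` needs Python 3.7+.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: str) -> bytes:
    """Plain blocking read; opens and closes the file in a worker thread."""
    return Path(path).read_bytes()


async def read_file_async(path: str, executor: ThreadPoolExecutor) -> bytes:
    """Await the blocking read without stalling the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, read_file, path)


async def main(path: str) -> bytes:
    # The pool size also caps how many files are open at the same time.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return await read_file_async(path, pool)
```

Because each read runs in a bounded pool and closes its file before returning, the number of simultaneously open file descriptors stays at or below `max_workers`.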

Also, the placement of your semaphore acquisition doesn't make any sense to me. I would create a dedicated coroutine like this:

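    async def bounded_fetch(sem, url):
        # Hold the semaphore only for the duration of one request.
        async with sem:
            return await fetch(url)

and modify the parent function like this:

    for i in range(r):
        # Pass the formatted URL in; bounded_fetch cannot see the loop variable i.
        task = asyncio.ensure_future(bounded_fetch(sem, url.format(i)))
        tasks.append(task)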
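A self-contained variant of that pattern, runnable without a server. `ConcurrencyProbe` is a made-up stub standing in for the article's `fetch`; it records how many calls are in flight so the effect of the semaphore is visible. The URL template is a placeholder.

```python
import asyncio


class ConcurrencyProbe:
    """Stub for fetch(); tracks the peak number of simultaneous calls."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def fetch(self, url: str) -> str:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulated network wait
        self.active -= 1
        return url


async def bounded_fetch(sem, fetch, url):
    # Only `limit` coroutines can be inside this block at once.
    async with sem:
        return await fetch(url)


async def run_all(total: int, limit: int):
    sem = asyncio.Semaphore(limit)
    probe = ConcurrencyProbe()
    url = "http://localhost:8080/{}"  # placeholder, never actually requested
    tasks = [
        asyncio.ensure_future(bounded_fetch(sem, probe.fetch, url.format(i)))
        for i in range(total)
    ]
    results = await asyncio.gather(*tasks)
    return results, probe.peak
```

Running `asyncio.run(run_all(50, 5))` completes all 50 stub requests while the recorded peak never exceeds 5.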

That being said, it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.

[1] https://docs.python.org/3/library/asyncio-dev.html#debug-mod...

henryw 10 years ago |

Looks pretty interesting to do async on python. I once did something similar in node (async by default) with a few lines of code. I think I scraped 12 or 20 million real URLs in 8 hours on a $5 cloud VM. It was limited by network bandwidth.

azinman2 10 years ago |

"Everyone knows that asynchronous code performs better when applied to network operations"

Ummm that seems a bit far reaching.

15155 10 years ago | |

It depends on what "network operations" you are trying to do.

For high-concurrency purposes, asynchronous programming is far more scalable (see: epoll/kqueue + state machines).

For high-throughput, low-concurrency operations, it doesn't matter as much.
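To make the epoll/kqueue part concrete, here is a small readiness-polling sketch using Python's `selectors` module, which picks epoll or kqueue where available. The socket pairs are just a local stand-in for many client connections; the function names are illustrative.

```python
import selectors
import socket


def ready_readers(pairs, timeout: float = 1.0):
    """Return the indices of socket pairs whose read end has data waiting."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS
    try:
        for index, (reader, _writer) in enumerate(pairs):
            reader.setblocking(False)
            sel.register(reader, selectors.EVENT_READ, data=index)
        # One call reports every ready socket; no thread per connection.
        return sorted(key.data for key, _events in sel.select(timeout))
    finally:
        sel.close()


def demo():
    pairs = [socket.socketpair() for _ in range(5)]
    try:
        pairs[1][1].send(b"hello")
        pairs[3][1].send(b"world")
        return ready_readers(pairs)
    finally:
        for reader, writer in pairs:
            reader.close()
            writer.close()
```

`demo()` reports only the two connections that actually have data, which is the property that lets a single thread watch thousands of mostly idle sockets.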

azinman2 10 years ago | | |

I happen to know of a very major tech company whose scale is insane, yet their core C++ code is based on highly tuned blocking threads. It's not a given that async is the only way to scale.

imaginenore 10 years ago |

1,000,000 requests in 52 minutes is just 320 req/sec.

Am I missing something? What's so amazing about this?

I just deployed some production feed that serves at 1955 requests/second on a cheap VPS in freaking PHP, one of the slowest languages out there.

kh_hk 10 years ago | |

> Am I missing something? What's so amazing about this?

The article is not about testing performance of a web server, but showcasing performance differences between synchronous and asynchronous code using asyncio. So, not about serving requests, but consuming.

lossolo 10 years ago | | |

Then he should change the title.

nathancahill 10 years ago | |

I don't care for PHP as much as the next guy, but it's usually in the top 25 of the web framework benchmark (most of the other top langs are Java, Go and C++): https://www.techempower.com/benchmarks/

50CNT 10 years ago | | |

Just curious, what's up with this Ur language at both position 1 and 4? Never heard of it, and probably not experienced enough to make sense of it, but how is it that a language that doesn't even have a full official tutorial to its name beat out java, C++ and Go in those rankings by a factor of >2?

I'm genuinely curious.

aaossa 10 years ago | |

Why do you say it's not amazing? Honestly curious here :)

imaginenore 10 years ago | | |

Because it's trivial.

I would be interested in anything doing 10,000+ req/sec on a cheap VPS. 320 is nothing.

People achieve 2 million requests/second with C++ on EC2:

https://medium.com/swlh/starting-a-tech-startup-with-c-6b5d5...

coldtea 10 years ago | | |

Because it's like Dr Evil asking the UN leaders for "ONE MILLION DOLLARS" to not destroy the world...

https://www.youtube.com/watch?v=cKKHSAE1gIs

merdreubu 10 years ago | | |

Because you can get 540 req/s on Raspberry Pi 2 with Elixir/Phoenix.

http://blog.onfido.com/using-cpus-elixir-on-raspberry-pi2/