Zed Shaw: "poll, epoll, science, and superpoll" with R (sheddingbikes.com)

227 points by tdmackey 15 years ago | 141 comments

jacquesm 15 years ago |

In real-life web serving situations, and not in benchmarks, the majority of the fds is not active. It's the slow guys that kill you.

A client on a fast connection will come in and will pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall clock time and the number of times the 'poll' cycle is done is very small.

But the slowpokes, the ones on dial up and on congested lines will get you every time. They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently, first to see if they've finally completed sending you a request, then to see if they've finally received the last little bit of data that you sent them.

The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

zedshaw 15 years ago | |

So let's take your assertions and take them apart:

> the ones on dial up and on congested lines will get you every time.

Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently

Again, you have no numbers on the active/total ratio in your server, so unless you do, this statement doesn't refute what I found. I've presented evidence that just shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.

> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.

Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.

But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:

> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
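To make the claim concrete, here is a minimal sketch in Python. The function names and the `ATR_BREAK_EVEN` constant are made up for illustration, and 0.6 is only the break-even point reported from these tests, not a universal constant:

```python
# Assumption: 0.6 is the break-even ATR reported in the article's tests;
# it may well differ on other kernels and hardware.
ATR_BREAK_EVEN = 0.6


def active_to_total_ratio(active_fds: int, total_fds: int) -> float:
    """ATR as used in this thread: active descriptors / all watched descriptors."""
    if total_fds <= 0:
        raise ValueError("total_fds must be positive")
    if not 0 <= active_fds <= total_fds:
        raise ValueError("active_fds must be between 0 and total_fds")
    return active_fds / total_fds


def faster_backend(atr: float, break_even: float = ATR_BREAK_EVEN) -> str:
    """Return the call the claim predicts to be faster at this ATR."""
    return "epoll" if atr < break_even else "poll"


if __name__ == "__main__":
    atr = active_to_total_ratio(300, 1000)
    print(atr, faster_backend(atr))
```

Measuring `active_fds` and `total_fds` on a real server and feeding them through something like this is the experiment being asked for.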

jacquesm 15 years ago | | |

So, here are the numbers from one of the webservers that I instrumented to log the active-to-total ratio over a couple of hours.

The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high traffic website. Typically the requests are very short in nature (a 500 byte request followed by a < 10K answer).

After about two hours of running, the active-to-total ratio varied between 10% and 40% for 5 minute intervals, with the majority of the 5 minute buckets around the 30% mark. I'm actually quite surprised at the spread.

The bigger portion of the time seems to be spent waiting for the clients to send the request; most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards. For longer running requests sending more data to the clients, the active-to-total ratios would probably be a bit lower.

So 10% to 40% of all the sockets were active at any given time, the rest was idle, waiting for data to be received or for buffer space to be freed up so data could be written.
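For anyone who wants to collect comparable numbers, this is roughly what such instrumentation could look like. It is a sketch only, not the code used in yawwws; the `AtrLog` class and its method names are hypothetical, and it assumes the event loop can report an active count and a total count on each pass:

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5 minute buckets, matching the intervals quoted above


class AtrLog:
    """Accumulate (active, total) samples and report one ratio per bucket."""

    def __init__(self) -> None:
        # bucket index -> [sum of active counts, sum of total counts]
        self._buckets = defaultdict(lambda: [0, 0])

    def record(self, timestamp: float, active: int, total: int) -> None:
        bucket = int(timestamp // BUCKET_SECONDS)
        self._buckets[bucket][0] += active
        self._buckets[bucket][1] += total

    def ratios(self) -> dict:
        # Skip buckets that never saw any watched descriptors.
        return {b: a / t for b, (a, t) in sorted(self._buckets.items()) if t}


if __name__ == "__main__":
    log = AtrLog()
    log.record(0, 10, 100)
    log.record(299, 30, 100)
    log.record(300, 40, 100)
    print(log.ratios())
```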

In this situation epoll would be faster than poll because epoll only sends the user process those fds that it actually has to deal with rather than all of them, so the loop that takes the output of the system call will have fewer iterations.

So, as I wrote before, I think the typical web server is, when it is dealing with the client-facing side, more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.

This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.

Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.

If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.

bdr 15 years ago | | |

Read "on dial-up" as "slow". The argument depends only on there being a certain distribution of client speeds. It's not about dial-up in particular.

jemfinch 15 years ago | | |

> Do you have numbers on the dial-up users for your server? My understanding is that there's far fewer, so this is bogus. Show evidence of high dial-up penetration first.

He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.

terra_t 15 years ago | |

Yeah, but there's a fetishization of "high concurrency" (being able to support a huge number of connections) rather than absolute performance.

For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.

End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.

Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.

jacquesm 15 years ago | | |

If you could increase the number of connections to 100,000 you would indeed be a genius, because when you bind to a network interface using IPv4 there is a hard limit of the short integer used to indicate the port number, which automatically limits you to 65536 connections (actually a few less; usually you'll lose 3 for stdin, stdout and stderr (which you can close to reuse them) and one for the listen socket).

As far as I know the only way around this is to use multiple IPs (possibly aliases on the same interface) but that would still require a new process.

So even if your per-process limit for fds can be larger than 64K, the network layer or the mapper that turns fds into socket ids for the network stack to work with may impose a restriction. I don't know enough about the Linux kernel to figure out what exactly causes this.

I use the 64K limit on some high throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that but the expense of another process is fairly small so I haven't put in much time to see if I can work around that. Socket to fd mapping presumably takes into account the address as well as the port so it shouldn't be a problem, but on the kernel of the machines where I have to resort to these tricks it appears to be a limit.

Maybe someone with more knowledge of the guts of the Linux kernel can point out why this happens.
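One detail that may bear on this: a TCP connection is identified by the (local address, local port, remote address, remote port) tuple, and sockets returned by accept() keep the listener's port, so the 16-bit port field by itself should not cap a single listener at 65536 connections. Whatever limit showed up on those machines may come from somewhere else. A small check, with a hypothetical helper name, run over loopback:

```python
import socket


def local_ports_of_accepted(n_clients: int = 3):
    """Open n_clients loopback connections and report the ports on each side."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    clients, accepted = [], []
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(n_clients)
        listen_port = listener.getsockname()[1]
        for _ in range(n_clients):
            clients.append(socket.create_connection(("127.0.0.1", listen_port)))
            conn, _addr = listener.accept()
            accepted.append(conn)
        server_side = [s.getsockname()[1] for s in accepted]
        client_side = [s.getsockname()[1] for s in clients]
        return listen_port, server_side, client_side
    finally:
        for s in clients + accepted + [listener]:
            s.close()


if __name__ == "__main__":
    port, server_side, client_side = local_ports_of_accepted()
    print("listening on", port)
    print("server-side local ports:", server_side)
    print("client-side local ports:", client_side)
```

Every accepted socket reports the listening port as its local port; only the client side uses a distinct ephemeral port per connection.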

jakevoytko 15 years ago | |

For simple testing purposes, it is easy to set up a forwarding proxy that drops n% of the packets it receives - for some high value of n. The World Wide Web is far more sadistic, but such a proxy still uncovers some performance or usability problems that are invisible over normal `localhost` traffic. I bet you can also use web servers with traffic shaping to mimic lots of slow connections at once, but I haven't tried that.
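A simpler stand-in for the same idea, for anyone who wants to try it: this is not the packet-dropping proxy described above, just a client that trickles its request a few bytes at a time so the server has to keep the connection around. The `trickle` helper and its defaults are assumptions made up for this sketch:

```python
import socket
import time


def trickle(sock: socket.socket, payload: bytes,
            chunk_size: int = 1, delay: float = 0.05) -> int:
    """Send payload in small chunks with a pause after each; return chunk count."""
    sent_chunks = 0
    for i in range(0, len(payload), chunk_size):
        sock.sendall(payload[i:i + chunk_size])
        sent_chunks += 1
        time.sleep(delay)  # the pause is what makes this client "slow"
    return sent_chunks


if __name__ == "__main__":
    # Demonstrate against a local socket pair instead of a real server.
    a, b = socket.socketpair()
    n = trickle(a, b"GET / HTTP/1.0\r\n\r\n", chunk_size=4, delay=0.01)
    a.close()
    print(n, b.makefile("rb").read())
    b.close()
```

Running many of these against a server at once should push the active-to-total ratio down, which is the condition the thread is arguing about.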

dminor 15 years ago | |

Mongrel2 is supposed to handle WebSockets as well as HTTP, so I think open connections with sporadic traffic are a use case Zed has to worry about.

FooBarWidget 15 years ago |

Zed isn't the only one who has found epoll to be slower than poll. The author of libev basically says the same thing. See http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod and search for EVBACKEND_EPOLL.

I wonder how kqueue behaves compared to poll and epoll. Kqueue has a less stupid interface because it allows you to perform batch updates with a single syscall.

jfager 15 years ago |

It is worth pointing out that the original epoll benchmarks were focused on how performance scaled with the number of dead connections, not performance in general:

http://www.xmailserver.org/linux-patches/nio-improve.html

And as jacquesm points out, in a web-facing server, that's the case you should care about. A 15-20% performance hit in a situation a web-facing server is never going to see doesn't matter when you consider that the 'faster' method is 80% slower (or worse) in lots of real world scenarios.

I'll be interested to see how the superpoll approach ends up working, but my first impression is 'more complexity, not much more benefit'.

pmjordan 15 years ago |

Pardon my ignorance, I haven't built high performance servers at this low a level, but I'm intrigued:

What exactly is the definition of an "active" file descriptor in this context?

My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.
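A small sketch of that difference using Python's `select` module, which wraps the same calls (Linux only because of epoll). One caveat: Python's `poll()` wrapper already drops entries with no events before returning, so the linear scan described above happens inside the wrapper rather than in this code. The `ready_fds` helper is made up for the example:

```python
import os
import select


def ready_fds(n_total: int, n_active: int):
    """Watch n_total pipes, make n_active readable, ask poll and epoll who is ready."""
    pipes = [os.pipe() for _ in range(n_total)]
    try:
        for _r, w in pipes[:n_active]:
            os.write(w, b"x")  # one byte makes the read end "active"

        # poll: the whole interest set is handed to the kernel on every call.
        p = select.poll()
        for r, _w in pipes:
            p.register(r, select.POLLIN)
        via_poll = sorted(fd for fd, _ev in p.poll(0))

        # epoll: fds are registered once; each wait returns only the ready ones.
        ep = select.epoll()
        try:
            for r, _w in pipes:
                ep.register(r, select.EPOLLIN)
            via_epoll = sorted(fd for fd, _ev in ep.poll(0))
        finally:
            ep.close()
        return via_poll, via_epoll
    finally:
        for r, w in pipes:
            os.close(r)
            os.close(w)


if __name__ == "__main__":
    via_poll, via_epoll = ready_fds(10, 3)
    print(len(via_poll), "of 10 active; ATR =", len(via_poll) / 10)
    print(via_poll == via_epoll)
```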

EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.

If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.
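A rough sketch of that hybrid, assuming Linux, where an epoll descriptor can itself be watched with poll(). This is not Zed's superpoll code; `hybrid_wait` is a hypothetical name, and a real server would keep the poll and epoll objects alive between calls instead of rebuilding them each time:

```python
import os
import select


def hybrid_wait(hot_fds, idle_fds, timeout_ms: int = 0):
    """One poll() over the hot fds plus a single epoll fd standing in for the idle ones."""
    ep = select.epoll()
    try:
        for fd in idle_fds:
            ep.register(fd, select.EPOLLIN)

        p = select.poll()
        for fd in hot_fds:
            p.register(fd, select.POLLIN)
        p.register(ep.fileno(), select.POLLIN)  # the epoll fd is itself pollable

        ready = []
        for fd, _ev in p.poll(timeout_ms):
            if fd == ep.fileno():
                # Some idle fd woke up: ask epoll which one(s).
                ready.extend(f for f, _e in ep.poll(0))
            else:
                ready.append(fd)
        return sorted(ready)
    finally:
        ep.close()


if __name__ == "__main__":
    pipes = [os.pipe() for _ in range(4)]
    hot = [pipes[0][0], pipes[1][0]]
    idle = [pipes[2][0], pipes[3][0]]
    os.write(pipes[0][1], b"x")  # a hot fd becomes readable
    os.write(pipes[3][1], b"x")  # an idle fd becomes readable
    print(hybrid_wait(hot, idle))
    for r, w in pipes:
        os.close(r)
        os.close(w)
```

The open question from the thread remains: deciding which fds count as "hot", and moving them between the two sets, is not shown here.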

axod 15 years ago |

Sounds like premature optimization to me. Is this really the bottleneck? Is the extra complexity and logic really going to be a net win?

frognibble 15 years ago |

The blog post does not say if the epoll code uses level triggering or edge triggering. It would be interesting to see the results for both modes. The smaller number of system calls required for edge triggering might make a difference in performance.

zedshaw 15 years ago | |

That's entirely possible, but then you pay a penalty in complexity because you have to keep track of missed events yourself. I think (unproven) that it's actually a wash because of this.

frognibble 15 years ago | | |

At most, you need to track a couple of booleans per socket, one for read and one for write.

Depending on what you are doing, you might not even need to track these booleans. For example, on the read side you can ignore read events when you are not interested in reading. When you switch back to read interest, you can read the socket to see if data arrived while you ignored events. A similar strategy can be used on the write side.
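On the read side, the usual pattern with edge triggering is to keep reading a non-blocking descriptor until the kernel says there is nothing left, which is what makes the extra bookkeeping small. A minimal sketch on a pipe (Linux only; `drain` and `read_on_edge` are names invented for this example, and the payload must fit in the pipe buffer):

```python
import os
import select


def drain(fd: int, bufsize: int = 4096) -> bytes:
    """Read a non-blocking fd until the kernel reports it is empty."""
    chunks = []
    while True:
        try:
            data = os.read(fd, bufsize)
        except BlockingIOError:  # EAGAIN: nothing left, wait for the next edge
            break
        if not data:             # EOF: the writer closed its end
            break
        chunks.append(data)
    return b"".join(chunks)


def read_on_edge(payload: bytes) -> bytes:
    """Register a pipe with EPOLLET, write once, and drain on the single event."""
    r, w = os.pipe()
    os.set_blocking(r, False)
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN | select.EPOLLET)
        os.write(w, payload)
        received = b""
        for fd, _ev in ep.poll(1):
            received += drain(fd)
        return received
    finally:
        ep.close()
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    data = b"abc" * 3000
    print(read_on_edge(data) == data)
```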

KirinDave 15 years ago |

Is it just me, or did Zed not describe his testing methodology in any detail?

I can't even find a reference to his OS configuration and version details that he's developing on, which seems to me like a critical detail.

zedshaw 15 years ago | |

There's the pipetest.c file that everyone uses (since 2002) linked off that blog post, but I got tired and went to sleep.

Today I'm crafting how I ran the tests and releasing all the code and asking everyone to test my results. I am completely assuming I am wrong so looking for other people to test it.

Incidentally, if you google for "pipetest.c" you'll see it's kind of the gold standard for this comparison, so if that code is wrong, then the entire assumption that epoll is better needs to be redone.

KirinDave 15 years ago | | |

Okay. And I appreciate that, I'll look.

To make your process scientific, I'd like to suggest you add the following things to the post when you find it convenient:

1. A detailed explanation of your methodology, preferably with source code. This is so we can reproduce the tests. The ability to reproduce your work is a critical part of any process calling itself science.

2. A detailed list of the hardware you used & its deployment. (For reasons listed above).

3. Your raw data should be made available upon request so other people can work it as well.

P.S., aren't you concerned about I/O overhead with your superpoll proposal? It seems like the added resource allocation and the time spent in zeromq is going to eat up the small advantages you gain?

kunley 15 years ago |

Cool experiment Mr Zed, but what about kqueue?

It seems superior to both *poll minions. Would be great if you proved/falsified this thesis as well.

silentbicycle 15 years ago | |

kqueue is on OpenBSD and FreeBSD, while epoll is from Linux. (poll and select are on both)

kunley 15 years ago | | |

I'm aware of it (you forgot to mention that kqueue is on OS X as well). So what?

There are probably hordes of people who will be willing to run Mongrel2 on *BSD platforms, precisely because of the performance reasons. And Zed is a famous tinkerer rather than a religious zealot, so very probably he could be interested in checking kqueue as well.

"Why not" is also a good reason for a hacker when he's lacking other reasons.

gthank 15 years ago | | |

What about NetBSD? Zed has already said he uses NetBSD, so if kqueue is there, he might add it to the mix.

bch 15 years ago | | |

NetBSD also supports kqueue

zedshaw 15 years ago | |

Well, I haven't tried kqueue, but IIRC it has its own set of problems. Mainly that you can't kqueue certain types of file descriptors like ptys. I'd have to look into that, but I'm sure I'll have some kind of thing going about it soon.

kqueue 15 years ago |

Let's assume we have 20k opened FDs.

In case of poll(), you have to transfer this array of FDs from the userland vm to the kernel vm each time you call poll(). Now compare this with epoll() (let's assume we are using EPOLLET trigger), when you only have to transfer the file descriptors once.

You might say the copying won't matter, but it will matter when you have a lot of events coming on the 20k FDs which eventually leads to calling xpoll() at a higher rate, hence more copying of data between the userland and kernel (4bytes * 20k, ~80kbytes each call).
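As a rough check on that arithmetic: poll() takes an array of `struct pollfd` (an `int` fd plus two `short` fields), which comes to 8 bytes per entry on common platforms rather than 4, so the per-call copy for 20k fds would be closer to 160 kB. A short sketch, with a hypothetical helper name:

```python
import struct

# struct pollfd { int fd; short events; short revents; } in native layout.
POLLFD_SIZE = struct.calcsize("ihh")


def poll_bytes_copied_in(n_fds: int) -> int:
    """Bytes of pollfd entries handed to the kernel on one poll() call."""
    return n_fds * POLLFD_SIZE


if __name__ == "__main__":
    print(POLLFD_SIZE, "bytes per entry")
    print(poll_bytes_copied_in(20_000), "bytes per call for 20k fds")
```

This counts only the array passed in; the kernel also has to write results back, so the real traffic per call is somewhat higher.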

zedshaw 15 years ago | |

Yep, that's what I thought too, that at least epoll would be as fast. Turns out it's not though, but then I could be wrong.

Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that the extra overhead and complexity of using edge trigger right makes EPOLLET pointless.

kqueue 15 years ago | | |

Sorry, I meant level-triggered. :) I think edge-triggered does add an extra overhead as you stated.

pphaneuf 15 years ago | | |

Why would there be extra overhead when using edge triggered? There's definitely extra complexity on the client side, but it's close to what you're trying to do with super-poll (the extra complexity is basically to find out when an fd isn't busy anymore).

I think it might even be faster, kernel-side. From what I remember of the implementation, both modes have to walk the same list of ready fds, but that list is shorter in edge triggered mode, because they get removed from the list as it goes.

Edge triggered might have more overhead if many fds change between ready/not-ready quickly, but that's quite the wacky situation (and if it has an even distribution, would ensure your ATR is about 0.5, so probably still winning).

FooBarWidget 15 years ago | |

Why would there be any copying? The kernel can directly read userspace memory.

kqueue 15 years ago | | |

For the kernel to execute a system call, it has to place the arguments on its stack. A system call doesn't execute in the userland.

phintjens 15 years ago |

Zed, what's with all the premature optimization? Surely Mongrel2 should first be able to make coffee, build you an island and f@!in transform into a jet and fly you there, before you start to make it faster!

Just kidding. It's always nice to see science in action. Great work! I suspect there's an impact on ZeroMQ's own poll/epoll strategy.

jaekwon 15 years ago |

0.6 is so arbitrary. it should be 1.0/golden-ratio.

zedshaw 15 years ago | |

I was hoping for e, but alas no luck.

pphaneuf 15 years ago | | |

The best would be if it would be possible to code up superpoll to be adaptive, and in effect, benchmark itself to come to the same conclusion, dynamically. So if one day the kernel people fixed epoll to be better all the time, Mongrel2 would magically not use poll() much on systems using that kernel, and favour epoll.

Of course, that's often Kinda Hard To Do (tm). ;-)
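One very simplified way to picture the adaptive idea: keep a smoothed cost per backend from timings the server takes itself, and prefer whichever is currently cheaper. Everything here is an assumption for illustration (the `AdaptiveBackend` class, the smoothing factor, the idea of timing each wait call); it says nothing about how superpoll is or will be implemented:

```python
class AdaptiveBackend:
    """Pick whichever backend has the lower smoothed per-call cost."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # smoothing factor for the moving average (assumed)
        self.cost = {"poll": None, "epoll": None}

    def observe(self, backend: str, seconds: float) -> None:
        """Fold one measured call duration into that backend's moving average."""
        prev = self.cost[backend]
        if prev is None:
            self.cost[backend] = seconds
        else:
            self.cost[backend] = self.alpha * seconds + (1 - self.alpha) * prev

    def choose(self) -> str:
        # Until both backends have been measured, try the unmeasured one.
        for name, cost in self.cost.items():
            if cost is None:
                return name
        return min(self.cost, key=self.cost.get)


if __name__ == "__main__":
    chooser = AdaptiveBackend()
    chooser.observe("poll", 2.0e-4)
    chooser.observe("epoll", 1.0e-4)
    print(chooser.choose())
```

The hard part mentioned above is untouched by this sketch: getting fair timings for the backend that is not currently in use, without slowing down the one that is.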

aston 15 years ago | | |

It's pretty darn close to 1 - 1/e.

pphaneuf 15 years ago |

Question: as the ATR is going higher, so would the proportional time spent in poll or epoll, no?

So if you have a thousand fds, and they're all active, you have to deal with a thousand fds, which would make the difference between poll and epoll insignificant (only twice as fast, not even an order of magnitude!)?

This would make the micro-benchmark quite micro! Annoyingly enough, I think that means that the real way to find out would be an httperf run with each backend. A lot more work...

16s 15 years ago |

Very nice write-up. Little details such as this should make Mongrel2 very solid. It's nice to see how he analyzed the issues around poll and epoll and then figured out how to make use of both for optimum performance no matter what happens in production. Many other programs could benefit from this sort of analysis although at different levels... e.g. Sorted vectors may be better for smaller containers but hash tables better for larger containers, etc.

lukesandberg 15 years ago |

interesting article! Is 'super-poll' done yet? i would have liked to see a super poll line on some of those graphs to see how it compares to just vanilla poll and epoll at different ATRs. Though i guess you would also have to test for situations where ATR varies over time (so that you could measure the impact of moving fds back and forth).

c00p3r 15 years ago |

It is a little wonder why this kind of people think that everyone else are just stupid to realize such things. What they want is a fame and followers. (btw, don't you forget to donate!)

hint: nginx/src/event/modules/ngx_epoll_module.c

Maybe one should learn how to use epoll and, perhaps, how to program? ^_^