Zed Shaw: "poll, epoll, science, and superpoll" with R(sheddingbikes.com) |
A client on a fast connection will come in and will pull the data as fast as the server can spit it out, keeping the process and the buffers occupied for the minimum amount of wall clock time and the number of times the 'poll' cycle is done is very small.
But the slowpokes, the ones on dial up and on congested lines will get you every time. They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently, first to see if they've finally completed sending you a request, then to see if they've finally received the last little bit of data that you sent them.
The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
> the ones on dial up and on congested lines will get you every time.
Do you have numbers on the dial-up users for your server? My understanding is that there are far fewer, so this is bogus. Show evidence of high dial-up penetration first.
> They keep the processes busy far longer than you'd want and you have to hit the 'poll' cycle far more frequently
Again, you have no numbers on the active/total ratio in your server, so unless you do this statement doesn't refute what I found. I've presented evidence that just shows the math of O(N=active) / O(N=total) holds up. Simple math. The only way epoll wins for all load types is if it is as fast as poll all the time. My tests show it's not, which stands to reason since it's implemented using more syscalls than poll.
> The impact of this is very easy to underestimate, and if you're benchmarking web servers for real world conditions you could do a lot worse than to run a test across a line that is congested on purpose.
Again, you have no definition of "congestion". If you adopt a simple metric like ATR then we can talk. As it is, you (and everyone else) just throw around latency numbers like those matter when really the performance break is in the ATR. In addition, my numbers show the performance break being at about 60% ATR, so if you're saying that no server ever goes above 60% activity levels then you're totally wrong. 60% is not completely unreasonable on a loaded server.
But, I think you're missing a key point: You need both in a server like Mongrel2. I never said epoll sucks and poll rocks (since you probably didn't read the article). I said something very exact and measurable:
> epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.
If you don't think that's the case in "the real world" then go measure it and report back. That's the science part. I totally don't believe it yet myself, which is why I'm measuring it and showing the methods to everyone so they can confirm it for me.
The webserver is a custom job called yawwws (yet-another-www-server) that is used to serve up a variety of bits and pieces for a high traffic website. Typically the requests are very short in nature (a 500 byte request followed by a < 10K answer).
After about two hours of running the active-to-total ratio varied between 10% and 40% for 5 minute intervals, with the majority of the 5 minute buckets around the 30% mark. I'm actually quite surprised at the spread.
The bigger portion of the time seems to be spent waiting for the clients to send the request, most if not all of the output data should fit in the TCP output buffers, so that actually skews the results upwards, for longer running requests sending more data to the clients the active-to-total ratios would probably be a bit lower.
So 10% to 40% of all the sockets were active at any given time, the rest was idle, waiting for data to be received or for buffer space to be freed up so data could be written.
In this situation epoll would be faster than poll because epoll only sends the user process those fds that it actually has to deal with rather than all of them, so the loop that takes the output of the system call will have fewer iterations.
So, as I wrote before, I think the typical web server is, when it is dealing with the client facing side more often than not waiting for the client to do something, and it seems that on my server that hasn't changed since I last looked at it.
This server runs with keepalive off. Switching it on will most likely make the active-to-total ratio dramatically lower but I don't feel like pissing off a large number of users just to see how bad it could get. There is a good chance that my socket pool will turn out to be too small to do this without damage.
Chances are that for different workloads the percentages will vary but this setup is fairly typical (single threaded server, all requests served from memory) so I wouldn't expect to see too much variation on different sites, and if there is variation I'd expect it to go down rather than up.
If I get a chance I'll re-run the test on some other websites to see if the numbers come out comparable or are wildly different.
He doesn't need to show that it's high, only that it's high enough to cause a significant contingent of ordinary webservers' requests to be lingering slow connections.
For instance, you might have a system which has a latency of 1 second, and at a given workload, you have 10,000 connections. In the Java culture, people think you're a genius if you can increase those connections to 100,000 and increase the latency to 10 seconds.
End users, on the other hand, would be happier if you cut the latency to 0.1 seconds, but there are a lot of people who'll then think you're a loser who can only manage to handle 1000 concurrent connections.
Of course, getting that latency down is a holistic process that requires you to think about the client, the server, and what exactly goes over the wire.
As far as I know the only way around this is to use multiple IPs (possibly aliases on the same interface) but that would still require a new process.
So even if your per-process limit for fds can be larger than 64K the network layer or the mapper that turns fds in to socket ids for the network stack to work with may impose a restriction. I don't know enough about the linux kernel to figure out what exactly causes this.
I use the 64K limit on some high throughput machines (mostly video and image servers), but when I go over that I need to start another process. Possibly there's a way around that but the expense of another process is fairly small so I haven't put in much time to see if I can work around that. Socket to fd mapping presumably takes into account the address as well as the port so it shouldn't be a problem but on the kernel of the machines where I have to resort to these tricks it appears to be a limit.
Maybe someone with more knowledge of the guts of the linux kernel can point out why this happens.
I wonder how kqueue behaves compared to poll and epoll. Kqueue has a less stupid interface because it allows you to perform batch updates with a single syscall.
http://www.xmailserver.org/linux-patches/nio-improve.html
And as jacquesm points out, in a web-facing server, that's the case you should care about. A 15-20% performance hit in a situation a web-facing server is never going to see doesn't matter when you consider that the 'faster' method is 80% slower (or worse) in lots of real world scenarios.
I'll be interested to see how the superpoll approach ends up working, but my first impression is 'more complexity, not much more benefit'.
What exactly is the definition of an "active" file descriptor in this context?
My best guess after reading the man pages is that poll() takes an array of file descriptors to monitor and sets flags in the relevant array entries, which your code then needs to scan linearly for changes, whereas epoll_wait() gives you an array of events, thus avoiding checking file descriptors which haven't received any events. Active file descriptors would therefore be those that did indeed receive an event during the call.
EDIT: thanks for pointing out Zed's "superpoll" idea. I somehow completely missed that paragraph in the article, which makes the following paragraph redundant.
If this is correct, it sounds to me (naive as I am) as if some kind of hybrid approach would be the most efficient: stuff the idling/lagging connections into an epoll pool and add the pool's file descriptor to the array of "live" connections you use with poll(). That of course assumes you can identify a set of fds which are indeed most active.
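The hybrid idea described here can be sketched in a few lines. This is only an illustration under assumptions, not how Mongrel2 does it: the function name `hybrid_poll_once` and the hot/idle split are made up, and the sketch is Linux only because it needs epoll. It relies on the fact that an epoll fd is itself pollable and reads as ready when any fd registered with it has a pending event.

```python
import os
import select


def hybrid_poll_once(hot_fds, idle_fds, timeout_ms=0):
    """Poll the hot fds directly and the idle fds through a nested epoll set.

    Returns the sorted list of fds that are readable. Hypothetical helper,
    Linux only. A real server would keep both sets alive between calls
    instead of rebuilding them each time.
    """
    ep = select.epoll()
    try:
        for fd in idle_fds:
            ep.register(fd, select.EPOLLIN)

        poller = select.poll()
        for fd in hot_fds:
            poller.register(fd, select.POLLIN)
        # The epoll fd sits in the poll set as one extra entry.
        poller.register(ep.fileno(), select.POLLIN)

        ready = []
        for fd, events in poller.poll(timeout_ms):
            if fd == ep.fileno():
                # Only ask epoll for details when it reports activity.
                ready.extend(f for f, _ in ep.poll(0))
            elif events & select.POLLIN:
                ready.append(fd)
        return sorted(ready)
    finally:
        ep.close()
```

The open question raised in the thread still applies: moving an fd between the two sets costs one registration call per fd, which this sketch does not try to measure.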
Depending on what you are doing, you might not even need to track these booleans. For example, on the read side you can ignore read events when you are not interested in reading. When you switch back to read interest, you can read the socket to see if data arrived while you ignored events. A similar strategy can be used on the write side.
I can't even find a reference to his OS configuration and version details that he's developing on, which seems to me like a critical detail.
Today I'm crafting how I ran the tests and releasing all the code and asking everyone to test my results. I am completely assuming I am wrong so looking for other people to test it.
Incidentally, if you google for "pipetest.c" you'll see it's kind of the gold standard for this comparison, so if that code is wrong, then the entire assumption that epoll is better needs to be redone.
To make your process scientific, I'd like to suggest you add the following things to the post when you find it convenient:
1. A detailed explanation of your methodology, preferably with source code. This is so we can reproduce the tests. The ability to reproduce your work is a critical part of any process calling itself science.
2. A detailed list of the hardware you used & its deployment. (For reasons listed above).
3. Your raw data should be made available upon request so other people can work it as well.
P.S., aren't you concerned about I/O overhead with your superpoll proposal? It seems like the added resource allocation and the time spent in zeromq is going to eat up the small advantages you gain?
It seems superior to both *poll minions. Would be great if you proved/falsified this thesis as well.
There are probably hordes of people who will be willing to run Mongrel2 on *BSD platforms, precisely because of the performance reasons. And Zed is a famous tinkerer rather than a religious zealot, so very probably he could be interested in checking kqueue as well.
"Why not" is also a good reason for a hacker when he's lacking other reasons.
In case of poll(), you have to transfer this array of FDs from the userland vm to the kernel vm each time you call poll(). Now compare this with epoll() (let's assume we are using EPOLLET trigger), when you only have to transfer the file descriptors once.
You might say the copying won't matter, but it will matter when you have a lot of events coming on the 20k FDs which eventually leads to calling xpoll() at a higher rate, hence more copying of data between the userland and kernel (4bytes * 20k, ~80kbytes each call).
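The copy-size estimate can be checked with a little arithmetic. A sketch, assuming the usual Linux layout of `struct pollfd` (an `int fd` plus two shorts, `events` and `revents`); the helper name `poll_copy_bytes` is made up for illustration. Counting only the 4-byte fd gives the ~80 KB figure quoted, while counting the whole struct gives about twice that per call.

```python
import struct

# Assumed layout of struct pollfd on common Linux ABIs:
#   int fd; short events; short revents;
POLLFD_FORMAT = "ihh"


def poll_copy_bytes(num_fds, fmt=POLLFD_FORMAT):
    """Bytes of pollfd data handed to the kernel for one poll() call."""
    return num_fds * struct.calcsize(fmt)


fd_only_bytes = 4 * 20_000                  # the estimate from the comment
full_struct_bytes = poll_copy_bytes(20_000)  # whole struct per fd
```

Either way the point stands: with poll the array crosses the user/kernel boundary on every call, whereas epoll registers each fd once.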
Also, your assumption of EPOLLET is potentially wrong. I think (unproven) that the extra overhead and complexity of using edge trigger right makes EPOLLET pointless.
I think it might even be faster, kernel-side. From what I remember of the implementation, both modes have to walk the same list of ready fds, but that list is shorter in edge triggered mode, because they get removed from the list as it goes.
Edge triggered might have more overhead if many fds change between ready/not-ready quickly, but that's quite the wacky situation (and if it has an even distribution, would ensure your ATR is about 0.5, so probably still winning).
Just kidding. It's always nice to see science in action. Great work! I suspect there's an impact on ZeroMQ's own poll/epoll strategy.
Of course, that's often Kinda Hard To Do (tm). ;-)
So if you have a thousand fds, and they're all active, you have to deal with a thousand fds, which would make the difference between poll and epoll insignificant (only twice as fast, not even an order of magnitude!)?
This would make the micro-benchmark quite micro! Annoyingly enough, I think that means that the real way to find out would be an httperf run with each backend. A lot more work...
hint: nginx/src/event/modules/ngx_epoll_module.c
Maybe one should learn how to use epoll and, perhaps, how to program? ^_^
Yes, but where's the evidence what people see for active/total ratios in the real world? I'm showing that unless it's below about 60% (probably more like 50%) then poll is the way to go.
60% active isn't unrealistic at all. I can see quite a few servers hitting those thresholds, so in those cases, poll vs. epoll doesn't matter.
I think what's more important in what I'm finding is that you really need both. It's entirely possible that you have servers that are at 80-90% ATR all the time. Others that are 10% ATR. The key is either you have to measure that, which nobody does, or you have to make a server that can adapt.
Yes Zed, where the fuck is it? You're claiming SCIENCE! based on your worst-case synthetic localhost benchmarks, and then turning around and wildly guessing as to real-world performance characteristics with internet latencies.
Worse, your whole thesis hinges off of ATR but you made no effort to measure it anywhere, instead you're passive-aggressively berating us to do it.
I'd be curious if you have any evidence that this occurs in practice. Even a busy server with clients of uniform + low latency, intuitively I'd expect fairly low ATRs.
> I think what's more important in what I'm finding is that you really need both.
I'm not sure you do: the performance advantage of poll seems marginal at best. When ATR is high, you're presumably doing enough real work that the slight overhead of epoll vs. poll is probably not super important.
If a site gets spiked with the typical 'read-and-leave' traffic a link from reddit or huffpo or wherever generates, how does superpoll compare to straight epoll? Based on your description so far, I can only see it hurting - you're not just wasting time on dead connections in your poll bin, you're now also incurring the overhead of managing the migration over to the epoll bin.
The difference between poll and epoll is that, given an input of N file descriptors, poll returns all N file descriptors and you need to loop through each one of them to check whether the 'active' flag is set on there. epoll just returns all the active file descriptors so that you don't need to loop through the inactive ones.
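A toy model of that difference, with no real sockets involved. The function names are hypothetical and the "cost" is just the number of entries the caller has to look at after the call returns; it says nothing about what happens inside the kernel.

```python
def scan_poll_style(revents):
    """revents holds one flag per watched fd, the way poll() fills its array.

    The caller has to walk all N entries to find the active ones.
    Returns (active_indices, entries_checked).
    """
    active = []
    checked = 0
    for index, flag in enumerate(revents):
        checked += 1
        if flag:
            active.append(index)
    return active, checked


def scan_epoll_style(ready_indices):
    """epoll_wait() hands back only the ready entries, so the caller's
    loop length equals the number of active fds.

    Returns (active_indices, entries_checked).
    """
    active = list(ready_indices)
    return active, len(active)


# 8 watched fds, 2 of them active (ATR = 0.25)
poll_result = scan_poll_style([0, 1, 0, 0, 1, 0, 0, 0])
epoll_result = scan_epoll_style([1, 4])
```

When every fd is active the two loops are the same length, which is why the userland scan alone cannot explain poll winning at high ratios.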
A hybrid approach, as Zed has suggested, would appear to be more efficient on the surface. It remains to be seen whether it can actually be implemented efficiently because migrating fds from/to epoll is extremely expensive, requiring a single syscall per fd.
But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches. Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
> But if you ask me, the real solution is to have the kernel team fix their epoll implementation performance issues instead of forcing people to work around it with hybrid approaches.
That does indeed sound like a better conclusion.
> Other than the stupid single-syscall-per-fd requirement, there's nothing in epoll's interface that would force it to perform worse than poll when the active/total ratio is high.
I don't see a reason why the syscall-per-fd couldn't easily be replaced/augmented with a single mass add/remove syscall which takes an array. The worse performance seems similarly baffling; it almost sounds as if they had some kind of inefficient data structure holding the file descriptor pool; considering poll() uses a flat array and epoll uses set operations I assume it's pretty tricky to make it perform well, even with a hash table. Maybe set operations aren't the best way to handle this data structure; but only some profiling in the kernel code can tell us that.
Obviously it'll take until 2.6.37 at least for any changes to enter the mainstream kernel, and until then a hybrid approach sounds sensible for those unwilling to patch. But still, fixing the root problem seems like a worthwhile cause.
active_fds = poll(big_ass_array_of_fds, total_fds)
epoll is slightly different but same concept. You have a total number of FDs you want to know about, and each call returns a number that have had activity.
And that's it. You then just do active_fds/total_fds and that gives the ATR. If this is < 0.6 after your call to poll, then that call to poll would have been better done with epoll. If active_fds/total_fds is > 0.6 then it's better to stick with poll.
Of course, it's more complicated than that, but this gives you a simple metric of the break point where one is better than another.
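A minimal sketch of taking that measurement for one call, assuming a list of readable-side fds. `measure_atr` and `pick_backend` are hypothetical names, and the 0.6 threshold is the break-even figure from the article's benchmark rather than a universal constant. Python's `poll()` returns only the ready entries, so their count plays the role of `active_fds`.

```python
import os
import select


def measure_atr(read_fds, timeout_ms=0):
    """Return active/total for a single poll() call over read_fds."""
    poller = select.poll()
    for fd in read_fds:
        poller.register(fd, select.POLLIN)
    active_fds = len(poller.poll(timeout_ms))
    return active_fds / len(read_fds)


def pick_backend(atr, threshold=0.6):
    """Below the threshold epoll would have been the better call.

    The exact break-even point is treated as 'stay with poll' here,
    which is an arbitrary choice for the sketch.
    """
    return "epoll" if atr < threshold else "poll"
```

A server would want to average this over many calls before acting on it, since a single sample is noisy.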
To put it in another way, if you were to use blocking IO then an operation on an active descriptor would not block. Of course poll and epoll are all about asynchronous IO (so non-blocking by definition) but that's a good way to describe the difference.
Zed's 'superpoll' is precisely what you suggest.
> Zed's 'superpoll' is precisely what you suggest.
Facepalm. Thanks, I mysteriously missed that part of the article.
That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program.
By doing these tests in isolation instead of while running on a profiling kernel under production load it is very well possible that the bottleneck will not be the polling code at all but something entirely different. I'd say that this is a textbook example of what premature optimisation is all about.
Assuming you have a finite budget of time to spend on a project, any optimisations that take time out of that budget that could have been spent more effectively elsewhere are premature.
Now there is a chance that this would have been the bottleneck in the completed system, but before you've got a complete system you can't really tell. My guess based on real world experience with lots of system level code that used both, including web servers, video servers, streaming audio servers and so on is that the overhead of poll/epoll will be relatively minor compared to other parts of the code and the massive amount of IO that typically follows a poll or an epoll call.
If you have 10K sockets open then typically poll/epoll will return a large number of 'active' descriptors, you'll then be doing IO on all of those for that single call to poll/epoll.
Each of those IO calls is probably going to be as much or more work to process than the poll call was.
Maybe Java does some of this cool stuff already so perhaps I'm shielded from the pain of dealing with things directly.
In the past I've written Java NIO code that dealt with around 60,000 concurrent connections pretty well. The time spent doing poll seemed to be completely insignificant. CPU usage was negligible.
It'd be good to see some numbers though - for example:
For average mongrel application, 40% of CPU time is spent in poll / average of 30ms latency is due to poll etc.
But I'm skeptical those numbers are true. That was my point.
If you don't start with those numbers and measurements, optimizations like this, whilst interesting, may end up being of no real use to anyone.
You're probably right that when you actually use Mongrel2 as your app server your app-specific code higher up will be a larger bottleneck, but that's code that you have to deal with and this is code that he has to deal with so optimizing the hell out of it doesn't sound like a bad idea.
That's 0.5% of your total time being spent here. So even if it's made twice as fast, your app will only speed up by say 100ms -> 99.75ms
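The arithmetic behind that figure, as a small sketch. The function name is made up and the 100 ms request time is the example value from the comment: only the fraction of time spent in the polling code shrinks, the rest is untouched.

```python
def time_after_speedup(total_ms, fraction, speedup):
    """Total time once the given fraction of it runs `speedup` times faster."""
    untouched = total_ms * (1 - fraction)
    improved = total_ms * fraction / speedup
    return untouched + improved


# 0.5% of a 100 ms request spent polling, polling made twice as fast
new_total_ms = time_after_speedup(100.0, 0.005, 2.0)
```

Even an infinitely fast poll would only bring the total down to 99.5 ms under these assumptions.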
Find the big things that matter and optimize them. Adding extra complexity to small things that don't matter is a recipe for more bugs and more issues.
Fortunately, Zed is the right guy to find this out. I'm certainly looking forward to the results of this--which I bet we'll have an initial answer to by tomorrow.
http://www.linuxinsight.com/ols2004_comparing_and_evaluating...
Yeah, I'm skeptical those numbers are true too, but then again we're talking about totally different numbers.
In other words, I've given a metric: ATR at 60% is the break even point for poll vs. epoll. So far the only responses I've got haven't even tried to give out a metric, let alone say what their actual ATR is, but they claim that it's low.
I'm a scientist, so in the same way I don't believe my own research, I don't believe their rhetoric.
I've never come across a scientist that took criticism of their work the way you do and that responded in the way you do. Shouting down, deriding, insulting and in general being a jerk to those that don't agree with you because 'you're a scientist' is not the way of science.
So, consider me the Richard Dawkins of epoll.
Of course all the little bits help and I'm happy to see someone pay attention to detail like this but normally speaking you should get to the point where you're shifting data in real life situations and you can hook up a profiler to make the decision. You have less to blog about like that but the difference between poll and epoll is not large enough that you would spend more time going from the one to the other than was spent analysing this and writing the post.
Optimisations like this are best left to when you have things working, first make it work, then make it fast.
That right there is how you correctly choose how to implement things, contrary to popular belief. Think first, write code later.
> Someone actually took the time to sit down, ponder what kind of workloads will be handled by his application
The most important point made in this thread is that Zed actually didn't do that, but did benchmarks for a range of workloads, suggesting he intends Mongrel2 to be able to handle all of them. The question is whether that is necessary. If ATR > 0.6 never happens in practice, it will only unnecessarily complicate Mongrel2.
Anyway, we do have something working. You can deploy Mongrel2 right now, and I do all the time. You can actually measure how fast it is, although that's going to be slow since we haven't done much tuning so kind of pointless.
So, again you're wrong. We are at a point in the design where this measurement and analysis matters, and we have something that actually works to put it in. You should probably maybe go do some actual reading instead of posting here like you know what you're talking about.
"There are no servers that have an ATR of > 80%."
That's easy to test, and I'm damn positive you could find some that disprove your assertion.
More importantly though, you have this assertion:
"Using both poll and epoll has no advantage in performance."
Again, who knows, that's why I'm testing and trying out. That's the science part, since I've got no idea, but I'll give it a shot. And now that I've done an analysis that tells me what really matters, I'll be able to do very good tests for the different kinds of loads.
Incidentally, when people run performance tests against web servers to see how fast they serve files they're testing the server with an ATR at around 100%. Food for thought.
I cannot collect this data because I don't possess a sufficiently high load web server. Go forth and measure, but measure useful information.
Of course I'm going to keep doing this, but if you say that my test is invalid, then all of the tests people did to justify epoll are invalid.
Any complexity introduced in the code increases its long term maintenance cost. I strongly suspect this is one of those cases where the performance gain will not justify the long-term effort of maintaining a more complex architecture.
I remember having a similar discussion circa 1992 about advantages and disadvantages of using ODBC versus native MS SQL/Sybase libraries. I instrumented the program I was writing and showed it spent 99+% of the time idling, 1-% of the time computing and, of that time, about 78% waiting for the database to return something. Using native libraries would yield a minuscule improvement at the cost of a huge headache.
The worst two things that afflict programmers today are:
1. They never update their information, even after 18 years (18! You realize that right!? Things change man!)
2. They have an irrational paranoia about trying new things, as if me trying this out is going to destroy the universe.
That really needs to change.
Measure it or STFU.
It doesn't necessarily depend on dial-up, either. Imagine the number of people who leave bittorrent open in the background, stream porn, or whatever else that leaves their individual HTTP connections slow. Hell, latency alone (it takes at least a second for my connection to reach the east coast of the USA) would have an effect, and you can't underestimate the increasing number of mobile devices on slow(-ish, depending on congestion) 3G networks.
I'd provide statistics from my server (I serve an NZ gaming community), but I suspect my numbers would be disproportionate compared to the average workload. Here in NZ, we have far more people on crappy pipes (our DSL network is, famously, a gigantic pile of shit - although that has improved over the past couple of years and continues to), and far fewer people on smartphones (iPhones cost ~$800USD here).
Still, I believe the commenter has a point which you shouldn't ignore, or at least shouldn't pass off so easily :). I'd love to do some testing myself, but unfortunately between working a day-job, and spending my evenings trying to get a startup off the ground, I've got no time spare.
Measure it or GTFO.
It helps if you're going to comment that you actually read the words I use, not the ones you have in your head that make you sound like you're super smart.
Because unlike you, I actually go do shit rather than spout off in a comment thread.
It looks like Jacques has a pretty good start at making these measurements: http://news.ycombinator.com/item?id=1573145. If the numbers he provides aren't helpful, or aren't complete, you might try encouraging him to fix them. Calling him a "FUD slinging troll" seems more likely to cause him to tune out and ignore you, to the detriment of us all.
Realize that you've been thinking about this problem for a while now, while others have just started their thought process. Your goal is to get them up to speed so they can move your argument further, but this won't be instantaneous. Treating them as potential allies during this formative stage might pay off. If you can hold off with the insults for a couple hours or even days, you might get better results than if you shout them down immediately. :)
> Very first thing you did was immediately reply to every branch of the comments with your agenda.
You're suffering from paranoia. If you post an article about security, you can bet tptacek is all over the comments, informing and correcting people. In this case, the article was about something jacquesm happens to know a bit about, so he participates actively in the comments. To suggest he is pushing an 'agenda' is ridiculous: there's nothing at stake for him. The only thing he tries to do is help you, by noting that he thinks you have overlooked something.
> I actually have no idea what your problem is [..]
That's because he doesn't have a problem: it's your mind that's filling in the blanks. It suggests that while writing the article, you were already sure people would challenge you based on 'religious conviction' instead of on fact. jacquesm's point was a simple, critical question: what are actual real-world ATRs for the servers that Mongrel2 should be able to replace?
Allow me to make an observation of a psychological nature: you are thoroughly miffed that it was so easy for someone to provide possibly devastating criticism to an idea about which you started caring WAY too much. What you should realize is nobody thinks lesser of you because of that criticism: the article is still interesting and provides a sound basis. There is no reason to react in such an aggressive way; it's even counterproductive.
So far all you've got is trolling HN comments. YOU WIN!
If you have tested this on real live servers then there is no evidence of that in your posting, and to suggest that this:
is anything but a localhost test is simply bogus.
The only use case where you may be right that poll is advantageous as far as I can see is streaming media servers (video, audio, other large files), image servers are the ones with the worst active-to-total ratios, especially if the images are small. I should know, I only serve up a few billion of them every day. A few years ago or so I was stupid enough to think that video was hard, man was I ever wrong. Repeated connections to the same host, that's a much bigger killer than pumping bits.
But what's really amazing is this is the test the proponents of epoll have been using for 8 years. Where was your objection back when they were using it for that?
I'll trust that you accurately measured the ATR boundary between poll and epoll in your specific synthetic benchmark. That's then completely undermined by your handwaving in this thread about what ATR looks like in the wild, and the lack of any way for us to relate your microbenchmark with the real world.
But hey, you can live in your own little fantasy world where you think you've won some kind of battle of the HN because you listened to some epoll fanatic weirdo and cheered him on.
> Creating a million tcp connections from one host is non-trivial.
The key words being "from one host". With a single client machine connecting to a single server endpoint, the (src ip, src port, dest ip, dest port) is reduced to being unique only on src port (from the client's perspective), so that's where the 65k limit, and the need for more IPs to do that, comes from. Using multiple source IPs on the same machine is like using multiple client hosts.
> ...using IPV4 there is a hard limit of the short integer used to indicate the port number which automatically limits you to 65536 connections (actually a few less, usually you'll lose 3 for stdin,stdout and stderr (which you can close to reuse them) and one for the listen socket).
The file descriptor limit is independent of the 65k total possible source ports. The source port limit is part of TCP/UDP. The file descriptor limit is set by ulimit (nofile in limits.conf) on a per-process basis and in /proc system-wide. If you need more file descriptors, you can reuse 0, 1 and 2, but that's not going to free up any ports so a single process can make more connections to the same server endpoint.
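The two limits can be looked at separately. A sketch, assuming a Unix-like system: `resource.getrlimit` reports the per-process fd limit that ulimit controls, while the helper `max_connections_to_one_endpoint` (a made-up name) is just the port arithmetic for one client talking to a single server ip:port. The 65535 figure is the theoretical ceiling of a 16-bit port with port 0 excluded; the ephemeral range actually configured on a machine is usually smaller.

```python
import resource


def fd_limits():
    """Per-process open-file limits as (soft, hard), as set via ulimit."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def max_connections_to_one_endpoint(num_source_ips, ports_per_ip=65535):
    """Upper bound on distinct (src ip, src port) pairs toward one
    fixed (dest ip, dest port). Independent of the fd limit above."""
    return num_source_ips * ports_per_ip


soft_limit, hard_limit = fd_limits()
one_ip = max_connections_to_one_endpoint(1)
four_ips = max_connections_to_one_endpoint(4)
```

On the server side the tuple also varies by client address, so the source port ceiling does not cap how many connections a listening socket can accept.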
edit: You may be limited by number of open sockets or file handles. It's likely a per-process limit. Google or some linux guru could help you track down what limit it actually is, but it's not the number of server ports available. It might be a number you could raise.
I should go and do some testing to see what's causing this, you make me feel like the solution is right around the corner.
re. your edit, ulimit will happily raise the number > 64K, all the /proc/* settings seem to be ok so that's not it, it has to be some other layer in the stack that causes this. I'll definitely spend some time on this, it's been bugging me for a long time.
edit2: there seems to be a max_user_watches upper limit to what epoll will handle.
But you guessed wildly about what ATRs people see in the real world: http://news.ycombinator.com/item?id=1572292 http://news.ycombinator.com/item?id=1572418
Incidentally, this is the same test everyone else uses, so if you thought it was bullshit why did you support it when people testing epoll with it were using it? Oh, because they used it to confirm your bias rather than disagree with it.
a) Your controls were right, and
b) What the most optimal decision is in light of this new information.
(a) is well known to be one of the most difficult parts of scientific reasoning and is almost always open to endless debate and improvement. In short, it's the question of whether ATR is a human-sensible metric. (b) however has an interesting direct answer: figure out the distribution of "live" ATRs on an interesting population of real servers and then, to borrow Eliezer's phrase, shut up and multiply. If a lot of servers that you're targeting with M2 fall across that 60% divide (under circumstances similar to your controlled microbenchmark) then of course Superpoll is a good compromise.
Jacques is arguing a combination of (a) and (b). Perhaps ATR is not a sufficient metric to understand all interesting server loads. Moreover, perhaps many interesting servers live at really low or high ATRs all the time and so Superpoll must gracefully degrade to either poll or epoll.
In any case, driving for empirical data is noble, but possessing data is never sufficient to whitewall all detractors. It's really nice to have strong empirical support for the breakeven point between the two (ie. the ratio of their constant time components) via your benchmark, but science isn't just statistics.
(edit: I'll also add that pushing the pipetest microbenchmark past where people are usually making hyperbolic claims is a pretty big deal and a good catch.)
One problem with getting such numbers is that in the real world, the load is given, not the ATR, and the ATR for a given load may depend on implementation choices. Total connections is a function of server latency, and on a loaded system, latency can be a function of polling method. So switching methods may change your ATR.
On the other hand maybe ATR for a given load doesn't change significantly depending on implementation. Your test found what is best for a given ATR, but not necessarily for a given load, or how ATR depends on load for a given implementation. Depending on the results, you may want to add some hysteresis to superpoll.
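To illustrate what hysteresis could look like here, a minimal sketch follows. This is not how superpoll is implemented; the class name and the 0.5/0.7 thresholds are assumptions, picked only to straddle the roughly 0.6 break-even reported in the article.

```python
class HysteresisSelector:
    """Choose between 'poll' and 'epoll' from the active/total ratio (ATR),
    with two thresholds so the choice does not flap near the break-even."""

    def __init__(self, low=0.5, high=0.7, initial="epoll"):
        # low/high are hypothetical values around the ~0.6 break-even.
        if not 0.0 <= low < high <= 1.0:
            raise ValueError("need 0 <= low < high <= 1")
        if initial not in ("poll", "epoll"):
            raise ValueError("initial must be 'poll' or 'epoll'")
        self.low = low
        self.high = high
        self.mode = initial

    def update(self, active, total):
        """Feed one ATR sample and return the mode to use next."""
        atr = active / total if total else 0.0
        if self.mode == "epoll" and atr > self.high:
            self.mode = "poll"    # mostly active: scanning every fd is fine
        elif self.mode == "poll" and atr < self.low:
            self.mode = "epoll"   # mostly idle: only pay for active fds
        return self.mode


selector = HysteresisSelector()
history = [selector.update(active, 100) for active in (60, 75, 60, 40)]
```

Between the two thresholds the selector keeps whatever it chose last, so if switching methods itself shifts the ATR a little, the choice does not oscillate.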
2) The experience from 1992 still seems current. Adding complexity to any software project adds cost to maintain it in the future. My experience in 1992 showed how added complexity for a marginal performance gain did not pay off then and still won't pay off today (unless you are programming a computer so expensive even a marginal increase in performance means lots of money).
3) It's your project and you may do with it whatever pleases you. What I wrote was intended as friendly advice from someone who has been in this business for a long time. You are, of course, free not to accept the advice.
4) I encourage you to try new things and I am usually the first to propose workload-adaptable solutions. I, however, had my share of extremely clever optimizations that bit me back later when things as subtle as processor caches changed and it's not very funny (albeit it is fun to dig deep in the system to find out why X runs 33% slower on the 50% faster box). Nowadays, I consider every program line not written a line gained.
With all due respect, this seems a ridiculous question. Do you obtain written statements to the above effect from all maintainers of software (open source or otherwise) before using it? Yes, it'll have to be maintained, but Zed's article alone is more documentation than you could ever hope for from most programmers. It's clear he isn't "most programmers" but that's no reason to hold him to ridiculous standards.
Again, it's his project and he's free to do whatever he feels like with it. Heck... I am not even a user. I offered him advice he is free to disregard. The fact my cleverness has been biting me ever since my 6502 days (quite likely before Zed was born) is a problem I sort of learned to deal with long ago - but it still comes to bite me from time to time.
Your hypothesis is that web servers have the majority of their fds active most of the time, and that's where the problem lies. I've put up the numbers elsewhere in this thread, feel free to do some measurement on your own high traffic websites to come up with more data points.
Aha! Totally wrong. My hypothesis has not been that at all. You totally didn't even understand the hypothesis, and I stated it very clearly. My hypothesis has always been:
epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.
Nowhere in there do I say that all web servers have the majority of their FDs active. NOWHERE. I say some might, I say who knows, I say we need to go measure, but nowhere do I say anything like what you say.
My hypothesis has always been:
epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is
faster than epoll when the active/total ratio is > 0.6.
That wasn't your hypothesis: that was your intermediate conclusion after the first tests. Your hypothesis was: the common knowledge that epoll yields better performance, and that I should obviously use epoll for Mongrel2, is wrong. [1] It's obvious he didn't mean 'hypothesis', but 'implicit assumption'. If you implement superpoll, you implicitly assume it will be useful. It will only be useful if actually deployed Mongrel2 servers will have an ATR > 0.6 at least some of the time.
[1] You can replace 'is' by 'may be', if you feel the strong version puts words in your mouth. It doesn't, because the hypothesis for an experiment may also be "The half life of protons is shorter than a trillion years", when I expect to reject that hypothesis. It's not an assertion of your opinion, but a statement of a fact you intend to accept or reject based on the outcome of the experiment.
Seriously, I bet the operating system you're using relies on far more of that evil "cleverness" than the subject of this thread.
This whole thread reeks of thinly veiled ad hominem attacks to me.
What I said, and repeat, is that if you want to introduce an expensive-to-maintain piece of code, you have to weigh the added cost against the performance gain. In this case, the performance gain seems marginal, the assumption of usage envisioned seems wrong and the added complexity seems just pointless.
It's his project, his code and I am not even a Mongrel user. I offer this as a friendly piece of advice, from old programmer to young programmer.
You know: it doesn't matter if you are beam-racing a 6502 or writing networking code to run on 64-bit deeply pipelined processors, there are things that remain true. This is one of them.
"That's just plain wrong. Premature optimisation does not refer to having to measure before you optimise, it refers to optimising things that in practice may have little or no effect on the actual performance of the program."
No, that's just plain wrong. Premature optimisation is actually implementing something convoluted thinking it's optimized without knowing whether it actually is or not. It's voodoo cargo cult science. It's going against Occam's razor.
There's nothing in Mongrel2 that's prematurely optimized. It's all very simple algorithms chosen for the right task, and later on I'll be testing them to see if they're still right. So your claim that this is premature optimization is just a buzzword and completely offensive. I took a long time to actually test my ideas before implementing them.
That's the total inverse of premature optimization.
What is total voodoo junk science is most of what you say. So far I haven't seen one set of data or any scientific experimental design or even a single testable hypothesis backing what you claim. It's all just rhetoric.
Until you've got hard numbers backing what you say, everything you're saying is inferior to what I'm doing: science.
After all, that is where the rubber meets the road and it would be a very easy way to determine if your hunch is right or not.
Epoll was specifically created with that sort of workload in mind. Your 'surprising' conclusion is not rooted in epoll somehow behaving contrary to expectation; in fact, it behaves exactly as it was designed to.
Benchmarking it like this is nothing like the real world, and that's where epoll shines, not when you test it the way you just did.
As for the numbers, we're serving about 10Gbps continuously using a combination of varnish and java code to several million uniques daily, html, images, video. Poll over epoll is a run race, as far as I'm concerned you're wasting your time with this.
But by all means, ignore all this and do what you have to, those are the lessons learned best anyway, and it's your time, not mine.
If you feel like getting another view on this I'd suggest to contact the author of Varnish, he really knows his stuff and he might be able to convince you where I can not.
If you're going to complain about science, at least understand how it works.
I sort of agree with him, except with an important detail: this question is basically about prioritization of your time, and I'd say that this is nobody's business but yours! You can optimize memcpy() all day, for all I care. ;-)
There's one aspect where you'd be quite right to do some investigation before implementing: if the outcome changes the interface you'd need to implement.
For example, here's a hypothesis: using epoll's edge-triggered mode could drastically reduce the number of events (since you only get an event when an fd becomes readable/writable the first time, instead of every time it's in that state). Since epoll is O(N) on the number of events returned (not on the number of fds that are currently readable/writable), you'd lower the effective ATR a whole lot. In fact, a really busy server would have fewer events, since a readable fd would stay that way for longer if data is received at a great rate (the write-side story might be less brilliant). You'd also have to make far fewer calls to epoll_ctl, since you could just stop caring about the reading side while you're trying to write the last batch of data on the other side (no need to remove it from the interest set, you won't get events for it). You only need to set it when flipping from read to write, and the other way (after receiving/sending headers and bodies).
Now, if that's true, that's a big deal, because now you have to change your design a fair bit. You have to remember that an fd is readable until you get EAGAIN from read() yourself, so there's some more state management, moving that fd from one list to another, etc. Finding out that this would be a million times faster (or slower!) now would save you a ton of work, either way.
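Here is a small Linux-only sketch of that edge-triggered behaviour, using Python's `select.epoll` on a pipe rather than a socket. The `drain` and `demo` names are made up for illustration; the point is only that an unread fd produces one event, and that the application has to read until EAGAIN itself.

```python
import os
import select


def drain(fd, chunk=4096):
    """Read until the kernel reports EAGAIN. Returns (data, eof)."""
    parts = []
    while True:
        try:
            data = os.read(fd, chunk)
        except BlockingIOError:        # EAGAIN: nothing left for now
            return b"".join(parts), False
        if not data:                   # writer closed its end
            return b"".join(parts), True
        parts.append(data)


def demo():
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    ep = select.epoll()
    try:
        # EPOLLET asks for edge-triggered notification.
        ep.register(read_fd, select.EPOLLIN | select.EPOLLET)
        os.write(write_fd, b"hello")
        first = ep.poll(0)             # event for the new data
        second = ep.poll(0)            # still readable, but no new event
        data, eof = drain(read_fd)     # our job to remember and drain
        os.write(write_fd, b"again")
        third = ep.poll(0)             # fresh data, fresh event
        return first, second, data, eof, third
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)


first, second, data, eof, third = demo()
```

In level-triggered mode the second `poll` would report the fd again; here it stays silent, which is the extra state management described above.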
But finding whether poll or epoll is faster, or a hybrid solution with the same interface? Meh, it could wait.
(about my hypothesis, that's actually what Davide Libenzi designed epoll for, which might explain some of the weirder bits)
That's half true - it doesn't hold for low ATR traffic (lots of hanging connections, clients that GET something, spend time elsewhere while in the meantime the browser keeps the connection alive). In short, there's nothing typical about it because, while those two kinds of loads have been studied extensively in both bibliography and practice, their combination and the practical consequences are not well understood, afaik. Links to relevant studies are more than welcome, of course.
Which, in reality is, "I'll spend a lot of design and implementation effort designing a new one which may or may not improve the measurable, global performance of my new web server because it's not yet at the point where I can benchmark these sorts of things to verify that I'm not wasting a whole ton of effort that could be better spent by deciding that epoll is fast enough."
Maybe Zed knows from his previous server experience that {e}poll is where he hits a bottleneck; it's just that if there's any chance that it's not, he could be wasting a bunch of time implementing "superpoll".
(Or maybe he just wants to do it because it's neat, or because it's innovative (which it is), or for any number of other reasons. I'm just pointing out that he's doing much more than picking "the best one for the environment")
What you really should be getting from it though is that epoll is not faster. It is not O(1). It is not faster on smaller vs. larger lists of FDs. Pretty much all the things you were told as advantages of epoll are total crap.
The only advantage of epoll is it's O(N=active) when poll is O(N=total). That's it.
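As a back-of-the-envelope illustration of the O(N=active) versus O(N=total) claim, here is a toy cost model. It assumes both costs are purely linear with no fixed overhead, and the per-fd constants (3.0 and 5.0) are invented so that the break-even lands at the 0.6 ATR quoted in this thread; they are not measurements.

```python
def poll_cost(total, per_fd):
    """poll scans every registered fd on each call: O(total)."""
    return per_fd * total


def epoll_cost(active, per_event):
    """epoll only pays for fds that have events: O(active)."""
    return per_event * active


def breakeven_atr(poll_per_fd, epoll_per_event):
    """ATR at which per_fd * total == per_event * active."""
    return poll_per_fd / epoll_per_event


def faster(active, total, poll_per_fd, epoll_per_event):
    """Name the cheaper call under the linear model."""
    if epoll_cost(active, epoll_per_event) < poll_cost(total, poll_per_fd):
        return "epoll"
    return "poll"


# Hypothetical constants: epoll costs more per reported event than poll
# costs per scanned fd, which is what puts the crossover below ATR = 1.
POLL_PER_FD = 3.0
EPOLL_PER_EVENT = 5.0

crossover = breakeven_atr(POLL_PER_FD, EPOLL_PER_EVENT)
at_half = faster(50, 100, POLL_PER_FD, EPOLL_PER_EVENT)
at_seventy = faster(70, 100, POLL_PER_FD, EPOLL_PER_EVENT)
```

Under this model the crossover is just the ratio of the two constants, so the real question is what those constants are on a given kernel and what ATR a given server actually sees.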
So at a minimum I've done some education and spent some time learning something.
Zed is clearly out to change the world and I would very much like him to succeed but he just seems to be missing the obvious here, which is that idle connections are the ones to worry about (because they're very expensive!) so his benchmark at this point in time is useless.
But hey, if I don't make it back alive from my completely dangerous experiment in disproving that epoll is always the way to go, you can come get me. Bring a big gun because this stuff is so scawy and howwible I might not make it out alive.
STOP
For most things, it doesn't matter. A filesystem is a filesystem.
You only need to make decisions like that when you properly measure and decide that X may be a bottleneck, or you need features that Y has but X doesn't.
Premature optimization is a sin, but making design decisions you know from experience have a major impact is not, as long as you measure to confirm afterwards.
We don't design by throwing dice - most decisions we make are based on experience or assumptions about what works and what doesn't. Measurements are important to challenge those assumptions, but it doesn't take away the value of making use of experience to create a reasonable baseline.
Where "premature optimization" comes in is where you start expending unnecessary extra effort to implement a more complicated solution without evidence to back up the need for it.
Spending a little bit of time to think through the requirements for major aspects of your system is not extra effort.
I just wanted to say that it is not an unquestionable design decision.
Rock on with the superpoll, I hope it's awesome and very successful.
- underwater ajax requests
- regular website content (images, dynamic html, css and other relatively small (say < 250K) files)
- media servers (filedumps, video servers, streaming audio servers)
Each of those requires fairly specific tuning of the TCP stack to get the most out of it, so you're not likely going to find all of these on one and the same machine unless it is a small operation (and in that case this whole discussion is moot).
A benchmark done in isolation is meaningless because in the end, real world traffic is what it is all about. So, I personally don't care whose site(s) you test with, as long as there are enough of them to get a statistically valid result.
Google's or Yahoo's would be fine with me, I've given my results above, if I have the time I'll do the same thing on a couple of other high volume sites.
I've (unfortunately) studied this problem quite a bit because of the size of the websites that I'm involved with, and so far I've learned that you can play around on your testbench all day long, but it doesn't matter one little bit for production purposes unless you are very careful (such as in that other test linked from this page) to simulate users' clients.
You could do a lot worse than to play back a log file in order to make an experiment repeatable. I assume that real world performance is what Zed is after, not theoretical performance.
I mean, are you sure you didn't use to work for Microsoft and then got hired by Linus to work your FUD spreading magic?
Because "real world traffic" is a bullshit test [..]
The hell it is. A statistically significant sample of 'real world ...' is the foundation for most engineering decisions. When you build a bridge, you take the actual loads it has to support into account. Intel bases their chipbaking on the actual purity of the silicon their suppliers can provide.