>There are basically two types of polling on the block side. One takes care of interrupt mitigation, so that we can reduce the IRQ load in high IOPS scenarios. That is governed by block/blk-iopoll.c, and is very much like NAPI on the networking side, we've had that since 2009 roughly. It still relies on an initial IRQ trigger, and from there we poll for more completions, and finally re-enable interrupts once we think it's sane to do so. This is driver managed, and opt-in.
>The new block poll support is a bit different. We don't rely on an initial IRQ to trigger, since we never put the application to sleep. We can poll for a specific piece of IO, not just for "any IO". It's all about reducing latencies, as opposed to just reducing the overhead of an IRQ storm.
To answer your second question, this doesn't have anything to do with disk changes/inotify/etc that a program would use. My understanding is that currently *many* IO devices respond to the kernel's request for data by triggering an interrupt, which the kernel then takes some time to service before it reads the data. Interrupt handling can be a bit slow, which adds latency while waiting for the disk to respond. The new system, rather than waiting for an interrupt, continuously checks the driver for new data, and because it doesn't rely on an interrupt it can achieve far better latency. Lower latency means more operations per second.
With spinning disks, which can only do <200 operations per second with latencies around 5ms, this won't have much of an effect. But with SSDs, which can do >2000 operations per second with latencies around 0.5ms, trimming 0.1ms off each operation (made that number up) by polling rather than waiting on an interrupt can mean about 25% more operations per second.
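To make that arithmetic concrete, here is a rough sketch in Python. It assumes only one request is in flight at a time (so ops/s is simply 1 / latency) and reuses the made-up 0.1ms saving; the function names are mine and purely illustrative.

```python
def ops_per_second(latency_ms: float) -> float:
    """Ops/s when only one request is outstanding at a time."""
    return 1000.0 / latency_ms


def throughput_gain(latency_ms: float, saved_ms: float) -> float:
    """Fractional increase in ops/s if every operation gets saved_ms faster."""
    before = ops_per_second(latency_ms)
    after = ops_per_second(latency_ms - saved_ms)
    return after / before - 1.0


# Latencies from the comment above; the 0.1 ms saving is a made-up number.
ssd_gain = throughput_gain(latency_ms=0.5, saved_ms=0.1)
hdd_gain = throughput_gain(latency_ms=5.0, saved_ms=0.1)

print(f"SSD: {ssd_gain:.1%} more ops/s")
print(f"HDD: {hdd_gain:.1%} more ops/s")
```

Under those assumptions the SSD gains roughly a quarter more operations per second while the spinning disk gains about 2%, because the same fixed saving is a much bigger share of a short latency than of a long one.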
EDIT: Thinking a bit more about it, interrupts were introduced when CPUs were much slower than they are now. So the tradeoff, I think, isn't that bad.
This change means that when the I/O load is high there is no longer one CPU interrupt per I/O operation. Instead, multiple operations are processed at once, so the CPU has more time to run user space applications that actually do something with all that data.
The original behavior was that the OS/hardware would tell your program when it had data ready (hardware limited to hundreds of times/sec). This was changed to your program showing up occasionally with a large truck to load data (basically CPU limited).
The old way was perfectly fine for mechanical hard disks, but with SSDs they were running into limitations with how often you can process interrupts (think of them as a hardware level context switch, very expensive).
In short: I don't think this has anything to do with programs at all. This looks like it has nothing to do with userspace and I believe it's just the locking mechanism on the device.
The performance improvements for fast devices cited in a link in the article [1] are pretty dramatic, but I wonder how slow the device needs to be before polling becomes a problem. That same link mentions that slow devices benefit too, but speculates that this may be due to the CPU not being able to go into a deeper sleep state.
In the worst case, you might spend more time and energy switching between power states than you actually spend in a lower power state.
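A back-of-the-envelope version of that worst case, as a Python sketch. Every number here (power draws, transition time) is hypothetical and only there to show the shape of the tradeoff; the model just compares staying awake through an idle gap against paying a fixed transition cost to sleep through it.

```python
def sleep_saving_nj(idle_us: float, transition_us: float,
                    active_mw: float, sleep_mw: float,
                    transition_mw: float) -> float:
    """Energy saved (nanojoules, since mW * us = nJ) by sleeping through an
    idle gap instead of staying awake. Negative means sleeping cost more."""
    stay_awake = active_mw * idle_us
    # Time actually spent in the low power state once the transition is paid.
    asleep_us = max(idle_us - transition_us, 0.0)
    go_to_sleep = transition_mw * transition_us + sleep_mw * asleep_us
    return stay_awake - go_to_sleep


# Hypothetical CPU: 1000 mW awake, 100 mW asleep, and a combined
# enter + exit transition that takes 50 us at 1500 mW.
cpu = dict(transition_us=50.0, active_mw=1000.0,
           sleep_mw=100.0, transition_mw=1500.0)

short_gap = sleep_saving_nj(idle_us=20.0, **cpu)    # waiting on a fast SSD
long_gap = sleep_saving_nj(idle_us=500.0, **cpu)    # waiting on a slower device

print(f"20 us idle gap:  {short_gap:+.0f} nJ")
print(f"500 us idle gap: {long_gap:+.0f} nJ")
```

With these invented numbers the short gap comes out negative (the transition costs more than it saves) and the long gap comes out positive, which is the break-even effect being described. Real transition times and power draws vary a lot by CPU and sleep state.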
But indeed, it does seem counter-intuitive even with that, as there are often power mode changes available that would pay off in just a few microseconds. It sounds to me like x86 may suffer from a limited IRQ system - there are other systems out there in which IRQ overhead is < 10 cycles.