Linux Pipes Are Slow (qsantos.fr)
The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.
Would love to find collaborators for this one :)
Is there a plan for a standardized way to signal to the other end of the pipe that ring buffers are supported, so this could be handled transparently in libc? If not, I don't really see what advantage it gets you over shared memory plus a futex for synchronization, at least for pipes.
It is different from a pipe - instead of using read/write to copy data from/to a kernel buffer, it gives user space a mapped buffer object and they need to take care to use it properly (using atomic operations on the head/tail and such).
If you own the code for the reader and writer, it's like using shared memory for a buffer. The proposal is about standardizing an interface.
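To make the head/tail bookkeeping concrete, here is a minimal single-process sketch of such a ring buffer. The class name `RingBuffer` and its methods are hypothetical and are not the proposed kernel interface. A real shared-memory version would map the buffer into both processes and update the counters with atomic operations and memory barriers, which plain Python does not provide.

```python
class RingBuffer:
    """Byte ring with free-running head/tail counters (illustrative only)."""

    def __init__(self, capacity):
        # A power-of-two capacity lets us wrap indices with a bit mask.
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.buf = bytearray(capacity)
        self.mask = capacity - 1
        self.head = 0  # total bytes consumed; only the reader advances it
        self.tail = 0  # total bytes produced; only the writer advances it

    def write(self, data):
        """Copy as much of data as fits; return the number of bytes stored."""
        free = len(self.buf) - (self.tail - self.head)
        n = min(free, len(data))
        for i in range(n):
            self.buf[(self.tail + i) & self.mask] = data[i]
        # Publish the new tail only after the payload is in place.
        self.tail += n
        return n

    def read(self, max_bytes):
        """Remove and return up to max_bytes bytes."""
        available = self.tail - self.head
        n = min(available, max_bytes)
        out = bytes(self.buf[(self.head + i) & self.mask] for i in range(n))
        # Release the space only after the payload has been copied out.
        self.head += n
        return out
```

Because each counter has exactly one writer, a shared-memory version of this layout needs no lock, only correctly ordered loads and stores on `head` and `tail`.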
This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...
> The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.
These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...
1. would know the above
2. would choose such an obnoxious throwaway handle
[1] https://www.intel.com/content/dam/www/central-libraries/us/e...
[2] https://www.intel.com/content/www/us/en/developer/articles/t...
Specifically, there aren't many reasons for your fastest IPC to be slower than a long function call.
Saying "long function call" doesn't mean much since a function can take infinitely long.
A possible answer that's currently just below your comment: https://news.ycombinator.com/item?id=41351870
> vmsplice doesn't work with every type of file descriptor.
Looks like an amazing article, and so much to learn on what happens under the hood
I don't mean this as a slight to anyone, I just want to point out the HN "hug of death" can be trivially handled by a single cheap VPS without even breaking a sweat.
Anyway, nice article, it's good to know what's going on under the hood.
https://cygwin.com/pipermail/cygwin-patches/2016q1/008301.ht...
But still, kudos to the Cygwin developers for creating Cygwin :) Great work, even though it has some issues.
In my experience in data engineering, it’s very unlikely you can exceed 500mb/s throughput of your business logic as most libraries you’re using are not optimized to that degree (SIMD etc.). That being said I think it’s a good technique to try out.
I’m trying to think of other applications this could be useful for. Maybe video workflows?
The jump seems to be generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However, in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.
[1]: https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
I think you need to recompile your compiler, or disable those explicitly via link / cc flags. It is fairly hard to coax compilers into emitting SIMD instructions, or to dissuade them from doing so, IMHO.
For the data transfer rate it doesn't matter how (using which language) the pipe is established; C and Rust and the like will have a (small) edge up in the start-up time (latency) though.
https://linux.die.net/man/1/pv
it is in the pipe command `... | pv > /dev/null`
% pv </dev/zero >/dev/null
54.0GiB/s
% pv </dev/zero --discard
58.7GiB/s
The only time I've used them is external constraints. They are just not useful.
Because of that, it is economical to spend lots of time optimizing it, even if it only makes the code marginally more efficient.
Pipes aren't used everywhere in production in hot paths. That just doesn't happen.
If 100 million people each save 1 cent because of your work, you saved 1 million in total, but in practice nobody is observably better off.
I've used pipes for a lot of stuff over 10+ years, and never noticed being limited by the speed of the pipe, I'm almost certain to be limited by tar, gzip, find, grep, nc ... (even though these also tend to be pretty fast for what they do).
1. Logging. At first our tools for reading the logs from a filesystem management program were using pipes, but they would be overwhelmed quickly (even before it would overwhelm pagers and further down the line). We had to write our own pager and give up on using pipes.
2. Storage again, but a different problem: we had a setup where we deployed SPDK to manage the iSCSI frontend duties, and our component to manage the actual storage process. It was very important that the communication between these two components be as fast and as memory-efficient as possible. The slowness of pipes comes also from the fact that they have to copy memory. We had to extend SPDK to make it communicate with our component through shared memory instead.
So, yeah, pipes are unlikely to be the bottleneck of many applications, but definitely not all.
Let's not get carried away. You can use ffmpeg as a library and encode buffers in a few dozen lines of C++.
It's clumsier, to be sure, but if performance is your goal, the socket should be faster.
Donald Knuth thinks the same: https://en.wikipedia.org/wiki/Program_optimization#When_to_o...
https://www.toyota.com/grcorolla/
(These machines have amazing engineering and performance, and their entire existence is a hack to work around rules making it unviable to bring the intended GR Yaris to the US market. Maybe just enough eng/perf/hack/market relevance to HN folk to warrant my lighthearted reply. Also, the company president is still on the tools.)
Suppose you're looping over the lines of stdout and need to use sed, cut and so on: using pipes will slow things down considerably (and the startup time of sed and cut will make things worse).
Using bash/zsh string interpolation would be much faster.
Also, why leave performance on the table by default? Just because “it should be enough for most people I can think of”?
Add Tesla motors to a Toyota Corolla and now you’ve got a sportier car by default.
it's not optimizing footprint or speed of application. it's optimizing the resources and speed of development and deployment
All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...
And they are not final, i.e. Noah Goldstein still updates them every year.
https://github.com/llvm/llvm-project/blob/main/libc/src/stri...
On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.
However there is a range of multi-megabyte lengths, which correspond roughly with sizes below the L3 cache but exceeding the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.
At lengths exceeding the L3 size, "rep movsb" becomes again the fastest copy method.
The Intel CPUs have different behaviors.
On my Zen 3 CPU, for lengths of 2 kB or smaller it is possible to copy faster than with "rep movsb", but by using SIMD instructions (or equivalently the builtin "memcpy" provided by most C compilers), not with a C loop (unless the compiler recognizes the C loop and replaces it with the builtin memcpy, which is what some compilers will do at high optimization levels).
vmsplice doesn't work with every type of file descriptor. Eschewing some technology entirely because it seems archaic or because it makes writing "the fastest X software" seem harder is just sloppy engineering.
> they are just not useful.
Then you have not written enough software yet to discover how they are useful.
Nothing ever touches those pages on the consumer side and they can be reused immediately.
If you actually want a functional program using vmsplice, with a real consumer, things get hairy very quickly.
Sure, you could build that box with glue and clamps and ample time, sure it would look neater and weigh less than the version that's currently holding you imprisoned and if done right it will even be stronger but it takes more time and effort as well as those glue clamps and other specialised tools to create perfect matching surfaces while the builder just wielded that hammer and those nails and now is building yet another utilitarian piece of work with the same hammer and nails.
Sometimes all you need is a hammer and some nails. Or pipes.
It's incredibly valuable on the day to day.
If you dislike their (relative) slowness, it's open source, you can participate in making them faster.
And I'm sure that after this HN post we'll see some patches and merge requests.
Personally I think there's much worse ugliness in POSIX than pipes. For example, I've just spent the last couple of days debugging a number of bugs in a shell's job control code (`fg`, `bg`, `jobs`, etc).
But despite its warts, I'm still grateful we have something like POSIX to build against.
In fact if you ever set O_NONBLOCK on a pipe you need to be damn sure both the reader and writer expect non-blocking i/o because you'll get heisenbugs under heavy i/o when either the reader/writer outpace each other and one expects blocking i/o. When's the last time you checked the error code of `printf` and put it in a retry loop?
Not sure what printf has to do with it: it isn't designed to be used with a non-blocking writer (but that only concerns one side). How will the reader being non-blocking change the semantics of the writer? It doesn't.
You can't set O_NONBLOCK on a pipe fd you expect to use with stdio, but that isn't unique to pipes. Whether the reader is O_NONBLOCK will not affect you if you're pushing the writer with printf/stdio.
(This is also a reason why I balk a bit when people refer to O_NONBLOCK as "async IO", it isn't the same and leads to this confusion)
That said, most Aussie kernel devs I know are IBM PPC Ozlabs folks that are actually really nice!
But then the solutions are not comparable anymore, are they? Would a lossless codec instead have improved speed?
To implement job control, there are several signals you need to be aware of:
- SIGTSTP (what the TTY sends if it receives ^Z)
- SIGSTOP (what a shell sends to a process to suspend it)
- SIGCONT (what a shell sends to a process to resume it)
- SIGCHLD (what the shell needs to listen for to see there is a change in state for a child process -- this is also sometimes referred to as SIGCLD)
- SIGTTIN (received if a background process tries to read from the TTY)
- SIGTTOU (received if a background process tries to write to the TTY or set its modes)
Some of these signals are received by the shell, some are by the process. Some are sent from the shell and others from the kernel.
SIGCHLD isn't just raised for when a child process goes into suspend, it can be raised for a few different changes of state. So if you receive SIGCHLD you then need to inspect your children (of course you don't know what child has triggered SIGCHLD because signals don't contain metadata) to see if any of them have changed their state in any way. Which is "fun"....
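As a rough illustration of "inspecting your children", the sketch below forks one child, drives it through stop, continue and kill, and decodes each status that `waitpid` reports. The helper names `describe_status` and `observe_child_states` are made up for this example. A real shell would more likely call `os.waitpid(-1, os.WNOHANG | os.WUNTRACED | os.WCONTINUED)` in a loop from its SIGCHLD handling, whereas this version blocks on a single known PID to keep the outcome deterministic.

```python
import os
import signal


def describe_status(status):
    """Translate a waitpid() status word into a coarse state name."""
    if os.WIFSTOPPED(status):
        return "stopped"
    if os.WIFCONTINUED(status):
        return "continued"
    if os.WIFSIGNALED(status):
        return "killed"
    if os.WIFEXITED(status):
        return "exited"
    return "unknown"


def observe_child_states():
    """Fork a child and record every state change the parent can observe."""
    pid = os.fork()
    if pid == 0:
        # Child: idle until a signal terminates us.
        try:
            while True:
                signal.pause()
        finally:
            os._exit(0)

    states = []
    # Without WUNTRACED/WCONTINUED, waitpid() only reports termination.
    flags = os.WUNTRACED | os.WCONTINUED
    for sig in (signal.SIGSTOP, signal.SIGCONT, signal.SIGKILL):
        os.kill(pid, sig)
        _, status = os.waitpid(pid, flags)
        states.append(describe_status(status))
    return states


if __name__ == "__main__":
    print(observe_child_states())
```

The point is that all three events arrive through the same channel, so the parent has to decode the status word to find out which kind of change happened.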
And all of this only works if you manage to fork your children with special flags to set their PGID (not PID, another meta ID which represents what process group they belong to), and send magic syscalls to keep passing ownership of the TTY (if you don't tell the kernel which process owns the TTY, ie is in the foreground, then either your child process and/or your shell will crash due to permission issues).
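The process-group part of that can be sketched as follows. This is only an illustration under assumptions: the function name `spawn_in_own_process_group` is hypothetical, and the terminal hand-over (`os.tcsetpgrp`) is only mentioned in a comment because it requires a controlling TTY.

```python
import os
import signal


def spawn_in_own_process_group():
    """Fork a child, make it the leader of a new process group, then reap it."""
    pid = os.fork()
    if pid == 0:
        try:
            # Child: become the leader of a new process group.
            os.setpgid(0, 0)
            signal.pause()
        finally:
            os._exit(0)

    # Parent: make the same call, so the group exists no matter which
    # of the two processes gets scheduled first.
    try:
        os.setpgid(pid, pid)
    except OSError:
        pass  # tolerated here; the child makes the same call itself

    pgid = os.getpgid(pid)

    # An interactive shell would now hand the terminal to the job with
    # os.tcsetpgrp(tty_fd, pgid), and take it back once the job stops or exits.

    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return pid, pgid


if __name__ == "__main__":
    child_pid, child_pgid = spawn_in_own_process_group()
    print(child_pid == child_pgid)
```

Calling `setpgid` from both the parent and the child is the usual way to avoid depending on scheduling order.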
None of this is 100% portable (see footnote [1]) and all of this also depends on well behaving applications not catching signals themselves and doing something funky with them.
The bug I've got is that Helix editor is one of those applications doing something non-standard with SIGTSTP and assuming anything that breaks as a result is a parent process which doesn't support job control. Except my shell does support job control and still crashes as a result of Helix's non-standard implementation.
In fairness to Helix, my shell does also implement job control in a non-standard way because I wanted to add some wrappers around signals and TTYs to make the terminal experience a little more comfortable than it is with POSIX-compliant shells like Bash. But because job control (and signals and TTYs in general) are so archaic, the result is that there are always going to be edge case bugs with applications (like Helix) that have implemented things a little differently themselves too.
So they're definitely not easy to use and can break in unexpected ways if even just one application doesn't implement things in expected ways.
[1] By the way, this is all ignoring subtle problems that different implementations of PTYs (eg terminal emulators, terminal multiplexors, etc) and different POSIX kernels can introduce too. And those can be a nightmare to track down and debug!
Linux is optimizing sockets with a similar goal, and it's quite far along in that direction. But there's still some margin to gain.
The tradeoffs you're discussing are considerations. Is it worth making a ubiquitous thing faster at the expense of some complexity? At some point that answer is "yes", but that is absolutely not "When it's easy and has a huge benefit". The most important optimizations you personally benefit from were not easy OR had a huge benefit. They were hard won and generally small, but they compound on other optimizations.
I'll also note that the Knuth quote you reference says exactly this:
> Yet we should not pass up our opportunities in that critical 3%
https://www.highpowermedia.com/Archive/the-surge-tank
https://forums.tdiclub.com/index.php?threads/air-tank-or-com...
It's such an obvious idea that I'm kind of shocked it took them until 2003 to do it. Surely someone thought of this in like the 60s.
I would probably do it differently with a separate supercharger to intermittently maintain another 1-2+ bar of boost to make the tank less than half as large, but that would add complexity, and what do I know.
But for pipes what it means is that if whoever is reading or writing the pipe expects non blocking semantics, the other end needs to agree. And if they don't you'll eventually get an error because the reader or writer outpaced the other, and almost no program handles errors for stdin or stdout.
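For reference, this is the kind of error a non-blocking writer runs into once it outpaces the reader. The sketch below (the function name `fill_nonblocking_pipe` is made up here) puts only the write end in non-blocking mode and writes until the kernel answers with EAGAIN, which Python surfaces as `BlockingIOError`.

```python
import os


def fill_nonblocking_pipe(chunk_size=4096):
    """Write to a non-blocking pipe until it is full; return bytes accepted."""
    read_fd, write_fd = os.pipe()
    try:
        os.set_blocking(write_fd, False)
        written = 0
        try:
            while True:
                written += os.write(write_fd, b"x" * chunk_size)
        except BlockingIOError:
            # EAGAIN: the pipe is full and nobody has drained it.
            pass
        return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe accepted", fill_nonblocking_pipe(), "bytes before EAGAIN")
```

A writer that ignores this error, as most stdio-based code does, loses whatever it was trying to write at that point.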
I just wrote up a test to be sure: in the process with the read side, set it to non-blocking with fcntl(p, F_SETFL, O_NONBLOCK) then go to sleep for a long period. Dump a bunch of data into the writing side with the other process: the write() call blocks once the pipe is full as you would expect.
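A condensed, single-process variant of that check is sketched below. It assumes only standard `fcntl` behaviour, and the function name is hypothetical. Rather than letting a writer block, it inspects the file status flags on both ends after the read end has been made non-blocking.

```python
import fcntl
import os


def nonblocking_read_end_demo():
    """Set O_NONBLOCK on the read end only and report what changed."""
    read_fd, write_fd = os.pipe()
    try:
        flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
        fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

        read_is_nonblocking = bool(
            fcntl.fcntl(read_fd, fcntl.F_GETFL) & os.O_NONBLOCK
        )
        write_is_nonblocking = bool(
            fcntl.fcntl(write_fd, fcntl.F_GETFL) & os.O_NONBLOCK
        )

        # Reading from the empty pipe now fails instead of blocking.
        try:
            os.read(read_fd, 1)
            empty_read_failed = False
        except BlockingIOError:
            empty_read_failed = True

        return read_is_nonblocking, write_is_nonblocking, empty_read_failed
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(nonblocking_read_end_demo())
```

The two ends of a pipe are separate open file descriptions, so the flag set on one end does not appear on the other, which is consistent with the writer still blocking in the test described above.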
CVTs shouldn't even have a concept of "speeds". I absolutely hate how manufacturers will build cars with CVTs and then make them only go into discrete gear ratios. It completely destroys the entire reason for having a CVT.
I understand that they do it because people don't like how CVTs sound/feel, but maybe they should all have 3 modes:
1. Eco - optimizes gear ratio for maximum efficiency
2. Performance - optimizes for maximum power
3. Sport - pretends to be a normal transmission for a better "feel".
If there wasn't a problem to solve they wouldn't have said anything. If you want something different you have to do something different.
https://en.m.wikipedia.org/wiki/Synchronous_grid_of_Continen...
Also, if you micro-optimize and that becomes your whole focus and ability to focus, your business is unable to innovate aka traverse the economic landscape and find new rich gradients and sources of "economic food", making you a dinosaur in a pit, doomed to eternally cannibalize on what other creatures descend into the pit and highly dependent on the pit not closing up for good.
I admit my opinion is not based on first hand knowledge, but I have for years worked on projects trying to address poverty at different parts of this planet and can't think of a single one where this would be even remotely true.
My opinion, however, is based on first-hand knowledge. I've been the kid saving those pennies, and I've worked with those kids. I understand that in the vast majority of cases, an extra penny does nothing more. That isn't what your original comment above claimed, nor is it what you've claimed here. My counterexample is enough to demonstrate the falsehood. Arguing that there are better ways to distribute these pennies is another matter, and I take that seriously as well.
Assuming a wage of $35/hour, each second is worth 1 cent. To save 1 cent you only need to reduce the time spent waiting for computers by a second across the entire lifetime of that person.
Now here is the beauty of this. There isn't just a single guy out there doing this. There are hundreds of thousands of people, possibly millions, doing it.
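The arithmetic behind these figures is easy to check. The helper names below are made up for illustration, and the wage and head count are just the numbers from the comment plus an assumed one million people.

```python
def cents_per_second(hourly_wage_dollars):
    """Value of one second of someone's time, in cents."""
    return hourly_wage_dollars * 100 / 3600


def total_dollars_saved(people, seconds_each, hourly_wage_dollars):
    """Aggregate value of saving seconds_each for every one of people."""
    cents = people * seconds_each * cents_per_second(hourly_wage_dollars)
    return cents / 100


if __name__ == "__main__":
    # $35/hour is a little under 1 cent per second; $36/hour is exactly 1.
    print(round(cents_per_second(35), 3))
    print(cents_per_second(36))
    # One second saved for each of one million people at $35/hour.
    print(round(total_dollars_saved(1_000_000, 1, 35), 2))
```

So "1 cent per second" is a slight round-up at $35/hour, and the aggregate only becomes noticeable once the per-person saving is multiplied across a large population.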
If society was a giant hivemind, then economic viability would take precedence over personal profit. Meanwhile if society is a bunch of isolated individuals, economic viability would take the backseat. So this tells us more about the limits of human psychology than it tells us about economics.
Obviously not if you are doing it for your own fun or just improving the state of the art.
You don't need the effect to be observable on an individual level
It's something that is worth an engineer's time
The average human life expectancy is 77.5 years, or 2.4457e+9 seconds. If you divide that by, say, 1 billion daily active users of Google, you get 2.445, so saving every user about 2.4 seconds adds up to one full lifetime. So if you work at Google, and optimize a slow process, and save every user 1 second, once, you've saved roughly 0.4 of a lifetime. If you're at Microsoft and make boot up take 1 second less across their billion or so devices, same thing.
You are a few logical layers removed, but fundamentally that is at the heart of this. It isn't just about what you think can or can't be leveraged. Reducing waste in a centralized fashion is excellent because it will enable other waste to be reduced in a self-reinforcing cycle as long as experts in their domain keep getting the benefits of other experts. The chip experts make better instructions, so the library experts make better software libs; they add their 2% and now it is more than 4%, so the application experts can have 4% more throughput and buy 4% fewer servers or spend way more than 4% less optimizing or whatever, and add their 2% optimization and now we are at more than 6%, and the end users can do their business slightly better, and so on in a chain that is all of society. Sometimes those gains are muted. Sometimes that speed turns into error checking, power saving, more throughput, and everyone trying to do their best to do more with less.