Linux Pipes Are Slow

340 points by qsantos 1 year ago | 166 comments

One of my sideprojects is intended to address this: https://lwn.net/Articles/976836/

The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.

Would love to find collaborators for this one :)

phafu 1 year ago | |

At least for user space usage, I'm not sure a new kernel thing is needed. Quite a while ago I implemented a user space (single producer / single consumer) ring buffer, which uses an eventfd to mimic pipe behavior and functionality quite closely (i.e. being able to sleep & poll for ring buffer full/empty situations), but otherwise operates lockless and without syscall overhead.
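
For readers who have not seen one, the index arithmetic of such a single-producer / single-consumer ring can be sketched roughly as below. This is only a single-process model: the class name `SpscRing` and its methods are made up for illustration, and a real implementation would place the buffer in shared memory, use atomic loads and stores on the two counters, and add the eventfd wakeups described above.

```python
class SpscRing:
    """Toy single-producer / single-consumer byte ring (hypothetical, not a real API)."""

    def __init__(self, capacity):
        # A power-of-two capacity lets us wrap indices with a bit mask.
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.buf = bytearray(capacity)
        self.mask = capacity - 1
        self.head = 0  # total bytes ever written; only the producer advances it
        self.tail = 0  # total bytes ever read; only the consumer advances it

    def write(self, data):
        """Copy in as much of data as fits; return bytes accepted (0 means full)."""
        free = len(self.buf) - (self.head - self.tail)
        n = min(free, len(data))
        for i in range(n):
            self.buf[(self.head + i) & self.mask] = data[i]
        self.head += n  # publish only after the payload is in place
        return n

    def read(self, max_bytes):
        """Copy out up to max_bytes; return b"" when the ring is empty."""
        n = min(self.head - self.tail, max_bytes)
        out = bytes(self.buf[(self.tail + i) & self.mask] for i in range(n))
        self.tail += n  # hand the space back to the producer
        return out
```

Because each counter has exactly one writer, neither side needs a lock; "full" is `head - tail == capacity` and "empty" is `head == tail`, which are the two situations where a real implementation would go to sleep.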

messe 1 year ago | |

> and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer

Is there a planned standardized way to signal to the other end of the pipe that ring buffers are supported, so this could be handled transparently in libc? If not, I don't really see what advantage it gets you compared to shared memory + a futex for synchronization (for pipes, that is).

immibis 1 year ago | | |

Presumably the same interface still works if the other side is using read/write.

caf 1 year ago | |

Presumably ringbuffer_wait() can also be signalled through making it 'readable' in poll()?

koverstreet 1 year ago | | |

yes, I believe that's already implemented; the more interesting thing I still need to do is make futex() work with the head and tail pointers.

WhyNotHugo 1 year ago | |

I wonder if existing ring buffer interfaces will consider using this or if we'll see an xkcd927 situation. Regardless, this seems like an interesting endeavour.

wakawaka28 1 year ago | |

Buffering is there for a reason and this approach will lead to weird failure modes and fragility in scripts. The core issue is that any stream producer might go slower than any given consumer. Even a momentary hiccup will totally mess up the pipe unless there is adequate buffering, and the amount needed is system-dependent.

hackernudes 1 year ago | | |

I think the OP's proposal has buffering.

It is different from a pipe - instead of using read/write to copy data from/to a kernel buffer, it gives user space a mapped buffer object and they need to take care to use it properly (using atomic operations on the head/tail and such).

If you own the code for the reader and writer, it's like using shared memory for a buffer. The proposal is about standardizing an interface.

Spivak 1 year ago | | |

What makes this any different than other buffer implementations that have a max size? Buffer fills, writes block. What failure mode are you worried about that can't occur with pipes which are also bounded?

foota 1 year ago | | |

Maybe I misunderstand, but if the ring buffer is full isn't it ok for the sender to just block?

fatcunt 1 year ago |

> I do not know why the JMP is not just a RET, however.

This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...

> The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.

These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...

ndesaulniers 1 year ago | |

There are perhaps only a handful of kernel developers that:

1. would know the above

2. would choose such an obnoxious throwaway handle

michaelcampbell 1 year ago | | |

I believe there are a lot more of your 2nd point than you might think.

sk5t 1 year ago | | |

Have you accounted for the population of Australian kernel developers?

qsantos 1 year ago | |

Thanks a lot for the information! I was not quite sure what to look for in this case. I have added a note in the article.

0xbadcafebee 1 year ago |

Calling Linux pipes "slow" is like calling a Toyota Corolla "slow". It's fast enough for all but the most extreme use cases. Are you racing cars? In a sport where speed is more important than technique? Then get a faster car. Otherwise stick to the Corolla.

JoshTriplett 1 year ago |

This is a side note to the main point being made, but on modern CPUs, "rep movsb" is just as fast as the fastest vectorized version, because the CPU knows to accelerate it. The name of the kernel function "copy_user_enhanced_fast_string" hints at this: the CPU features are ERMS ("Enhanced Repeat Move String", which makes "rep movsb" faster for anything above a certain length threshold) and FSRM ("Fast Short Repeat Move String", which makes "rep movsb" faster for shorter moves too).

donaldihunter 1 year ago |

Something I didn't see mentioned in the article about AVX512, aside from the xsave/xrstor overhead, is that AVX512 is power hungry and causes CPU frequency scaling. See [1], [2] for details and as an example of how nuanced it can get.

[1] https://www.intel.com/content/dam/www/central-libraries/us/e...

[2] https://www.intel.com/content/www/us/en/developer/articles/t...

Narishma 1 year ago | |

That is only the case in specific Intel CPU models.

nitwit005 1 year ago |

Just about every form of IPC is "slow". You have decided to pay a performance cost for safety.

marcosdumay 1 year ago | |

You shouldn't have to pay that much. Pipes give you almost nothing, so they should cost almost nothing.

Specifically, there aren't many reasons for your fastest IPC to be slower than a long function call.

nitwit005 1 year ago | | |

If you don't think pipes offer much, don't use them.

Saying "long function call" doesn't mean much since a function can take infinitely long.

brigade 1 year ago | |

Pipes don’t exist for safety, they exist as an optimization to pass data between existing programs.

PaulDavisThe1st 1 year ago | | |

NOT writing and reading to and from a file stored on a drive is not, in this context, an optimization, but a significantly freeing conceptual shift that completely transforms how a class of users conducts themselves when using the computer.

nitwit005 1 year ago | | |

The safety is memory protection. If you don't care about memory protection, you can reduce IPC to passing pointers around.

qsantos 1 year ago |

I am again getting the hug of death of Hacker News. The situation is better than the last time thanks to caching WordPress pages, but loading the page can still take a few seconds, so bear with me!

RevEng 1 year ago |

I didn't quite grasp why the original splice has to be so slow. They pointed out what made it slower than vmsplice - in particular allocating buffers and using scalar instructions - but why is this necessary? Why couldn't splice just be reimplemented as vmsplice? I'm sure there is a good reason, but I've missed it.

Izkata 1 year ago | |

> Why couldn't splice just be reimplemented as vmsplice?

A possible answer that's currently just below your comment: https://news.ycombinator.com/item?id=41351870

> vmsplice doesn't work with every type of file descriptor.

rwmj 1 year ago |

Be interesting to see a version using io_uring, which I think would let you pre-share buffers with the kernel avoiding some copies, and avoid syscall overhead (though the latter seems negligible here).

qsantos 1 year ago | |

That sounds like a good idea!

rwmj 1 year ago | | |

I'm not claiming it'll be faster! Additionally io_uring has its own set of challenges, such as whether it's better to allocate one ring per core or one ring per application (shared by some or all cores). Pre-sharing buffers has trade-offs too, particularly in application complexity [alignment, you have to be careful not to reuse a buffer before it is consumed] versus the efficiency of zero copy.

stabbles 1 year ago |

A bold claim for a blog that takes about 20 seconds to load.

yas_hmaheshwari 1 year ago | |

This post has gone to the top of Hacker News, so I think we should cut him some slack

Looks like an amazing article, and so much to learn on what happens under the hood

ben-schaaf 1 year ago | | |

HN generates ~20k page views over the course of a day with a peak of 2k/h: https://harrisonbroadbent.com/blog/hacker-news-traffic-spike.... At ~1MB per page load - not sure how accurate this is, I don't think it fully loaded - this static blog post requires 0.55MB/s to meet demand. An original Raspberry Pi B (10 Mbps Ethernet) on the average French mobile internet connection (8 Mbps) provides double that.

I don't mean this as a slight to anyone, I just want to point out the HN "hug of death" can be trivially handled by a single cheap VPS without even breaking a sweat.
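
The arithmetic behind those figures can be checked with a few lines. The inputs are the ones quoted above (2k views per hour at peak, ~1MB per page, an 8 Mbps link); the variable names are just for illustration.

```python
# Figures quoted in the comment above.
peak_views_per_hour = 2_000
page_size_mb = 1.0        # megabytes per page load (the commenter's rough estimate)
link_mbit_per_s = 8       # megabits per second

# Bandwidth needed to serve the peak hour, in megabytes per second.
required_mb_per_s = peak_views_per_hour * page_size_mb / 3600

# Convert the link speed from megabits to megabytes (8 bits per byte).
link_mb_per_s = link_mbit_per_s / 8

# How many times over the link covers the demand.
headroom = link_mb_per_s / required_mb_per_s
```

This gives roughly 0.56 MB/s required against 1 MB/s available, so a headroom of about 1.8x, which matches the "double that" estimate.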

wvh 1 year ago | |

I believe that when it's a .fr, they call it nonchalance...

Borg3 1 year ago |

Haha. When I read the title I smiled. Linux pipes slow? Moook.. Now try Cygwin pipes. That's what I call slow!

Anyway, nice article, it's good to know what's going on under the hood.

MaxBarraclough 1 year ago | |

I'd assumed Cygwin pipes are just Windows pipes, is that not the case?

tyingq 1 year ago | | |

Not a comprehensive list of problems, and not current, but this post is a good illustration of the kind of issues people have run into:

https://cygwin.com/pipermail/cygwin-patches/2016q1/008301.ht...

Borg3 1 year ago | | |

It's not that easy. Yeah, they are, but there is a lot of POSIX-like glue inside so they work correctly with select() and other alarms. The code is very complicated.

But still, kudos to the Cygwin developers for creating Cygwin :) Great work, even though it has some issues.

faizshah 1 year ago |

This is a really cool post and that is a massive amount of throughput.

In my experience in data engineering, it’s very unlikely you can exceed 500mb/s throughput of your business logic as most libraries you’re using are not optimized to that degree (SIMD etc.). That being said I think it’s a good technique to try out.

I’m trying to think of other applications this could be useful for. Maybe video workflows?

sixthDot 1 year ago |

> I do not know why the JMP is not just a RET, however.

The jump seems to be generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However, in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/inclu...

[2]: https://stackoverflow.com/a/60579385

yencabulator 1 year ago |

FUSE can be a bit trickier than a single queue of data chunks. Reads from /dev/fuse actually pick the right message to read based on priorities, and there are cases where the message queue is meddled with to e.g. cancel requests before they're even sent to userspace. If you naively switch it to eagerly putting messages into a userspace-visible ringbuffer, you might significantly change behavior in cases like interrupting slow operations. Imagine having to fulfill a ringbuf worth of requests to a misbehaving backend taking 5sec/op, just to see the cancellations at the very end.
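
To put a number on that last scenario, here is a toy model. It is not how FUSE is implemented: the function `backend_seconds` and its parameters are invented for illustration, and it only compares "cancelled requests are dropped before delivery" against "everything was already handed to userspace in FIFO order".

```python
from collections import deque

def backend_seconds(requests, cancelled, seconds_per_op, filter_before_delivery):
    """Seconds a slow backend spends working until the queue is drained (toy model)."""
    queue = deque(requests)
    spent = 0.0
    while queue:
        req = queue.popleft()
        if filter_before_delivery and req in cancelled:
            # Dropped on the kernel side; userspace never sees this request.
            continue
        # Otherwise the backend does the work, whether or not it was cancelled.
        spent += seconds_per_op
    return spent

requests = list(range(8))          # eight queued operations
cancelled = set(range(1, 8))       # all but the first were interrupted
kernel_side = backend_seconds(requests, cancelled, 5.0, filter_before_delivery=True)
eager_ring = backend_seconds(requests, cancelled, 5.0, filter_before_delivery=False)
```

With these made-up inputs the backend spends 5 seconds when cancellations are applied before delivery and 40 seconds when every entry was already in the ring, which is the behavior change the comment warns about.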

nyanpasu64 1 year ago |

How do you gather profiling information for kernel function calls from a user program?

qsantos 1 year ago | |

I'll write an article on the flamegraphs specifically, but to get the data, just follow Julia's article!

https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/

ismaildonmez 1 year ago | | |

Could you clarify how you are testing the speed of the first example, where you are not writing anything to stdout? Thanks.

jvanderbot 1 year ago |

> Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

I think you need to recompile your compiler, or disable those explicitly via link / cc flags. It is fairly hard to coax compilers into, or dissuade them from, emitting SIMD instructions, IMHO.

arendtio 1 year ago |

I know pipes primarily from shell scripts. Are they being used in other contexts as extensively, too? Like C or Rust programs?

guenthert 1 year ago | |

Most shells are C programs, so it's clearly possible to use pipes there (and consequently in any language with a C FFI, including Rust), and it's done. It's cumbersome though. That's why there are so-called glue languages, including shells. Compilers and similar tools might establish pipes and will do so directly, rather than via a glue language. Perhaps a grep through GitHub could tell how 'extensive' this truly is.

For the data transfer rate it doesn't matter how (using which language) the pipe is established; C and Rust and the like will have a (small) edge in start-up time (latency), though.
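
As an illustration of a program establishing a pipe directly, without going through a shell, here is a minimal sketch. It is in Python for brevity; a C program would do the same thing with pipe(2), fork(2) and dup2(2). The helper name `run_pipeline` and the two tiny child programs are made up for the example.

```python
import subprocess
import sys

def run_pipeline(text):
    """Run `producer | consumer` with a real pipe and return the consumer's word count."""
    # Producer: writes its argument to stdout, which is the write end of the pipe.
    producer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", text],
        stdout=subprocess.PIPE,
    )
    # Consumer: reads the pipe on stdin and prints the number of words.
    consumer = subprocess.Popen(
        [sys.executable, "-c", "import sys; print(len(sys.stdin.read().split()))"],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
    )
    # Drop the parent's copy of the read end so only the consumer holds it.
    producer.stdout.close()
    out, _ = consumer.communicate()
    producer.wait()
    return int(out)
```

Calling `run_pipeline("one two three")` returns 3. The data flows between the two children through the kernel pipe; the parent only wires up the file descriptors.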

up2isomorphism 1 year ago |

Someone tasted bread and found it not sweet enough, which is fine. But calling the bread bland is funny, because bread is not meant to taste sweet.

jeremyscanvic 1 year ago |

Great post! I didn't know about vmsplice(2). I'm glad to see a former ENSL student here as well!

qsantos 1 year ago | |

Hey!

goodpoint 1 year ago |

Excellent article even if, to be honest, the title is clickbait.

chmaynard 1 year ago | |

Agreed. Titles that don't use quantifiers are almost always misleading at best.

mparnisari 1 year ago |

I get PR_CONNECT_RESET_ERROR when trying to open the page

qsantos 1 year ago | |

My server struggles a bit with the load on the WordPress site. You should be fine just reloading. I will make sure to improve things for the next time!

cowsaymoo 1 year ago |

What is the library used to profile the program?

tzury 1 year ago | |

https://linux.die.net/man/1/pv

it is in the pipe command `... | pv > /dev/null`
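
If pv is not at hand, a rough stand-in for the same measurement can be sketched as follows. This is an assumption-laden approximation: the function name `pipe_throughput` and its defaults are invented here, it uses plain read/write from a writer thread, and Python's overhead means the numbers will not be comparable to the article's.

```python
import os
import threading
import time

def pipe_throughput(total_bytes=64 * 1024 * 1024, chunk_size=64 * 1024):
    """Push total_bytes through an anonymous pipe; return (bytes_received, bytes_per_second)."""
    read_fd, write_fd = os.pipe()

    def writer():
        chunk = bytes(chunk_size)  # a buffer of zero bytes, like reading /dev/zero
        remaining = total_bytes
        try:
            while remaining > 0:
                # os.write may accept fewer bytes than offered, so count what was taken.
                remaining -= os.write(write_fd, chunk[:remaining])
        finally:
            os.close(write_fd)  # closing the write end is what gives the reader EOF

    thread = threading.Thread(target=writer)
    start = time.perf_counter()
    thread.start()

    received = 0
    while True:
        data = os.read(read_fd, chunk_size)
        if not data:  # empty read means the writer closed its end
            break
        received += len(data)

    elapsed = time.perf_counter() - start
    thread.join()
    os.close(read_fd)
    return received, received / elapsed
```

Dividing the second return value by `2**30` gives GiB/s, the unit pv reports.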

throw12390 1 year ago | | |

`pv --discard` is faster by 8% (on my system).

  % pv </dev/zero >/dev/null
  54.0GiB/s

  % pv </dev/zero --discard
  58.7GiB/s
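
As a quick check, the relative difference between those two readings works out as follows (the variable names are only illustrative):

```python
# Throughput readings from the two pv runs above, in GiB/s.
to_dev_null = 54.0
with_discard = 58.7

# Relative speedup of --discard over redirecting to /dev/null.
speedup = (with_discard - to_dev_null) / to_dev_null
```

That is about 0.087, i.e. a little under 9%, consistent with the quoted figure of roughly 8%.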

djaouen 1 year ago |

So is Python, but I'm still gonna use it lol

jheriko 1 year ago |

just never use pipes. they are some weird archaism that need to die :P

the only times i've used them were because of external constraints. they are just not useful.