Tutorial – Write a System Call

Tutorial – Write a System Call (brennan.io)

198 points by zerognowl 9 years ago | 16 comments

I decided to read ep1[0] too and I saw a picture "use all the memory". I don't know if it's funnier that I checked if you have an "alt" HTML tag or that you actually wrote the text from the picture. People with alt tags are MVPs. :)

[0] - https://brennan.io/2016/10/13/kernel-dev-ep1/

phillc73 9 years ago

Do you remember when mousing over a picture would show the alt tag in something resembling a tooltip? (Maybe it can still be set like this, but isn't default in Chrome or Firefox anymore.) I recall being quite amused sometimes at what people would write as their alt text.

prashnts 9 years ago

I think the `title` attribute is used to show the tooltips.

> I recall being quite amused sometimes at what people would write as their alt text.

XKCD :)

brenns10 9 years ago

The alt text came from the link title in the markdown :)

thirdreplicator 9 years ago

Thoroughly enjoyed the tutorial, but why would one want to make a custom system call? What superpowers does this give you? Thanks in advance for your answers.

geofft 9 years ago

It's your best interface with the kernel. It's simple and high-performance. It's specifically what you want if you want to pass structured data in-memory to the kernel.

In a strict technical sense, there's nothing you need a syscall for: you can just read/write data (or maybe do an ioctl) on a new device node or something. In fact, OpenAFS supports routing its "syscall" on Linux through ioctls on /proc/fs/openafs/syscall, because Linux deliberately makes it annoying to patch the syscall table from a kernel module, to make life harder for rootkits.

However, it's simpler to pass data structures if you can use a syscall. It's much higher-performance than opening a file node. And if you expect to run in an environment where you don't know if a particular file will exist (e.g., a chroot), it's useful to use a syscall directly, because that's always available. For instance, getrandom was added in July 2014 partly for this reason, and partly so that if you ran out of file descriptors to open /dev/urandom you could still get randomness.

Here are all the syscalls added in the last two years:

* pkey_mprotect, pkey_alloc, pkey_free: support for a new Intel processor feature, Memory Protection Keys https://lwn.net/Articles/643797/

* preadv2, pwritev2: add a flags argument so you can do a non-blocking preadv or pwritev without opening the file in non-blocking mode https://lwn.net/Articles/670231/

* copy_file_range: copy data between two file descriptors, using filesystem support for efficient copies if possible https://lwn.net/Articles/659523/

* mlock2: add a flags argument so you can mlock memory when it's next accessed https://lwn.net/Articles/650538/

* membarrier: force a memory barrier on all running threads to help with userspace RCU, garbage collection, etc. http://man7.org/linux/man-pages/man2/membarrier.2.html

* userfaultfd: implement userspace paging https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt

* execveat: a version of execve that takes a file descriptor (or a fd and relative path) instead of a string to execute http://man7.org/linux/man-pages/man2/execveat.2.html
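To make one of these concrete, here is a rough userspace sketch of copy_file_range, going through Python's `os.copy_file_range` wrapper (Python 3.8+ on Linux). The helper name `copy_file` is made up for illustration, and the read/write fallback is an assumption about what you'd want on kernels or filesystems where the syscall isn't usable, not something the syscall itself provides.

```python
import os


def copy_file(src_path, dst_path):
    """Copy src to dst, preferring copy_file_range(2); return bytes copied.

    Hypothetical helper: falls back to a plain read/write loop if the
    syscall is missing (old Python, non-Linux) or fails (old kernel, etc.).
    """
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            remaining = os.fstat(src).st_size
            copied = 0
            use_syscall = hasattr(os, "copy_file_range")
            while remaining > 0:
                if use_syscall:
                    try:
                        # The kernel moves the data itself; with offsets left
                        # as None, both file offsets advance automatically.
                        n = os.copy_file_range(src, dst, remaining)
                    except OSError:
                        use_syscall = False  # retry this chunk the slow way
                        continue
                else:
                    # Fallback: bounce the data through a userspace buffer.
                    data = os.read(src, min(remaining, 1 << 16))
                    view = memoryview(data)
                    while view:
                        view = view[os.write(dst, view):]
                    n = len(data)
                if n == 0:  # source shrank underneath us
                    break
                copied += n
                remaining -= n
            return copied
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

The interesting part is what's absent in the fast path: no buffer in the calling process at all, which is what lets a filesystem substitute something cheaper (like a reflink) when it can.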

eximius 9 years ago

Hmm... This is certainly very interesting. Can anyone think of any neat kernel-only things that one might implement for kicks as a learning project? Particularly for someone who hasn't done kernel programming? It could definitely be a silly thing, but probably more useful than printing to the kernel log.

jevinskie 9 years ago

Providing guaranteed access to random numbers has been a recent example of a new, badly needed, but fairly simple syscall. With getrandom(), you avoid the complexities of open/read/close and their associated error handling.

https://lwn.net/Articles/606141/
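For a feel of the difference, here is a minimal sketch that puts the two paths side by side, using Python's `os.getrandom` wrapper (Linux, Python 3.6+). The function name `random_bytes` is hypothetical, and treating /dev/urandom as the fallback is an assumption for illustration.

```python
import os


def random_bytes(n):
    """Return n random bytes, preferring getrandom(2) over /dev/urandom."""
    try:
        # Syscall path: no file descriptor, no dependency on /dev existing.
        data = b""
        while len(data) < n:
            # getrandom may return fewer bytes than requested, so loop.
            data += os.getrandom(n - len(data))
        return data
    except (AttributeError, OSError):
        # Device-node path: open/read/close, each of which can fail
        # (missing node in a chroot, no free file descriptors, ...).
        fd = os.open("/dev/urandom", os.O_RDONLY)
        try:
            data = b""
            while len(data) < n:
                chunk = os.read(fd, n - len(data))
                if not chunk:
                    raise OSError("unexpected EOF on /dev/urandom")
                data += chunk
            return data
        finally:
            os.close(fd)
```

Calling `random_bytes(32)` gives back 32 bytes either way; the point is how much more can go wrong in the second branch.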

brenns10 9 years ago

Character devices are a fertile ground for cool projects in my opinion. They're not very hard to make (typically just a kernel module, no booting a custom kernel), most unix tools interact with them naturally because they're just files, and they can do many interesting things within the kernel. One of my recent projects (after system calls) was to create a "chat server" in the kernel with a character device. Good references are Robert Love's Linux Kernel Development, 3rd edition, and the Linux Kernel Module Development Guide.
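As a small illustration of the "they're just files" point, here is a sketch that pokes at two character devices every Linux box already has. /dev/zero and /dev/null are stand-ins here for whatever node a custom module would create, and the helper names are made up for the example.

```python
import os
import stat


def is_char_device(path):
    """True if path is a character device node."""
    return stat.S_ISCHR(os.stat(path).st_mode)


def device_numbers(path):
    """Return the (major, minor) pair the kernel uses to pick the driver."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def read_zeros(n):
    """Read n bytes from /dev/zero with ordinary file I/O."""
    # buffering=0 gives a raw file object, so this is a single read(2)
    # that ends up in the driver's read handler inside the kernel.
    with open("/dev/zero", "rb", buffering=0) as dev:
        return dev.read(n)


def discard(data):
    """Write data to /dev/null; the driver accepts and drops it."""
    with open("/dev/null", "wb", buffering=0) as dev:
        return dev.write(data)
```

Nothing here knows it is talking to a driver, which is why cat, echo, and dd work against a home-made device without any extra effort.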

voltagex_ 9 years ago

Great tutorial. Just a tip - if you change www.kernel.org to cdn.kernel.org you'll get a closer mirror site.

brenns10 9 years ago

Oh very nice, thanks for the tip!

dezgeg 9 years ago

One correction to the strncpy_from_user part, specifically this:

> The process could try to read another process’s memory by giving a pointer that maps into another process’s address space.

This cannot happen: there is no such thing as "a pointer that maps into another process's address space". A virtual address in Linux (on x86 and probably almost all arches) accesses either the process's own memory map (where access to unmapped addresses causes a fault even when done from ring 0) or the kernel virtual mapping.
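You can watch the fault handling from userspace with a rough sketch like the one below. It calls libc's `write` through ctypes with a raw address, so the kernel has to copy from whatever that address means in the calling process. The helper name `write_from_address` is hypothetical, and using address 8 as "definitely unmapped" is an assumption (it holds on typical Linux setups, where the lowest pages cannot be mapped by default).

```python
import ctypes
import errno
import os

# Load the C library already linked into the interpreter.
libc = ctypes.CDLL(None, use_errno=True)
libc.write.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t


def write_from_address(fd, address, length):
    """Call write(2) with a raw user-space address; return (result, errno)."""
    ctypes.set_errno(0)
    result = libc.write(fd, address, length)
    return result, ctypes.get_errno()


def demo():
    """Return the outcomes of a write from a bad and from a good address."""
    # A pipe is used because its write path really copies from the buffer
    # (writing to /dev/null would succeed without ever touching it).
    r, w = os.pipe()
    try:
        # Assumed-unmapped address: the kernel's copy faults and the
        # syscall reports the problem instead of reading someone else's memory.
        bad = write_from_address(w, 8, 16)

        # An address that is mapped in this process works normally.
        buf = ctypes.create_string_buffer(b"hello")
        good = write_from_address(w, ctypes.addressof(buf), 5)
        echoed = os.read(r, 5)
        return bad, good, echoed
    finally:
        os.close(r)
        os.close(w)
```

On a typical Linux system the bad write comes back as -1 with EFAULT: the address is interpreted in the caller's own memory map, and there simply is nothing there.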

rogerb 9 years ago

Really cool tutorial, thanks for writing this up!

xenadu02 9 years ago

I thought Linux uses sysenter/sysexit, not int 0x80/iret?

Still a good tutorial; there is no magic, it's all just software.

conductor 9 years ago

You are right: Linux uses INT 0x80 on x86 only when the SYSCALL/SYSENTER and SYSRET/SYSEXIT instructions are not available.

brenns10 9 years ago

Yes, I have a few inaccuracies I'm correcting right now. That is one of them.