Tutorial – Write a System Call (brennan.io)
> I recall being quite amused sometimes at what people would write as their alt text.
XKCD :)
In a strict technical sense, there's nothing you need a syscall for: you can just read/write data (or maybe do an ioctl) on a new device node or something. In fact, OpenAFS supports routing its "syscall" on Linux through ioctls on /proc/fs/openafs/syscall, because Linux deliberately makes it annoying to patch the syscall table from a kernel module, so as to make life harder for rootkits.
However, it's simpler to pass data structures if you can use a syscall. It's much higher-performance than opening a file node. And if you expect to run in an environment where you don't know if a particular file will exist (e.g., a chroot), it's useful to use a syscall directly, because that's always available. For instance, getrandom was added in July 2014 partly for this reason, and partly so that if you ran out of file descriptors to open /dev/urandom you could still get randomness.
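To make the getrandom point concrete, here is a minimal sketch in Python, which exposes the syscall as os.getrandom on Linux. The helper name random_bytes and the fallback policy are my own assumptions for illustration, not anything from the tutorial.

```python
import os


def random_bytes(n: int) -> bytes:
    """Return n random bytes, preferring the getrandom syscall.

    Hypothetical helper: the syscall path needs no file descriptor and no
    /dev/urandom node, so it also works in a bare chroot.
    """
    if hasattr(os, "getrandom"):
        try:
            buf = b""
            # getrandom may return fewer bytes than requested, so loop.
            while len(buf) < n:
                buf += os.getrandom(n - len(buf))
            return buf
        except OSError:
            # Assumed fallback, e.g. a kernel too old to have the syscall.
            pass
    # File-based path: needs the device node to exist and a free fd.
    with open("/dev/urandom", "rb") as f:
        return f.read(n)


if __name__ == "__main__":
    print(random_bytes(16).hex())
```

The fallback branch is exactly the part that can fail in a chroot or when the process has run out of file descriptors.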
Here are all the syscalls added in the last two years:
* pkey_mprotect, pkey_alloc, pkey_free: support for a new Intel processor feature, Memory Protection Keys https://lwn.net/Articles/643797/
* preadv2, pwritev2: add a flags argument so you can do a non-blocking preadv or pwritev without opening the file in non-blocking mode https://lwn.net/Articles/670231/
* copy_file_range: copy data between two file descriptors, using filesystem support for efficient copies if possible https://lwn.net/Articles/659523/
* mlock2: add a flags argument so you can mlock memory when it's next accessed https://lwn.net/Articles/650538/
* membarrier: force a memory barrier on all running threads to help with userspace RCU, garbage collection, etc. http://man7.org/linux/man-pages/man2/membarrier.2.html
* userfaultfd: implement userspace paging https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt
* execveat: a version of execve that takes a file descriptor (or an fd and a relative path) instead of a path string to execute http://man7.org/linux/man-pages/man2/execveat.2.html
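As an illustration of one entry from the list above, here is a rough sketch of using copy_file_range from Python (os.copy_file_range, available on Linux with Python 3.8+). The copy_fd helper and the set of errno values that trigger the plain read/write fallback are my assumptions, not a definitive recipe.

```python
import errno
import os

# Assumed set of errors that mean "this kernel or filesystem can't do it".
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def copy_fd(src_fd: int, dst_fd: int, length: int) -> int:
    """Copy up to `length` bytes between two fds; return the bytes copied.

    Hypothetical helper: tries copy_file_range first so the kernel or
    filesystem can do the copy, then falls back to a read/write loop.
    """
    copied = 0
    use_cfr = hasattr(os, "copy_file_range")
    while copied < length:
        remaining = length - copied
        if use_cfr:
            try:
                n = os.copy_file_range(src_fd, dst_fd, remaining)
            except OSError as exc:
                if exc.errno in _FALLBACK_ERRNOS:
                    use_cfr = False
                    continue
                raise
        else:
            chunk = os.read(src_fd, min(remaining, 65536))
            n = len(chunk)
            view = memoryview(chunk)
            while view:  # os.write may write less than it was given
                view = view[os.write(dst_fd, view):]
        if n == 0:  # end of input
            break
        copied += n
    return copied
```

Both paths advance the file offsets of the two descriptors, so switching to the fallback partway through picks up where the syscall left off.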
> The process could try to read another process’s memory by giving a pointer that maps into another process’s address space.
This cannot happen: there is no such thing as "a pointer that maps into another process's address space". A virtual address in Linux (on x86 and probably almost all arches) accesses either the process's own memory map (where access to unmapped addresses causes a fault even when done from ring 0) or the kernel virtual mapping.
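You can poke at this from userspace. The sketch below hands write(2) a raw address that is not mapped in the calling process, going through ctypes so no Python buffer is involved. The address 8 and the helper name write_from_address are arbitrary choices of mine; on Linux I'd expect the call to fail with EFAULT, since the kernel's copy from user memory faults on the unmapped page rather than reaching anyone else's memory.

```python
import ctypes
import errno
import os

# Load the C library of the running process (works on Linux).
libc = ctypes.CDLL(None, use_errno=True)
libc.write.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
libc.write.restype = ctypes.c_ssize_t


def write_from_address(fd: int, address: int, length: int = 1):
    """Call write(2) with a raw user address; return (result, errno).

    Hypothetical helper for experimentation only.
    """
    ctypes.set_errno(0)
    result = libc.write(fd, ctypes.c_void_p(address), length)
    return result, ctypes.get_errno()


if __name__ == "__main__":
    # Use a pipe: writing to /dev/null would never read the buffer at all.
    read_end, write_end = os.pipe()
    try:
        result, err = write_from_address(write_end, 8)
    finally:
        os.close(read_end)
        os.close(write_end)
    print(result, errno.errorcode.get(err, err))
```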
Still a good tutorial; there is no magic, it's all just software.