Implement unprivileged chroot

Implement unprivileged chroot (cgit.freebsd.org)

231 points by 0mp 4 years ago | 57 comments

EdSchouten 4 years ago |

FreeBSD already effectively supported something like this, but in what is, in my opinion, a better way.

You can call cap_enter(), which disables open(), unlink(), mkdir(), etc. entirely. You can, however, still use openat(), unlinkat(), mkdirat() with relative paths that expand to a location underneath a directory file descriptor. This achieves the same thing, except that you can now have as many chroots as you want. Not just one.
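A minimal sketch of that style, written in Python rather than C and without actually calling cap_enter(): the `dir_fd` argument of the `os` functions maps onto openat()/mkdirat(). The helper names (`write_text`, `read_text`, `demo`) are made up for illustration. The same relative path resolves to different files depending on which directory descriptor is passed in, which is the "as many chroots as you want" part.

```python
import os
import tempfile


def write_text(dir_fd, relpath, text):
    # Roughly openat(dir_fd, relpath, O_WRONLY | O_CREAT | O_TRUNC, 0600).
    fd = os.open(relpath, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600, dir_fd=dir_fd)
    with os.fdopen(fd, "w") as f:
        f.write(text)


def read_text(dir_fd, relpath):
    # Roughly openat(dir_fd, relpath, O_RDONLY).
    fd = os.open(relpath, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd) as f:
        return f.read()


def demo():
    # Two independent "roots", each represented only by a directory descriptor.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        root_a = os.open(a, os.O_RDONLY | os.O_DIRECTORY)
        root_b = os.open(b, os.O_RDONLY | os.O_DIRECTORY)
        try:
            for root, label in ((root_a, "A"), (root_b, "B")):
                os.mkdir("etc", dir_fd=root)  # roughly mkdirat(root, "etc", ...)
                write_text(root, "etc/name", label)
            # Same relative path, different directory descriptor, different file.
            return read_text(root_a, "etc/name"), read_text(root_b, "etc/name")
        finally:
            os.close(root_a)
            os.close(root_b)
```

Note that nothing in this sketch enforces confinement; it only shows the calling convention. In capability mode the kernel is what rejects lookups that leave the directory.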

Unfortunately, the idea never caught on, because virtually no software on UNIX uses the *at() functions. Also: the non-*at() functions are still available as symbols, meaning that you can't perform simple compile-time checks to ensure that your application works properly when this form of sandboxing is enabled. It turns out that off-the-shelf software (e.g., libraries) ends up misbehaving in unpredictable ways if you disable ~50% of the POSIX API.

It's a shame, because this feature effectively requires you to treat the file system in an object oriented/dependency injected way. Pretty good from a reusability/testability perspective.

jerf 4 years ago | |

One of my minor disappointments with Go, considering the time it came out and the UNIX heritage that it descended from, was that it didn't prioritize the *at() functions. It's difficult, if not virtually impossible, to write secure code with the "traditional" path-based system because every time you do one thing, then some other thing to a path that has some sort of security implication, you've written a TOCTOU problem if somebody can wedge between those two things to change some critical aspect of the file.

It's hard for me to blame programmers for not using these functions more when hardly any language properly exposes them. But since nobody exposes them, nobody's aware they should use them.... chicken & egg strike again.
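To illustrate the kind of pattern I mean (sketched in Python, not Go; `read_regular_file` and `demo` are hypothetical names): instead of checking a path and then opening it, open it once relative to a directory descriptor and make every later decision on the descriptor you got back. The name is resolved exactly once, so nobody can swap the file between the check and the use.

```python
import os
import stat
import tempfile


def read_regular_file(dir_fd, name):
    """Open first, then inspect the descriptor; the name is never re-resolved."""
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    try:
        # fstat() describes the object we actually opened, not whatever
        # the path happens to point at right now.
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise PermissionError(f"{name!r} is not a regular file")
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "a.txt"), "wb") as f:
            f.write(b"data")
        os.mkdir(os.path.join(d, "sub"))
        dir_fd = os.open(d, os.O_RDONLY | os.O_DIRECTORY)
        try:
            data = read_regular_file(dir_fd, "a.txt")
            try:
                read_regular_file(dir_fd, "sub")
                rejected = False
            except PermissionError:
                rejected = True
            return data, rejected
        finally:
            os.close(dir_fd)
```

This is only a sketch of the idiom under those assumptions; a real program would also have to think about FIFOs, devices, and intermediate path components.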

tines 4 years ago | | |

But openat, for example, is still path-based; it just changes the directory that the path is relative to. If you give it an absolute path, it will open it, and I didn't see any reason in the man page why you couldn't just pass in a bunch of ../../ as the usual exploits do. Maybe you're referring to another category of bugs?
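A small sketch of what I mean, in Python where `dir_fd` corresponds to the openat() directory argument (the function name is made up for the example). With no capability mode and no extra flags, a `../` path walks straight out of the directory:

```python
import os
import tempfile


def escapes_with_plain_openat():
    """Return True if a '../' path opened relative to a directory fd leaves that directory."""
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "sandbox"))
        with open(os.path.join(top, "secret.txt"), "w") as f:
            f.write("outside")
        dir_fd = os.open(os.path.join(top, "sandbox"), os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Plain openat() semantics: ".." is resolved like in any other lookup.
            fd = os.open("../secret.txt", os.O_RDONLY, dir_fd=dir_fd)
            with os.fdopen(fd) as f:
                return f.read() == "outside"
        finally:
            os.close(dir_fd)
```

So any confinement has to come from somewhere else, for example from the kernel refusing such lookups once the process is in capability mode, or from a resolve-beneath style flag where the OS offers one.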

donio 4 years ago | | |

os.Open has been using openat since 2015.

https://github.com/golang/go/commit/e7a7352e527ca275a2b66cc3...

catlifeonmars 4 years ago | | |

I’m confused. How would using *at() APIs prevent race conditions?

c0l0 4 years ago | |

FWIW (and iirc), with programs using recent-ish glibc, you will never see a call to open() in the wild unless the program takes special care to bypass the implicit libc wrapper. glibc will transparently convert these calls to openat() under its own hood. I do notice that this probably doesn't do you any good on FreeBSD, though :)

markjdb 4 years ago | | |

This is mostly true on FreeBSD as well. The real problem is that capability mode also disallows openat(AT_FDCWD) - there has to be an explicit directory descriptor.

aduitsis 4 years ago | |

Mildly off-topic note, the parent is the author of CloudABI (https://github.com/NuxiNL/cloudlibc), which was (in my opinion) a truly brilliant approach to running untrusted code in a FreeBSD system.

toast0 4 years ago | |

Capabilities mode is useful, but it's very difficult to apply to programs that don't fit the model.

If you need to make network connections, you have to do that before entering capabilities mode, because there is no capability to allow it later. You can work through a proxy program, but adding that complexity doesn't seem worthwhile to me unless your program to be sandboxed is very complex.

I haven't worked with OpenBSD's pledge, but the idea of being able to end use of specific dangerous things seems more widely applicable.

EdSchouten 4 years ago | | |

> You can work through a proxy program, but adding that complexity doesn't seem worthwhile to me unless your program to be sandboxed is very complex.

I would love it if all network connections of all programs were created through a proxy. It would allow me to do load balancing, firewalling, tunneling, packet capturing, etc. entirely in userspace, without needing to rely on administrative features like pf/iptables, tun/tap, bpf, etc.

You see that in Kubernetes land folks are trying to achieve the same thing by using so-called service meshes (e.g., https://istio.io ). Right now those systems launch a proxy next to every container. For projects like these, it would have been so much easier if UNIX-like systems already had a standard for making the network stack used by a program injectable.

phicoh 4 years ago | |

The problem is that many libraries need access to configuration files or other stuff that comes with the library.

So if you start with a system that has some form of persistent objects, then very quickly a root namespace object is created to solve those library issues.

And then you are mostly back to a Unix root directory.

wahern 4 years ago | | |

cap_enter can be invoked after library initialization. Libraries can open the files and directories they need during initialization.

A single jailed root is where you end up when you take the route of putting software into sandboxes for which they weren't designed, because now you need to emulate a traditional environment.

pledge and unveil are a middle ground, albeit closer to Capsicum, in that they're much more accommodating of existing software patterns. But they do still require application refactoring. OpenBSD has refactored their entire userland codebase this way. That typically involves identifying the necessary resources a program needs and either shifting their acquisition to before privilege dropping (i.e. early in main), or arranging so that they're subsequently accessible (e.g. using unveil).

It's a shame Linux never merged the Capsicum patches. While pledge and unveil are more convenient from a developer perspective, they can't easily be adopted in a standardized way by other operating systems, like Linux. Capsicum was the closest thing we could have gotten to a standardized sandboxing model in the POSIX universe. If it became widely available (cough Linux), I believe a large chunk of software, especially critical network-facing software, would slowly migrate; and an ecosystem of idioms, patterns, and libraries would evolve to increasingly smooth the transition.

What's doubly shameful is that Capsicum is architecturally extremely simple. In principle it would be easy for any POSIX system to adopt. The APIs are trivial, and Linux is already nearly there now that it has process descriptors and an openat that can prevent parent directory traversal. Most of the leg work is in blocking access, after cap_enter has been invoked, to non-standard interfaces and syscalls that expose resources.

silon42 4 years ago | |

You would need to standardize passing of current root as a file handle, I think? Probably will break some software...

GoblinSlayer 4 years ago | |

Why not treat open(path) as openat(AT_FDCWD,path)?

wizzwizz4 4 years ago | | |

Because cap_enter() blocks that too.

stabbles 4 years ago |

On many Linux distros you can already do this with user namespaces:

    $ mkdir rootfs
    $ docker export $(docker create ubuntu:20.04) | tar -C rootfs -xf -
    $ unshare -r chroot rootfs bash
    # ls
    bin   dev  home ...

Very often when you use chroot you also want unprivileged mounts, in particular overlay mounts if you don't want to mutate the underlying rootfs. You can do that with mount namespaces: `unshare -rm`, but you need Linux kernel 5.13 (or a distro with a patched kernel like Ubuntu) to allow unprivileged overlayfs.

dividuum 4 years ago | |

An alternative to unshare is also bubblewrap (https://github.com/containers/bubblewrap) which also sets up a new namespace. You can build up your own new filesystem by binding existing paths into the new root and then run a process within it:

    $ mkdir -p root/bin
    $ cp /bin/busybox root/bin/
    $ bwrap --bind root / /bin/busybox sh

    BusyBox v1.27.2 (Ubuntu 1:1.27.2-2ubuntu3.3) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    / $ ls -l /
    total 0
    drwxrwxr-x    2 1000     1000            60 Jul 22 11:07 bin

Cloudef 4 years ago | | |

I used bubblewrap to build lightweight containers on top of Arch + pacman. Basically you could install packages on overlays of the host and do whatever you wanted there without affecting the host fs. It was pretty nice.

gigatexal 4 years ago | | |

Interesting. Going to check out Bubblewrap

rkeene2 4 years ago | |

As an alternative, one can also use User Mode Linux (UML) to implement a pretty fancy chroot (and fakeroot).

It can do a few things userns can't, like load kernel modules. I've had to use this to deal with bugs in BtrFS before.

gunapologist99 4 years ago | | |

I love UML. I used to use it all the time. Is it still developed? It was really a pretty slick system and very easy to work with.

http://user-mode-linux.sourceforge.net/

(another good site: https://wiki.archlinux.org/title/User-mode_Linux )

marcodiego 4 years ago |

*BSD have been quite innovative recently. The pledge and unveil syscalls, although achievable by other means on Linux, are very simple and effective for what they do. I don't know of a way on Linux to run a system out of a directory without being root; even if that is possible, I'd still need root to mount --bind some dirs. It is definitely something I'd like to do.

I don't think containers should be needed for that.

geofft 4 years ago | |

On Linux, you can do

    unshare --user --mount --map-root-user chroot /path/to/whatever

and if you need to bind-mount some directories, you can do that before the chroot, e.g.,

    $ unshare --user --mount --map-root-user
    # mount --bind /proc /path/to/whatever/proc
    # mount --bind /sys /path/to/whatever/sys
    # chroot /path/to/whatever

without being root. (This requires a sysctl to be enabled for unprivileged user namespaces, which is on by default in the kernel.org tree and I think all major distro kernels have it on now. The feature has been in the upstream kernel since 2013.)

If you want to do this at scale, a handy tool is bwrap(1) from https://github.com/containers/bubblewrap . (The README talks about how bwrap is a setuid program to prevent the need for that sysctl, but it also works great as a non-setuid program when that sysctl is enabled, and its value is it has a bunch of handy command-line flags for this sort of thing. We use it extensively at my workplace in non-setuid mode for things that don't quite need containers but need to see alternative root directories etc.)

lima 4 years ago | |

"containers" are just a combination of multiple kernel features, one of which does precisely that (user namespaces).

pjmlp 4 years ago | | |

And were known as vaults on HP-UX 11, back in 2000.

geofft 4 years ago |

I wish Linux would do this. Patches are available: https://lwn.net/Articles/849125/

Yes, you can do this on Linux with a user namespace, but a user namespace changes the view of user accounts. You have to map every usable UID inside the namespace to a UID you control outside the namespace. At best, you can map a range of UIDs you control to "real" users (root, 1000, etc.) inside the namespace, but they won't be real users outside the namespace. If you're on a multi-user system, seeing other people's files as owned by "nobody" is confusing.

It should be enough to use NO_NEW_PRIVS mode, meaning setuid transitions are not allowed. Then it doesn't matter what user IDs you see inside the chroot.

In fact, back when Linux introduced the NO_NEW_PRIVS flag (almost a decade ago!), this was one of the motivating use cases.
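For anyone who hasn't used it, setting the flag is a single prctl(2) call. Here is a hedged sketch in Python via ctypes (the function name is mine; the two constants are copied from `<linux/prctl.h>`). The flag is one-way: once set, it cannot be cleared, and it is inherited by children.

```python
import ctypes
import os
import sys

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39  # from <linux/prctl.h>


def enable_no_new_privs():
    """Set no_new_privs on the calling process and return the flag's value.

    Returns None on non-Linux platforms, where this prctl does not exist.
    """
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    # The kernel requires arg2 == 1 and the remaining arguments to be 0.
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Read the flag back; 1 means setuid/setgid transitions no longer grant privileges.
    return libc.prctl(PR_GET_NO_NEW_PRIVS, zero, zero, zero, zero)
```

After this, exec'ing a setuid binary still works, but it runs without the elevated privileges, which is why what the binary sees inside a chroot stops being a security problem.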

thenoblesunfish 4 years ago |

For those, like me, lacking context, what are the implications of this?

phicoh 4 years ago | |

The key feature of chroot is that you can provide a process with a completely different filesystem view. You can leave out things that exist in the standard view, or change things, such as the contents of system directories.

The problem with traditional chroot is that you can typically import setuid applications in this new space which can get confused, for example by a new /etc/passwd file. For this reason, chroot can be used only by root.

The advantage of such a NO_NEW_PRIVS flag is that this kind of abuse of setuid applications is not possible.

This should make it safe to allow ordinary users to use chroot.

codetrotter 4 years ago | |

chroot is a system call that assigns a limited view of the file system to a process. In particular it makes it so that the specific directory will appear as the top level directory to the process.

Some people like to run for example FTP servers in a chroot so that users have access only to a specific directory and its subdirectories, rather than being able to browse other files on the system.

FreeBSD also has a technology called jails which is what you’d rather use for containerization.

Anyway, previously you had to be root (the Unix admin user) in order to use chroot. FreeBSD now implementing unprivileged chroot means that regular users are able to run processes in chroot as well.

So for example, if you are a regular user on a system, you can now create a subdirectory in your home directory and run an FTP daemon chrooted to that directory and bound to an unprivileged port. Then you can give someone else FTP access to that directory without them being able to see the other files in your home directory, keeping your private data private from them.

tyingq 4 years ago | |

chroot existed, but could only be run as the root user. It was that way to prevent things like this (old actual exploit for Ultrix):

  $ mkdir /tmp/etc
  $ echo root::0:0::/:/bin/sh > /tmp/etc/passwd
  $ mkdir /tmp/bin
  $ cp /bin/sh /tmp/bin/sh
  $ cp /bin/chmod /tmp/bin/chmod
  $ chroot /tmp /bin/login
  # whoami
  root
  # chmod 4700 /bin/sh
  now, log out of the chroot and use your newly minted setuid shell

Since they now have the "NO_NEW_PRIVS" protection, they can let regular users safely use chroot.

jsiepkes 4 years ago | |

You can, for example, run a build in a chroot as an unprivileged user.

krylon 4 years ago |

The commit message does NOT indicate when this will be available to mere mortals like myself.

Can someone enlighten me if this will be part of FreeBSD 14, or if there is a chance it will become available earlier, perhaps with FreeBSD 13.1?

EDIT: The commit message does NOT indicate etc. Silly me.

0mp 4 years ago | |

The commit message does not mention any MFC timeline [1] so this feature is not planned to be merged back into existing stable branches. In other words, the first release with this feature is going to be FreeBSD 14.0-RELEASE.

[1]: Also, you may look for the commit hash (a40cf4175c90142442d0c6515f6c83956336699) at https://mfc.kernelnomicon.org/ to see the back-porting status.

swills 4 years ago | | |

This feature should be in the weekly snapshot pretty soon:

https://download.freebsd.org/ftp/snapshots/ISO-IMAGES/14.0/

HPsquared 4 years ago |

In Linux there's "PRoot" - used by Termux on Android to provide userspace chroot-like functionality (can run Debian, for instance).

https://proot-me.github.io/