Therefore, according to the linux-distros list policy, the exploit must be published within 7 days from this advisory. In order to comply with that policy, I intend to publish both the description of exploitation techniques and also the exploit source code on Monday 15th by email to this list."
Interesting.. they didn't write what conditions have to be met for it to be exploitable. Also interesting that someone screwed up and accidentally forwarded an email including the exploit to a broad mailing list...
Parts of the nf modules are active if you have iptables, which you have if you run ufw (for example), so this would be a pretty broad exploit if that's all that's required. However, the specific module in question in the patch, nf_tables, is not loaded on my Ubuntu 20.04 LTS machine (5.4 kernel) running iptables/ufw at least.
This doesn't matter since Linux has autoloading of most network modules, and you can cause the modules to be loaded on Ubuntu since it supports unprivileged user/net namespaces.
ubuntu:~% grep DISTRIB_DESCRIPTION /etc/lsb-release
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"
ubuntu:~% lsmod|grep nf_table
ubuntu:~% unshare -U -m -n -r
ubuntu:~% nft add table inet filter
ubuntu:~% lsmod|grep nf_table
nf_tables 249856 0
...$ lsmod|grep nf_table (tried without any just to make sure)
...$ unshare -U -m -n -r
unshare: unshare: failed: Operation not permitted
...$ /sbin/nft add table inet filter
Error: Could not process rule: Operation not permitted
add table inet filter
^^^^^^^^^^^^^^^^^^^^^^
root # cat /proc/sys/kernel/unprivileged_userns_clone
0
> Therefore, according to the linux-distros list policy, the exploit must be published within 7 days from this advisory. In order to comply with that policy, [...]
What? Someone publishes information about your vuln to a random mailing list, and this somehow creates an obligation on you to follow that mailing list's policies? I don't get it.
[0] https://oss-security.openwall.org/wiki/mailing-lists/distros
https://oss-security.openwall.org/wiki/mailing-lists/distros
> Please note that the maximum acceptable embargo period for issues disclosed to these lists is 14 days. Please do not ask for a longer embargo. In fact, embargo periods shorter than 7 days are preferable.
You can "apt update; apt upgrade" then reboot when a new kernel is available.
Oracle has also offered Ksplice for free on Ubuntu for many years, and I'm sure that patch will be available promptly.
https://ksplice.oracle.com/try/desktop
Otherwise, Kernelcare is available for a fee. I think Canonical also has paid kernel patches.
"I vaguely recall at least around 6-7 such holes, and a quick google search seems to reveal that at least those would have been mitigated by unprivileged user namespaces being disabled: CVE-2019-18198 CVE-2020-14386 CVE-2022-0185 CVE-2022-24122 CVE-2022-25636 CVE-2022-1966 resp. CVE-2022-32250"
Reminder: the kernel has over ten million lines of code, which compile to megabytes of object code.
Perhaps we should start thinking about whether it is a good idea to run something this large in supervisor mode, with full privileges.
I wouldn't say it is sensible in a world where seL4 exists.
I'd be very interested to hear how this can be done by an unprivileged user.
Try to race set add/removals, sure, but if it depends on the set itself getting deleted, that seems… harder.
Which is the default on Ubuntu.
The NIST CVE page points back here. Funny.
Nothing I see so far specifically says how far back this goes, but, https://security-tracker.debian.org/tracker/CVE-2023-32233
Seems to go back really far.
By that I mean, it might be easy or hard to exploit a bug to achieve LPE, but it seems to be redundant to prove that it is possible.
Context: My experience with C programming is that practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
https://github.com/igo95862/bubblejail
In the next not yet released version 0.8.0 there will be a new option to disable a specific namespace type per sandbox. For example, disabling the network namespace would prevent this exploit.
This is more flexible than globally disabling all user namespaces, as some programs might use other, more harmless namespaces. For example, Steam uses mount namespaces to set up runtime libraries.
It's the same reason it doesn't matter whether you are running as root or not: the user account already has access to what's important, like a database or keychain.
My desktop, on which I am the only person with an account, has 49 "users", of which 11 are actively running a process.
At work, every daemon we run has a dedicated user.
On Android, every app runs as its own user.
If you can CI/CD in minutes a reduced kernel+app and reboot in 100ms your network-facing thing (be it nginx or haproxy) you might just take latest vanilla anyway...
How would we go about GPUs, NICs, and many kinds of drivers?
If you can read exploit code to determine if patching is worth it for your use case, you can probably also read diffs for the same outcome.
I’m not saying don’t release them, but releasing them with short notice seems irresponsible, without much benefit to defenders.
Andy Lutomirski described some concerns of his own:
> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN over /any/ network namespace and to thus access the network configuration API to be a huge risk. For example, unprivileged users can program iptables. I'll eat my hat if there are no privilege escalations in there.
If it says (nf_tables), you are using the compatibility layer from the iptables-nft package.
It works quite well. Apps like Docker that insert rules using the legacy iptables syntax are oblivious to the fact that they are actually inserting nftables rules.
It also provides an easy migration path. Insert your old rules using your iptables script then list them in the new syntax using nft list ruleset.
The problem is that it works so well that it seems most users just stayed with the iptables syntax and did not bother migrating at all.
I somehow got accustomed to the nftables rules format. It is in fact objectively much better than the iptables format in many ways. The native JSON, easy bulk submit to the kernel, built-in sets and maps (the source of the currently discussed CVE though). It really does fix a lot of what was wrong with iptables.
But iptables was probably not broken enough for most users to warrant re-learning everything.
Now, the traffic shaping tool, oof.. I still cannot grok any of it. I've been happy with the fireqos script so far to abstract everything out of the tc syntax.
struct foo *whatever = new_foo();
// use 'whatever'
free_foo(whatever);
if (whatever->did_something) {
    log_message("The whatever did something.");
}
// never use 'whatever' after this point
The 'whatever' variable is used after what it points to is freed, but it's not exploitable. Worst case, if new memory gets allocated in its place and an attacker controls the data in the offset of the 'did_something' field, the attacker can control whether we log a message or not, which isn't a security vulnerability.
I am making assumptions here: That pre-emption is possible (at least some interrupts are enabled), that "whatever" points to virtual memory (some architectures have non-mappable physical memory pointers), and that a page fault at this point is actually harmful.
However I do want to point out that the reasoning why your example is not exploitable isn't as easy as it first seems.
For use-after-free to be exploitable, by definition an attacker must be able to put arbitrary content at the memory region. This is not always easy: may require certain [mis]configuration, data layout and so on.
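Whether or not the 'whatever' snippet above is exploitable, the dangling read itself can be avoided by copying the field out while the object is still alive. A minimal sketch, assuming a made-up struct foo with a single did_something flag; new_foo and free_foo are stand-ins for the functions named in the snippet, and finish_foo is a hypothetical helper that exists only for this illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the 'struct foo' in the snippet above. */
struct foo {
    bool did_something;
};

/* Allocate a zero-initialised foo; returns NULL on allocation failure. */
struct foo *new_foo(void)
{
    return calloc(1, sizeof(struct foo));
}

void free_foo(struct foo *f)
{
    free(f);
}

/*
 * Read the flag while the object is still valid, then free it and
 * clear the caller's pointer so a later use fails loudly instead of
 * silently reading reallocated memory.
 */
bool finish_foo(struct foo **fp)
{
    assert(fp != NULL && *fp != NULL);

    bool did_something = (*fp)->did_something; /* read before free */
    free_foo(*fp);
    *fp = NULL;
    return did_something;
}
```

The caller then logs based on the returned value rather than dereferencing the freed pointer, so nothing depends on what the allocator does with that memory afterwards.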
> practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
I will not contest this claim, however there is a difference between "blow up" and "exploit". A malicious packet being able to segfault a server is one thing; a malicious packet resulting in RCE is quite another. This may be a lost-in-translation moment, where the colloquial use of "exploit" does not include DoS.
priv->set->use++;
To look like:
nf_tables_activate_set(ctx, priv->set);
Where this function is defined as:
void nf_tables_activate_set(const struct nft_ctx *ctx, struct nft_set *set) {
    if (nft_set_is_anonymous(set))
        nft_clear(ctx->net, set);
    set->use++;
}
So to me (someone who is not an expert in this code) it looks like the fix checks whether the set has the anonymous flag and, if so, calls nft_clear on it before changing the reference count. I could be mistaken, but I think your claim that this would be fixed by Rust object lifetime checking requires better evidence.
More generally: doing a line-by-line translation from C to Rust is never going to be the best way to make use of the capabilities Rust has that C lacks.
Agree - and the Linux kernel is extremely fragile because it is full of ad-hoc manual code like that.
Unfortunately, Rust won't be the rescue, because (in the foreseeable future) Rust will only be available in leaf code due to the many hard problems of transitioning from fragile C APIs to something better. Writing drivers in Rust is useful, but limits the scope of how Rust helps.
Many of Rust's advantages at a tiny fraction of the effort could be had easily with a smooth transition path by switching the compiler from C to C++ mode. The fruit hangs so low, it nearly touches the ground, but a silly Linus rejects C++ for the wrong reasons ("to keep the C++ programmers out", wtf).
Every time I work on the Linux kernel source, I'm horrified by how much pain the kernel developers inflict on themselves. Even with C, it would be possible to install a mandatory coding style that is less fragile.
For example, in the aftermath of the Dirty Pipe vulnerability last year, I submitted a patch to make the code less fragile, a coding style that would have prevented the vulnerability: https://lore.kernel.org/lkml/20220225185431.2617232-4-max.ke... - but my patch went nowhere.
Not saying Rust isn't an improvement, it's a huge improvement over C, but there's no reason to oversell it. Rust is not going to make these errors magically go away, at least not in a kernel, even if you wrote the kernel from scratch, all in Rust. Unless you managed to write all of it in safe Rust which... good luck with that.
Hypervisors, userspace drivers, containers, language runtime sandboxes, bytecode deployments, driver and kernel sandboxes (safe kernel / driver guard), container-only distributions, ...
All mitigations to achieve microkernel like capabilities.
VMs are much more than micro kernels. It's about allowing the user to install whatever they want in their machine. Containers are just a userland abstraction. Not sure where the link to microkernels is there.
To elaborate, seL4 claims to be the fastest kernel around[0], a claim that remains unchallenged.
To put it into context, the difference in IPC speed is such that you'd need an order of magnitude more IPC for a multiserver system based on seL4 to actually be slower than Linux.
A multiserver design would imply increased IPC use, but not an order of magnitude.
[0]: https://trustworthy.systems/publications/full_text/Elphinsto...
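The break-even argument above is simple arithmetic, sketched here in C. The function names are made up for this illustration, and any numbers passed in are placeholders, not measurements of seL4 or Linux.

```c
#include <assert.h>

/* Total time spent in IPC: number of operations times cost per operation. */
double total_ipc_cost(double ipc_count, double cost_per_ipc)
{
    return ipc_count * cost_per_ipc;
}

/*
 * Factor by which the IPC count can grow on the system with cheaper IPC
 * before its total IPC cost matches that of the system with costlier IPC.
 */
double break_even_ipc_factor(double slow_cost_per_ipc, double fast_cost_per_ipc)
{
    assert(fast_cost_per_ipc > 0.0);
    return slow_cost_per_ipc / fast_cost_per_ipc;
}
```

With placeholder costs of 1000 and 100 units per IPC, the factor is 10: a multiserver design would have to issue ten times as many IPC operations before its total IPC cost caught up, which is the "order of magnitude" in the comment.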
From an observer on the sidelines: there was no namecalling.
He said you trolled, not that you ate a troll. The distinction is important.
Even the best of us troll, sometimes. (Not claiming you did btw, just that there was no name calling.)
> seL4 is the world’s fastest operating system kernel designed for security and safety
Linux is arguably not designed for security and safety but it blows seL4 out of the water when it comes to performance. There’s a reason it only gets used in contexts where security is critical; I would have expected that you would be aware of this considering you were the one who is promoting it.
Can I run Firefox or PostgreSQL on seL4? Or another real-world program of comparable complexity? And how does the performance of that compare to Linux or BSD?
That's really the only benchmark that matters; it's not hard to be fast if your kernel is simple, but simple is often also less useful. Terry Davis claimed TempleOS was faster than Linux, and in some ways he was right too. But TempleOS is also much more limited than Linux and, in the end, not all that useful – even Terry ran it inside a VM.
I've heard these sorts of claims about seL4 before and have tried to look up more detailed information, but I've never really found anything convincing on the topic beyond "TempleOS can do loads more context switches than Linux!" type stuff.
Xen is unfortunately large, and the full hypervisor runs privileged.
With seL4, VM exceptions are forwarded to VMM, which handles them.
From a security standpoint, a VM escape would only yield VMM privileges, which are no higher than that of the VM itself. This is much better than a compromise of Xen, which would compromise all VMs in the system.
Makatea[0] is an effort to build a Qubes-like system using seL4 and its virtualization support. It is currently funded by an NLnet grant.
Citation needed.
And by that I mean actual benchmarks of Linux doing the few tasks seL4 does, such as IPC or context switching, faster than seL4.
The multiserver architecture does indeed imply an elevated use of IPC, but it does in no way outweigh the difference in IPC cost.
In this model, data sharing, and the implied locking, is minimized, which as a consequence helps SMP scaling.
DragonFly, while not multiserver proper, took a different direction than FreeBSD and Linux by optimizing IPC and, instead of implementing fine-grained locks, favoring concurrent lockless and lock-free servers.
As a consequence, DragonFly scales much better than FreeBSD, and in many benchmarks manages to outperform Linux.
This is despite the tiny development team, particularly so when considered relative to the amount of funding these two systems get.
I am sickened by the effort that's being wasted on a model that we know is bad and does not work. Linux will never be high assurance, secure or scale past a certain point.
Fortunately, no matter how long it'll take, the better technology will win; there's no "performance hack" that a bad system can pull to catch up with the better technology once it's there.
Just a matter of time.