You gave me a u32. I gave you root. (io_uring ZCRX freelist LPE) (ze3tar.github.io)
But still might be an open threat. On the email thread, Jens seems to think that this is already patched and in stable. He also points out that for this exploit to work (as written in the article) you already need escalated privileges [2]. Catchy title though.
[1] https://snailsploit.com/security-research/general/io-uring-z...
Slowly at first, and then suddenly. AI assisted anything follows this trend. As capabilities improve, new avenues become "good enough" to automate. Today is security.
C code is broken - period
Am I reading this wrong or is this just a way of executing an arbitrary binary with uid=0 if you have both CAP_NET_ADMIN and CAP_SYS_ADMIN?
If you can write modprobe_path, is it really news that you can find a way to execute code?
Almost all distros allow unprivileged user namespaces, and in my opinion this is the right decision, because they're important for browser sandboxing which I think is more important than LPEs.
static markdown version: https://raw.githubusercontent.com/ze3tar/ze3tar.github.io/9d...
That said, putting stuff in a docker container is kinda a light lift that cuts a bunch of attack surface.
And sorry, but I am ... frustrated by this. Why do my Debian 11 servers (currently upgrading, yes) have support for phone infrastructure from the 90s (ATM), or really obscure file systems like "Andrew File System", or support to run IP across amateur radios (AX.25) by default? We recently joked that we should start a pot you add a euro to whenever you find ancient discontinued tech you never heard of that our systems support, so we can have a nice dinner after this.
I do understand that going full Gentoo or Arch as a generally available distro is not feasible. I am also personally intimidated by compiling my own kernel with just what we need. But the amount of strange ancient things supported by default is also quite ridiculous.
Copy Fail [1]
Copy Fail 2: Electric Boogaloo [2]
Dirty Frag [3]
And now this...
[1]: https://copy.fail
[2]: https://github.com/0xdeadbeefnetwork/Copy_Fail2-Electric_Boo...
This one is a level less severe.
The title looks like clickbait to me.
[^0]: https://www.openwall.com/lists/oss-security/2026/05/08/10
[^1]: https://www.openwall.com/lists/oss-security/2026/05/08/14
clang -fbounds-safety ...
also see lib0xc etc.: https://news.ycombinator.com/item?id=47978834
This seems on the low impact end of the numerous historical io_uring issues.
Interesting and important all the same.
Is it considered good practice to publish a vulnerability not yet patched in any stable branch?
Joke aside, we'll see more CVEs in the coming months, and in a sense that's good: it leaves less maneuvering room for bad actors (especially those selling them to the highest bidder).
Can we make sandboxing the new default now? Flatpak does a good job, but we're still pretty far away for apt/yum/pacman installed packages. AppArmor was a decent step forward, but clearly not enough.
Linux is falling apart faster than it can assign these CVEs.
It is very important that you realize that any capability is a slice of superuser privileges, and there are no implicit protections, only explicit additional constraints that restrict it in reference to root.
Look at the bounding set for a normal user on a fresh install of rhel/debian based systems:
$ grep ^Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
Note how trivial it is to gain all of those capabilities:
$ podman unshare
# grep ^Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
# capsh --decode=000001ffffffffff
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
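As a rough illustration of what the decode step above does with that mask, here is a minimal Rust sketch. It assumes the bit order matches the capsh output shown (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore). The names CAP_NAMES and decode_caps are made up for this example and are not part of any library.

```rust
// Capability names in bit order, copied from the capsh output above.
const CAP_NAMES: [&str; 41] = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
];

/// Hypothetical helper: turn a hex mask from /proc/<pid>/status into names.
fn decode_caps(hex: &str) -> Result<Vec<&'static str>, std::num::ParseIntError> {
    let mask = u64::from_str_radix(hex.trim_start_matches("0x"), 16)?;
    Ok(CAP_NAMES
        .iter()
        .enumerate()
        // Keep a name only if its bit is set in the mask.
        .filter(|&(bit, _)| (mask & (1u64 << bit)) != 0)
        .map(|(_, name)| *name)
        .collect())
}

fn main() {
    let full = decode_caps("000001ffffffffff").expect("valid hex");
    println!("{} capabilities set", full.len());
    println!("has cap_sys_admin: {}", full.contains(&"cap_sys_admin"));

    let none = decode_caps("0000000000000000").expect("valid hex");
    println!("{} capabilities set", none.len());
}
```

With the all-zero CapEff of the normal user nothing is listed, while the mask seen inside the user namespace sets every one of the 41 bits.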
The capabilities(7)[0] man page will help you with all of those.
But capabilities are just a thread local segmentation, which grants superuser or root rights in a vertical segmented fashion.
True, if a mechanism chooses to do additional tests based on credentials(7)[1], you can run with those elevated privileges in a lower bound, but that requires implicit coding.
Add in that LSMs are suffering from both resources and upstream teams that won't provide guidance or are challenging to work with, and there are literally a hundred commands to either abuse or just ld_preload to get unrestricted userns, allowing you to get around basic controls on clone()/unshare() that may be implemented.
$ grep -ir "userns," /etc/apparmor.d/ | wc -l
100
With apparmor every single browser (firefox,chrome,msedge,etc...) as well as busybox, slack, steam, visual studio, ... all have the unrestricted user namespaces and the ability to gain the FULL set of capabilities in the bounding set.
If you run `busybox` on a debian system, note how it has nsenter and unshare, so you can't mask those and yet busybox itself is unconstrained with elevated privileges.
The TL;DR point being, don't assume that any capability() is in itself a gate, as there are so many ways even for the user nobody to gain them.
[0] https://man7.org/linux/man-pages/man7/capabilities.7.html
[1] https://man7.org/linux/man-pages/man7/credentials.7.html
2. Most sandboxes (including Docker and Podman) disable creating unprivileged user namespaces inside them via seccomp. In this mode, you end up with a more secure setup than requiring a privileged process to spawn containers (for one, it massively reduces the risk of confused deputy attacks against container runtimes). You can also restrict it with ucounts (as rough of a system as that is).
3. The kernel provides this facility and the feature was added back in early 2013 (before Docker existed and long before they added user namespace support, let alone rootless containers), so I don't understand why you think this is somehow the fault of OCI? We're just making something useful out of existing kernel infrastructure. Folks have asked the kernel to provide a knob to disable unprivileged user namespaces but the maintainer has refused to do so for years (the best you get is ucounts and seccomp). I would also prefer to have such a knob (or even adding a separate ucount with configurable per-user limits) but it's not up to me.
(Disclaimer: I implemented rootless containers for runc back in the day and work on OCI, so I do have some bias here.)
On macOS you can try it with:
clang -Xclang -fbounds-safety program.c
Microsoft also seems to be using it (see above link regarding lib0xc).
Linus' law was wrong because there were never enough (qualified) eyeballs to check the code. LLMs provide an ample supply of eyeballs (though it's not a benefit to open source, since proprietary developers can use the same LLMs).
Thanks to agents and tool calling, there are now business cases that can be fully described by AI tooling, the next step in microservices, serverless and what not.
Naturally with a much smaller team than what was required previously.
Agents are capable of finding this kind of stuff now and people are having a field day using them to find high-profile CVEs for fun or profit.
Some codepaths do ns_capable() (must have capability in owning namespace, reachable via unprivileged user namespaces), some do capable() (must have capability in host user namespace, not reachable via user namespaces at all).
ZCRX can only be enabled by passing capable(CAP_NET_ADMIN), so you need to be privileged on the host.
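The difference between the two checks can be modelled roughly as follows. This is a toy Rust model, not kernel code: it ignores the rule that the owner of a child namespace holds all capabilities in it, and it ignores LSM hooks. The types UserNs and Cred and the constant INIT_NS are made up for the sketch. Only the function names mirror the kernel's.

```rust
const CAP_NET_ADMIN: u32 = 12;
const INIT_NS: usize = 0; // index of the host (initial) user namespace

/// Toy user namespace: only the initial namespace has no parent.
struct UserNs {
    parent: Option<usize>,
}

/// Toy credentials: the namespace the task lives in plus its effective set.
struct Cred {
    user_ns: usize,
    effective: u64,
}

/// Walk from the target namespace up towards the root. The capability counts
/// only if we reach the task's own namespace and the bit is set there.
fn ns_capable(tree: &[UserNs], cred: &Cred, target: usize, cap: u32) -> bool {
    let mut ns = Some(target);
    while let Some(cur) = ns {
        if cur == cred.user_ns {
            return (cred.effective & (1u64 << cap)) != 0;
        }
        ns = tree[cur].parent;
    }
    false
}

/// capable() is the same check against the initial namespace.
fn capable(tree: &[UserNs], cred: &Cred, cap: u32) -> bool {
    ns_capable(tree, cred, INIT_NS, cap)
}

fn main() {
    // Namespace 0 is the host, namespace 1 was created by an unprivileged user.
    let tree = [UserNs { parent: None }, UserNs { parent: Some(INIT_NS) }];

    // "Root" inside the new namespace: full effective set, but only in ns 1.
    let userns_root = Cred { user_ns: 1, effective: 0x1ff_ffff_ffff };
    println!("ns_capable in own ns: {}", ns_capable(&tree, &userns_root, 1, CAP_NET_ADMIN));
    println!("capable on host:      {}", capable(&tree, &userns_root, CAP_NET_ADMIN));

    // Real root on the host passes both checks.
    let host_root = Cred { user_ns: INIT_NS, effective: 0x1ff_ffff_ffff };
    println!("host root capable:    {}", capable(&tree, &host_root, CAP_NET_ADMIN));
}
```

In this model the namespace "root" passes the ns_capable() check for its own namespace but fails capable(), which is the distinction the comment above draws for ZCRX.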
If a kernel feature is gated on cap_sys_admin only, it doesn't matter at all what namespace it is in. Namespace support or additional constraints are not implicit and have to be added to each need.
People misunderstanding this is partially why we have this latest crop of vulnerabilities.
The capable()[0] function operates as one would expect for granting superior capabilities, and while the work to expand the isolation is something I am sure you are familiar with, you probably also realize that the number of entries in a default user also expanded just to support user namespaces.
But to be clear, while the choices that docker/oci made are understandable from a local greedy choice perspective, they complicate the entire user space.
K8s mutating admission controllers are a symptom of those choices.
Had a CRI contained a bounding set, enforced at a system level, especially with guidance and tools for users to use a minimal set, which they could expand on easily we would be in a better spot.
But as other projects cannot provide meaningful protections that cannot be simply bypassed by calling privileged CRIs it is also a barrier to convincing them to do the same.
Really there is a larger problem that OCI could be the leader on, but they are the ‘killer app’ and refuse to do so.
The bounding set for user capabilities is driven by containers, and while namespaces are not and never have been a security feature, this blocks their ability to have a strong security posture.
To be clear, expecting every end user to write minimal seccomp profiles is unrealistic, especially when docker prevents devs from accessing the local machine to discover what is happening. I think podman is the only machine that allows that by default.
Basically, while simplifying moby/containerd/CRI is an understandable choice, the refusal to address the costs of that local optimization has fallout.
[0] https://elixir.bootlin.com/linux/v7.0.5/source/kernel/capabi...
1. Pick a file to seed as a starting place.
2. Ask the LLM (in an agent harness) to find a vulnerability by starting there.
3. If it claims to have found something, ask another one to create an exploit/verify it/prove it or whatever.
4. If both conclude there is a vuln, then with the latest models you almost certainly found something real.
Just run it against every file in a repo, or select a subset, or have an LLM select files with a simple "what X files look likely to have vulns?".
So basically yes, it is that simple. It's just a matter of having the money to pay for the tokens.
Given Windows' absurd amount of backwards compatibility, chances are pretty high that there are a lot of sleeping dragons buried inside even modern Windows 10/11 kernel and userland that date back to code and issues from the 90s - code where half the people who have worked on it probably not just have departed Microsoft but departed living in the meantime.
Also contrary to Linux, Windows 11 (optional on W10) uses sandboxing for kernel and drivers.
Ever since Windows XP SP2, Windows has kept getting mitigations, and Microsoft has security teams whose day job is to attack Windows.
They are also promoting using CoPilot for C and C++ code review for some time now.
While it won't stop all attacks, it is better than the whole "UNIX is safer than Windows" attitude from the 90's. It turns out it is a matter of how much money is put into it.
If you want really safe above anything else, look into Qubes OS with its sandboxing over everything, or mainframe systems like Unisys ClearPath MCP, with NEWP as systems language, and managed environments.
Rust is bounds checked by default. C is not. Defaults matter because, without a convincing reason, most people program in the default way.
Possible problems within a function should be discoverable.
This particular bug would be hard to discover for a typical linter unless they knew/remembered that there are two execution paths for cleanup of a given element.
Also, nice Onion reference by OP.
see https://scan.coverity.com/projects/linux for the linux-specific scan results - you need to create an account to view the reported defects.
This past couple of weeks isn't a good look for them with the releases of defects found in Linux and Firefox.
There are other free ones, I don't know if they're run as a matter of course.
Also unsafe rust doesn't remove bounds checks. arr[idx] is bounds checked in every context.
You can opt out of array bounds checking by writing unsafe { arr.get_unchecked(idx) }. But that's incredibly rare in practice.
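A small sketch of the three cases mentioned above, using only the standard library. The function names are made up for the example, and catch_unwind is there only so the demo can observe the panic without aborting.

```rust
use std::panic;

/// Plain indexing inside an `unsafe` block is still bounds checked.
/// The block changes nothing here, hence the lint allowance.
#[allow(unused_unsafe)]
fn index_inside_unsafe(arr: &[u32], idx: usize) -> u32 {
    unsafe { arr[idx] }
}

/// Checked access that returns None instead of panicking.
fn checked_read(arr: &[u32], idx: usize) -> Option<u32> {
    arr.get(idx).copied()
}

fn main() {
    let arr = [10u32, 20, 30];
    let past_end = arr.len(); // one past the last valid index

    // Checked access reports the problem as a value.
    assert_eq!(checked_read(&arr, past_end), None);

    // Indexing panics on the out-of-range index, even inside `unsafe`.
    let result = panic::catch_unwind(|| index_inside_unsafe(&arr, past_end));
    assert!(result.is_err());

    // Explicit opt-out. SAFETY: index 0 is in range for a 3-element array.
    let first = unsafe { *arr.get_unchecked(0) };
    assert_eq!(first, 10);

    println!("all three behaviours observed");
}
```

The panic message from the out-of-range index is printed to stderr when this runs. The process keeps going only because of catch_unwind.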
[1] https://cs.stanford.edu/~aozdemir/blog/unsafe-rust-syntax/
Based on the raw number of assorted crates, which has no bearing on kernel code. The more relevant question is, can a performant, cross-architecture, kernel ring-buffer be written in safe Rust?
Really? Why? I've not used Rust outside of some fairly small efforts, but I've never found a reason to reach for unsafe. So why is "nearly everyone" else using it?
It’s always a way lower number than folks assume. Even in spaces that have higher than average usage.
The entirety of safe Rust is built upon unsafe Rust that's abstracted like this. The fact that you sometimes need unsafe isn't a mark against Rust, but literally the entire premise of the language and the exact problem it's designed to solve.
This is something a lot of people misunderstand about unsafe rust. The safe / unsafe distinction isn't at the crate level. You don't say "this entire module opts out of safety checks". Unsafe is a granular thing. The unsafe keyword doesn't turn off the borrow checker. It just lets you dereference pointers (and do a few other tricks).
Systems code written in rust often has a few unsafe functions which interact with the actual hardware. But all the high level logic - which is usually most of the code by volume - can be written using safe, higher level abstractions.
"Can all of io_uring be written in safe rust?" - probably not, no. But could you write the vast majority of io_uring in safe rust? Almost certainly. This bug is a great example. In this case, the problematic function was this one:
static void io_zcrx_return_niov_freelist(struct net_iov *niov)
{
    struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
    spin_lock_bh(&area->freelist_lock);
    area->freelist[area->free_count++] = net_iov_idx(niov);
    spin_unlock_bh(&area->freelist_lock);
}
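For comparison, here is a user-space sketch of the same push in safe Rust. It assumes, as the thread suggests, that the problem is free_count advancing past the end of freelist when an element is returned more than once. Area, FreeList, FreelistFull and return_niov are hypothetical stand-ins rather than the kernel's types, and a std Mutex replaces the spinlock.

```rust
use std::sync::Mutex;

/// Fixed-capacity list of free indices, guarded by the lock in Area.
struct FreeList {
    slots: Box<[u32]>,
    free_count: usize,
}

/// Hypothetical stand-in for the kernel's area structure.
struct Area {
    freelist: Mutex<FreeList>,
}

#[derive(Debug, PartialEq)]
struct FreelistFull;

impl Area {
    fn new(capacity: usize) -> Self {
        Area {
            freelist: Mutex::new(FreeList {
                slots: vec![0u32; capacity].into_boxed_slice(),
                free_count: 0,
            }),
        }
    }

    /// Push an index back onto the freelist. A push past capacity becomes
    /// an error value instead of an out-of-bounds write.
    fn return_niov(&self, idx: u32) -> Result<(), FreelistFull> {
        let mut fl = self.freelist.lock().unwrap();
        let count = fl.free_count;
        let slot = fl.slots.get_mut(count).ok_or(FreelistFull)?;
        *slot = idx;
        fl.free_count = count + 1;
        Ok(())
    }

    fn free_count(&self) -> usize {
        self.freelist.lock().unwrap().free_count
    }
}

fn main() {
    let area = Area::new(2);
    assert_eq!(area.return_niov(0), Ok(()));
    assert_eq!(area.return_niov(1), Ok(()));
    // A third return (for example a double return of index 1) is rejected.
    assert_eq!(area.return_niov(1), Err(FreelistFull));
    assert_eq!(area.free_count(), 2);
    println!("overflowing return rejected");
}
```

The bounds check does not fix the underlying logic error of returning the same element twice. It only turns the overflow into a detectable failure rather than memory corruption.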
At a glance, this function absolutely could have been written in safe rust. And even if it was unsafe, array lookups in rust are still bounds checked.
So the vast majority of Rust projects involve writing at least one unsafe block? Is that really your claim?
How do you know the unsafe operation is safe? What are the preconditions the code block has? Write it down, review it, test it.
I like type checking and other compile time checks, but sometimes they feel very ceremonial. And all of them are inference based, so they still rely on the axioms being right and on the chain of rules not being broken somewhere. And in the end they are annotations, not the runtime algorithm.
Yes, which is precisely why I write in Rust, because the compiler errs less than I do.
On the other hand, when you're relying on your ability to "actually write quite good C code"...you'd better hope that you have not made an error there. In practice, some of the most widely used C libraries in the world still seem to have bugs like this, so I don't really understand why you'd think that's a winning strategy.