I've actually been hacking on a similar FOSS project lately, with a focus on building what I'm calling a layer 3 service mesh for the edge. More or less came out of my learned hatred for managing mTLS at scale and my dislike for shoving everything through a L7 proxy (insane protocol complexity, weird bugs, and you still have the issue of authenticating you are actually talking to the proxy you expect).
Last week I got the first release of the userspace router shipped, worth taking a look if you want to play around with a completely userspace and unprivileged WireGuard compatible VPN server.
https://github.com/noisysockets/nsh/blob/main/docs/router.md
https://github.com/google/gvisor/tree/go
go get gvisor.dev/gvisor/pkg/tcpip@go
The go branch is auto generated with all of the generated code checked in.
I don't know the status on those export tools these days as I left the company years ago, but perhaps they could sync with a different branch.
This would help various folks quite a bit, as for example tsnet users often fall into the trap of trying to do `go get -u`, which then pulls a non-functional gvisor version.
Unlike, say, GitHub Codespaces, running something like this on your own infra means your incentives and Coder.com's are aligned, i.e. both of you want to reduce your cloud costs (as opposed to, say, GitHub running on Azure gives them an opportunity and incentive to mark up on Azure cloud costs).
We’ve tried to align our pricing with the value of the product. In small teams the productivity gains seem to be much lower, so we target Enterprise!
But exfiltrating data with a userspace VPN is totally fine?
I'm also wondering why not use TLS.
> large multiple performance decrease per dollar spent
Gvisor helps you offer multi-tenant products which can be actually much cheaper to operate and offer to customers, especially when their usage is lower than a single VM would require. Also, a lot of applications won't see big performance hits from running under Gvisor depending on their resource requirements and perf bottlenecks.
Their performance documents you linked claim vs runc: 20-40x syscall overhead, half of redis' QPS, and a 20% increase in runtime in a sample TensorFlow script. Also google "CloudRun slow" and "Digital Ocean Apps slow", both are Gvisor.
Literally anything else.
But given this article is about improving gvisor's userland TCP performance significantly, it seems like the netstack stuff causes major performance losses too.
I saw a github link in another top article today https://github.com/misprit7/computerraria where the Readme's Pitch section feels very relevant to gvisor.
Google engs recently rewrote the GSO bit, but unlike Tailscale, it is only for TCP, though.
Besides, gvisor has had "software" & "hardware" GSO support for as long as I can remember.
This is approximately the case for any alternative IP stack you might pick though, a mature IP stack is a huge undertaking with all the many flavors of enhancements to IP and particularly TCP over the years, the high variance in platform behaviors and configurations and so on.
In general you should only take on a dependency of a lesser-used IP stack if you're willing to retain or train IP experts in house over the long haul, because as is demonstrated here, taking on such a dependency means eventually you'll find a business need for that expertise. If that's way outside of your budget or wheelhouse, it might be worth skipping.
I see an explanation in their blog about avoiding TUN devices since they require elevated permissions, but why would you need a TUN device to send data to/from an application? I can't understand what their product does from the marketing material but it doesn't look like it would require constructing raw IP packets instead of TCP/UDP packets and letting the OS wrap them in the other layers.
> we’d need a way for the TCP packets to get from the operating system back into Coder for encryption.
yes, this is commonly done via OpenSSL for example.
> This is called a TUN device in unix-style operating systems and creating one requires elevated permissions
waitasec, wut? sure you could use a TUN device I guess, but assuming some kind of multi-tenant separation is an underlying assumption they didn't mention in their intro, couldn't you also use cgroup'd containers? sorry if I'm not fluent in the terminology.
i'm struggling to understand the constraints that push them towards gVisor. simply needing to do encryption doesn't seem like justification. i'm sure they have very good reasons, but needing to satisfy a financial regulator seems orthogonal at best. i would just like to understand those reasons.
† I don't think? I didn't see them say that, and we do the same thing and we don't create raw sockets.
The reason you'd use WireGuard rather than TLS is that it allows you to talk directly to multiple services, using multiple protocols (most notably, things like Postgres and Redis) without having to build custom serverside "gateways" for each of those protocols.
And then you're suddenly in a whole world of pain because all of this is driven by a stack of byzantine certifications (half of which, as usual, are bogus, but that doesn't help you), and your network stack has none of them.
(Written from first-hand experience.)
Pretty much the only thing you can do is somewhat filter out known-bad, not directly motivated outbound traffic, such as malware payloads with very clear signatures. This only works if it's "not directly motivated", because as soon as there's a person who wants to do it, they can skirt around it again.
> We are committed to keeping your data safe through end-to-end encryption and to making Coder easy to run across a wide variety of systems from client laptops and desktops to VMs, containers, and bare metal. If we used the TCP implementation in the OS, we’d need a way for the TCP packets to get from the operating system back into Coder for encryption. This is called a TUN device in unix-style operating systems and creating one requires elevated permissions, limiting who can run Coder and where. Asking for elevated permissions inside secure clusters at regulated financial enterprises or top secret government networks is at best a big delay and at worst a nonstarter.
The specific part that’s unclear is why encryption needs to be applied at the TCP layer and at that point if they need it at the transport layer why they’re not using something like QUIC which has a much more mature user-space implementation.
I think the solution is an automatically exported repository at a different path. Kind of (or maybe exactly) like what Tailscale/bradfitz used to maintain.
> really difficult to keep the version of gVisor I was using up to date
For our project, we update gvisor whenever Tailscale does.
System call overhead does matter, but it’s not the ultimate measure of anything. If it were, gVisor with the KVM platform would be faster than native containers (looking at the runsc-kvm data point which you’ve ignored for an unknown reason). But it is obviously more complex than that alone. For example, let’s click down and ask — how is it even possible to be faster? The default docker seccomp profile itself installs an eBPF filter that slows system calls by 20x! (And this path does not apply within the guest context.) On that basis, should you start shouting that everyone should stop using Docker because of the system call overhead? I would hope not, because looking at any one figure in isolation is dumb — consider the overall application and architecture. Containers themselves have a cost (higher context switch time due to cgroup accounting, costs to devirtualize namespaces in many system calls, etc.) but it’s obviously worth it in most cases.
The redis case is called out as a worst case — the application itself does very little beyond dispatching I/O, so almost everything manifests as overhead. But if you’re doing something that has 20% overhead, you need hard security boundaries, and fine-grained multi-tenancy can lower costs by 80% it might make perfect sense. If something doesn’t work for you because your trade-offs are different, just don’t use it!
You give me too much credit! They were copy pastes to the same responder who responded to me in a few places in the thread. I did that to avoid spending too much time responding!
> because looking at any one figure in isolation is dumb
So the self-reported performance figures are bad, there are hundreds of web pages and support pages reporting slow performance and slow startup times from first-hand experience, there are Google hosted documentation pages about how to improve app performance for cloudrun (probably the largest user and creators of Gvisor, can I assume they know how to run it?) including gems like "delete temporary files" and a blog post recommending "using global variables" (I'm not joking). And the accusation is "dumb" cherry-picking? Huh?
Also, if I'm not wrong, CloudRun is GCP's main (only? besides managed K8s) PaaS container runtime. Presenting it as a general container runtime with ultra fast scaling when people online are reporting 30 second startup times for basic python/node apps is a joke. These tradeoffs should also be highlighted somewhere in these sales pages, but they're not.
This is the last I'm responding to this thread. Also my apologies to the Coder folks for going off topic like this.
IIRC, CloudRun has multiple modes of operation (fully-managed and in a K8s cluster) and different sandboxes for the fully-managed environment (VM-based and gVisor-based). Like everything, performance depends a lot on the specifics. For example, network performance depends a lot more on the network path (e.g. are you using a VPC connector?) than it does on the specific sandbox or network stack (i.e. if you want to push 40 Gbps, spin up a dedicated GCE instance). Similarly, the lack of a persistent disk is a design choice for multiple reasons: if you need a lot of non-tmpfs disk or persistent state, CloudRun might not be the right place for the service.
It sounds like you personally had a bad experience or hit a sharp edge, which sucks and I empathize — but I think you can just be concrete about that rather than projecting with system call times (I’d be happy to give you the reason gen1 sandbox would be slow for a typical heavy python app doing 20,000 stats on startup — and it’s real but not really system calls or anything you’re pointing at,… either way you could just turn on gen2 or use other products, e.g. GCE containers, GKE autopilot, etc.).
I’m not sure what’s wrong with advice re: optimizing for a serverless platform (like global variables). I don’t really think it would be sensible to recompute/rebuild application state on any serverless platform on any provider.
Have we proven they're not secure and safe? Have we broken out of containers yet? Heroku was running LXC for years before docker, did they run into major security woes (actually curious)?
If "secured shared environments" is a more specific term meaning "multi user unix environment", I didn't intend to say that.
Though you already mentioned my whole thread is a bit off topic to this post (and I sorta agree) but then baited me with this comment after. I'm happy to drop it and wait for a Gvisor container runtime thread.
If by virtualization you mean VMs, gvisor can be more performant than those based on my experience. For example, AWS claims a p0 coldstart time of ~500ms using Firecracker but I know firsthand that applications sandboxed by gvisor can be made to cold start in significantly less time (like less than half): https://catalog.workshops.aws/java-on-aws-lambda/en-US/03-sn..., and you should be able to confirm this yourself by using products that leverage Gvisor under the hood or with your own testing. I actually worked on this (using gvisor, but working on adjacent tech) for years...
> Have we broken out of containers yet?
Sure, how about https://scout.docker.com/vulnerabilities/id/CVE-2024-21626 where runc (Docker) exposed the host filesystem to containerized applications? Precisely the kind of exploit gvisor is designed to prevent.
I'll note that a lot of people are thinking about how to reduce sandbox overhead in multitenant PaaS and it's one of the things I want to eventually address in my own startup. But I think blindly hating on gvisor because of a nebulous dislike of overhead really is misplaced without considering its alternatives.
Portable is a bit of a weird word here because for many of us with gray beards the word means architectures, kernels and systems, but I think in this context it tends to more mean "can run just as easily on my macbook as in a cloud container", but in practice the software isn't that portable, as Go isn't that portable - at least not in the context of vs. a niche C "portable network stack" that would build roughly anywhere that there's a working C toolchain, which is almost everywhere.
Constant security fixes for the kernel are a real pain in deployments unless you follow upstream kernels closely. If your business is in shipping Linux runtimes with a high packing density, you really need to find ways to minimize the exposed Linux surface area, or organize to be able to ship kernel upstream updates at an extremely high frequency (relative to normal infrastructure upgrade rates for kernels / mandatory reboots) (and I would not consider kexec safe in this kind of context, at all).
An alternative approach might be firecracker / microvms and so on, but those have their own tradeoffs too. The core point is that you want more than one layer between the host machines and the user code that wants to interact with Linux features.
I fail to see what "risky surface areas" in the kernel you're avoiding. You have more packets going through the kernel network stack (since you're wrapping a TCP connection in a UDP connection that goes through the kernel) than you would by just using the TCP stack in the kernel. Are you saying that the TCP stack in the kernel cannot be trusted, but a userspace kernel you maintain can? (That's a bit ridiculous...)
> can run just as easily on my macbook as in a cloud container
Any POSIX C code that listens on non-privileged ports will run on machines with the correct glibc version(and you can statically compile the glibc or not need it like go does). This includes linux and macOS(and if you're using a library that's on multiple OSes you get even more support without having to implement TCP in userspace).
> Constant security fixes for the kernel are a real pain in deployments unless you follow upstream kernels closely.
I don't think you understand. You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.
> An alternative approach might be firecracker / microvms and so on
Wouldn't an alternative approach just be to use cross-platform libraries and non-privileged ports?
> The core point is that you want more than one layer between the host machines and the user code that wants to interact with Linux features.
You just said the opposite... how can more things requiring security fixes be a bad thing, while you arbitrarily want more layers between you and the most security tested code for networking available to you.
Yes that’s exactly right. It’s not ridiculous. Netstack is written in a GC’d language which alone eliminates several categories of vulnerabilities that exist in the kernel. But more important than that is that it’s in USERSPACE. So even if you do compromise gVisor netstack the best you have is the capabilities that any other normal process has. Compare that to the kernel vulnerabilities where you potentially have cracked root.
> You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.
The TCP stack is at least an order of magnitude more complex than UDP and has a correspondingly much higher number of bugs filed against it. Only relying on UDP is a security win.
There's a constant stream of bugs in kernel network and IO interfaces, many of which require direct local interaction for exploitation, and aren't remotely attackable. Don't assume, spend a few hours and have a read through some.
> Any POSIX C code that listens on non-privileged ports will run on machines with the correct glibc version(and you can statically compile the glibc or not need it like go does). This includes linux and macOS(and if you're using a library that's on multiple OSes you get even more support without having to implement TCP in userspace).
That doesn't get anywhere near the use case here which is: run third party user supplied code unmodified.
> I don't think you understand. You're still at the mercy of the kernel for security patches to the UDP stack, you're just now also having to maintain a TCP stack in parallel.
The surface is not "UDP" and "TCP", this view is a huge distortion. As I suggested above, have a read through some of the relevant bugs over the last two years, and consider their implications in the relevant use case: running unmodified third party user code on a system.
> Wouldn't an alternative approach just be to use cross-platform libraries and non-privileged ports?
No, again, that doesn't meet the use case: run unmodified third party user code on the system.
> You just said the opposite... how can more things requiring security fixes be a bad thing, while you arbitrarily want more layers between you and the most security tested code for networking available to you.
Your characterization of Linux further suggests the exercise above would be a great experience.
> I want to eventually address in my own startup.
You worked on CloudRun and their performance is dogshit. Seriously, google it: there's like 100 Stack Overflow questions on the subject. It's a common enough query that Google even suggests follow-up questions like "Why is cloud run so slow?".
Now your answer might be "avoid syscalls", "don't do anything on the file system (oh by the way your file system is memory mapped hehe)", "interpreters can be slow to load their code, sorry", "look at these charts it's not as bad as you say", "tcp overhead is only 30%", etc but your next set of customers won't have the same vendor lock-in you enjoyed at Google.
Then do the same query for "Digital Ocean Apps slow", also gvisor. And bam you'll have a long list of customers ready to use your better version! Perhaps Google and Digital Ocean will enlist your expertise (again).
The netstack stuff here has nothing to do with the rest of gVisor.
How so? Besides being part of it, it is at least similar in the group of "bloated slow userland implementation of things the kernel handles well"
The gVisor/perf thing is a tendentious argument. You can have whatever opinion you like about whether running a platform under gVisor supervision is a good idea. But the post we're commenting on is obviously not about gVisor; it's about a library inside of gVisor that is probably a lot more popular than gVisor itself.
You'll note their node/ruby benchmarks showed a substantially bigger performance hit. That's because the other gvisor sandboxing functionality (general syscall + file I/O) has more of an impact on performance, but also because these are network-processing bound applications (rare) that were still reaching high QPS in absolute terms for their respective runtimes (do you know many real-world node apps doing 350qps-800qps per instance?).
Because coder is not likely to be bottlenecked by CPU availability for networking, the resource overhead should be inconsequential, and what's really important is the impact on user latency. But that's something likely on the order of 1ms for a roundtrip that is already spending probably 30-50ms at best in transit between client and server (given that coder's server would be running in a datacenter with clients at home or the office), plus the actual application logic overhead which is at best 10ms. And that's very similar to a lot of gvisor netstack use cases which is why it's not as big of a deal as you think it is.
TLDR: For the stuff you'd actually care about (roundtrip latency) in the coder usecase the perf hit of using gvisor netstack should be like 2% at most, and most likely much less. Either way it's small enough to be imperceivable to the actual human using the client.
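To make that back-of-the-envelope estimate concrete, here is a minimal sketch that just plugs in the rough figures from the comment above (about 1ms of added netstack latency, 30-50ms of transit, 10ms of application logic). These are assumed ballpark numbers, not measurements, and `overheadPercent` is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// overheadPercent returns the added latency as a percentage of the baseline.
func overheadPercent(addedMs, baselineMs float64) float64 {
	return addedMs / baselineMs * 100
}

func main() {
	const netstackMs = 1.0 // assumed extra roundtrip cost of the userspace stack
	const appMs = 10.0     // assumed best-case application logic time

	// Assumed best-case transit times between client and server.
	for _, transitMs := range []float64{30, 50} {
		baseline := transitMs + appMs
		fmt.Printf("transit %.0fms: +%.1f%% roundtrip latency\n",
			transitMs, overheadPercent(netstackMs, baseline))
	}
}
```

With these inputs the relative hit stays in the low single digits, and it shrinks further as transit or application time grows.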
but gvisor was using full runsc for the networking benchmarks I linked, and IIUC runc's networking should be sufficiently similar to unsandboxed networking that I believe runsc<->runc network performance difference should approximate gvisor netstack<->vanilla kernel networking.
But after I left, I heard that a lot of the poor performance of Cloud Run is just plain old oversubscribed shared core e2 stuff.
I'll fully grant that that seems to be the norm for everything browser related. Policies made it difficult to install new software, so you just point your browser to this URL and call it a day.
Arguably, this basic phenomenon has been going on for 20+ years. A lot of people by 2005-2007 or so had come to believe (and probably correctly) that a lot of the impetus for adopting SOAP based web-services over the preceding few years was simply because everything ran over ports 80 and 443, which were already open in the firewall. So deploying a remote service this way was more tractable than submitting a request to allow access to yet another port in the firewall, and dealing with the inevitable bureaucratic nightmare of getting that approved.
https://www.kernel.org/doc/Documentation/networking/ip-sysct...
To answer the upstream question about why arbitrary outbound connections are allowed, they're not. This is connecting to a cloud development environment, and I would have to assume this service can be self-hosted, because on a classified network, the "cloud" isn't the cloud as Hacker News readers know it. Amazon et al. run private data centers on US military installations that only the military and the IC can access and they're airgapped from the Internet. If you're on a workstation that can access this environment, that's all it can access. The only place you can exfiltrate data to is other military-controlled servers.
Interesting to dismiss it as such. The gvisor netstack is a (big) part of gvisor and this article is discussing how the performance of that component was, and could well still be, garbage.
These tools bring marginal capability and performance gains, shoved down people's throats by manufacturing security paranoia. Oh and it all happens to cost you like 10x time, but look at the shiny capabilities, trust me it couldn't be done before! A netsec and infra peddler's wet dream.
The article and a related GitHub discussion (linked from TFA) point out that the default congestion control algorithm (reno) wasn't good for long-distance (over Internet) workloads. The gvisor team never noticed it because they test/tune for in-datacenter use cases.
> These tools bring marginal capability and performance gains
I get your point (ex: app sandbox in Android ruins battery & perf, website sandbox on chrome wastes memory, etc). While 0-days continue to sell for millions, opsec are right to be skeptical about a very critical component (kernel) that runs on 50%+ of all servers & personal devices.
So the PaaS providers mentioned in that comment should be assumed to be compromised?
As I understand the only reason you'd use a TUN interface is if you want to send/receive raw IP packets. Their marketing doesn't make it very clear what their product does, but I can't see a reason it would need to send/receive raw IP packets rather than TCP/UDP packets over a specific port...
I surmise that the reason might be that a user space tunnel might be faster (like maybe they can do UDP over TCP or something to gain speed improvements).
Good post nevertheless.
If your C doesn't fight the scheduler it isn't that bad.
On a goroutine not locked to an OS thread (the default), don't take more than 1 microsecond in a single C call. If you need to take longer in C, lock the goroutine to an OS thread (runtime.LockOSThread), but then don't do things in Go that would park that goroutine (time.Sleep, blocking channel read/write, etc).
1. An argument that a tool using netstack is in any way tainted with gVisor's runtime costs.
2. An argument that shared-kernel multitenant is tenable and thus gVisor addresses no meaningful security concerns.
If you're talking about production machines, a userspace application wouldn't be able to sniff privileged ports without elevated permissions, so I fail to see how this application would let you get around that limitation.
Postgres and Redis can use non-privileged ports, so I don't understand why this would matter.
If you're running on a system you don't administrate that has ports under 1024 set as privileged, there's no way(with or without your cli) to have a userspace program receive TCP or UDP packets coming into the kernel from external devices for these ports(unless I'm completely mistaken).
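For anyone who wants to check how their own machine treats low ports, here is a small sketch using ordinary kernel sockets from Go's standard library. `tryBind` is a hypothetical helper, and the two addresses are arbitrary examples. The result depends on the OS, the user, and (on Linux) settings such as `net.ipv4.ip_unprivileged_port_start`, so no particular output is assumed.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// tryBind attempts to open a TCP listener on addr and closes it right away.
// It reports whether the bind succeeded and, if not, whether the failure was
// a permission error (as opposed to e.g. the port already being in use).
func tryBind(addr string) (ok bool, denied bool) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return false, errors.Is(err, os.ErrPermission)
	}
	ln.Close()
	return true, false
}

func main() {
	// Port 80 is in the traditionally privileged range; 8080 is not.
	for _, addr := range []string{"127.0.0.1:80", "127.0.0.1:8080"} {
		ok, denied := tryBind(addr)
		fmt.Printf("%s bound=%v permissionDenied=%v\n", addr, ok, denied)
	}
}
```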
What can you accomplish with "user-mode TCP/IP" that you can't from userspace with system calls?
With this CLI I am able to listen for external packets to port 80 from userspace without any elevated permissions and intercept traffic that's going to an application that's bound to that port on the OS?
Edit: I think I understand what you're trying to do, but if I do then traffic is going from the kernel UDP stack to the userland TCP stack, back to the UDP kernel stack. Not sure how that avoids sending the packet to the kernel. If it's to get around the port restrictions, why can you not just use unprivileged ports?
Also are we just ignoring that you pretended VMs were expensive to run? Most of your responses sound devoid of a lot of fundamental computer knowledge(networking and otherwise).
I understand that, I just don't understand any case where that's desirable...
We have 2^16 ports available to applications (and a special `0` port that can be used to request any port) on a single IP (which is usually shared between multiple machines). I have never heard of a case where 2^16 ports is not enough ports for the number of applications that need to be listening.
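As a quick illustration of the port `0` behavior with plain kernel sockets, here is a minimal sketch in Go. `ephemeralPort` is a hypothetical helper name. It binds to port 0 on loopback, lets the kernel choose a free port, and prints the result next to the largest value a 16-bit TCP/UDP port field can hold.

```go
package main

import (
	"fmt"
	"math"
	"net"
)

// ephemeralPort binds a TCP listener to port 0 on loopback, which asks the
// kernel to pick any free port, and returns the port that was assigned.
func ephemeralPort() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	return ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := ephemeralPort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("kernel assigned port:", port)
	fmt.Println("highest possible port:", math.MaxUint16)
}
```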
> To the OS, it's all just ordinary socket code.
Which is what I don't understand. Why not just use ordinary socket code without all of these additional LoC in between that open you up to more bugs (security and functionality)?
You can't if your organization prevents you from doing so, for example.
You don't want to if you follow strict rules which are not enforced by the OS, again for example.
I'll offer the same caveat here, btw, I am not trying to torpedo the idea of trying this. I'm genuinely curious why you would need to do this. Not necessarily why you would want to.
* Do user-mode WireGuard (and thus TCP/IP) and talk "natively" to the infrastructure deployed on our platform.
* Write case-by-case application gateways for each of those pieces of infrastructure tunneled somehow through HTTP.
And if you don't want to, that feels misguided?
Granted, my old recollection was largely that the "privileged" ports were that way because they were blessed by the routing tables, at the time. The entire point was that the lower ports were expected to be connectable to external machines. Not shocking if I am out of date there.
I should hasten to add that I am not offering this as reason this shouldn't be done.
(This is also a new use of "moot" to me? You seem to be offering it as a synonym of obsolete? But a "moot" debate is one that is closer to "overcome by events" than one that is not relevant. Right?)
Respectfully, if at this point the situation hasn't been made clear to you, I don't think there's much more to productively discuss.
My impression from something said elsewhere was that this was largely for internal tools. I'm not sure why I got that impression, though.
I think it is fair, btw, that I would be pushing for both paths, at this point? If a long standing network policy rule has become obsolete from advance, it is worth considering dropping it? Is that not something people are looking at?
(I will also note that I will not be at all offended if you drop out from lack of interest here. Apologies if you feel I was wasting your time!)
Some potential options for "what is it for" come up, and others bring up reasons why they don't make sense.
It seems this is a solution to a very specific problem that nobody seems to have, which is why when people are trying to figure out what problem it solves they're coming up with 10 better solutions.