Achieving 5M persistent connections with Project Loom virtual threads (github.com)

309 points by genzer 4 years ago | 145 comments

I think a lot of people are missing the point.

Go look at the source code. Look at how simple it is - anyone who has created a thread in Java knows what's happening. With only minor tweaks, your pre-existing code can take advantage of this with basically no effort. And it retains all the debuggability of traditional Java threads (i.e., a stack trace that makes sense!)

If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.

You no longer have to worry about any of that.
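To make the "minor tweaks" concrete, here is a minimal sketch of the same task started on a platform thread and on a virtual thread. It assumes JDK 21 (or JDK 19/20 with preview features enabled); the class and method names are made up for illustration and are not taken from the linked repository.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ThreadTweak {
    // Runs one task and reports whether it executed on a virtual thread.
    static boolean runsOnVirtualThread(boolean useVirtual) {
        AtomicBoolean wasVirtual = new AtomicBoolean();
        Runnable task = () -> wasVirtual.set(Thread.currentThread().isVirtual());

        Thread thread;
        if (useVirtual) {
            thread = Thread.ofVirtual().unstarted(task); // the Loom variant
        } else {
            thread = new Thread(task);                   // the classic platform thread
        }
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wasVirtual.get();
    }

    public static void main(String[] args) {
        System.out.println(runsOnVirtualThread(false));
        System.out.println(runsOnVirtualThread(true));
    }
}
```

The task itself is untouched; only the line that constructs the thread differs.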

pjmlp 4 years ago | |

Or inserting the occasional Task.Run() call as a means of avoiding changing the whole call stack up to Main().

gavinray 4 years ago | | |

This hasn't been that much of a problem, IME

If you decide somewhere deep in your program you want to use async operations, most languages allow you to keep the invoking function/closure synchronous and return some kind of Promise/Future-like value
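In Java terms, that pattern might look roughly like the sketch below: an ordinary method that kicks off work and returns a CompletableFuture. The names (PriceLookup, fetchPriceBlocking) and the fake "price" computation are hypothetical stand-ins for real I/O.

```java
import java.util.concurrent.CompletableFuture;

class PriceLookup {
    // Hypothetical slow operation standing in for real I/O.
    static int fetchPriceBlocking(String item) {
        return item.length() * 10;
    }

    // An ordinary (non-async) method: it starts the work and hands back
    // a future instead of the value itself.
    static CompletableFuture<Integer> fetchPriceAsync(String item) {
        return CompletableFuture.supplyAsync(() -> fetchPriceBlocking(item));
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> price = fetchPriceAsync("widget");
        // The caller decides when (or whether) to block for the result.
        System.out.println(price.join());
    }
}
```

The catch is that every caller now has to deal with the future type, which is the colouring being discussed upthread.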

bullen 4 years ago | |

Agreed it's simpler, but using NIO with one OS thread per core also has its benefits.

The context switch (however small) will cause latency when this solution is at saturation.

I think they should write four tests: fiber, NIO and each with userspace networking (no kernel copying network memory) and compare them.

Why Oracle is stalling on removing the kernel from Java networking is surprising to me; they already have a VM.

pron 4 years ago | | |

https://github.com/ebarlas/project-loom-comparison

blibble 4 years ago | | |

there's still a context switch with NIO, you're just doing it manually

SemanticStrengh 4 years ago | |

Except Kotlin coroutines already work, can be very easily integrated into existing Java codebases, and are far superior to Loom (structured concurrency, flow, etc.)

richdougherty 4 years ago | | |

Kotlin coroutines are amazing. They're built on very clever tech that converts fairly normal source code into a state machine when compiled. This has huge benefits and allows the programmer to break their code up without the hassle of explicitly programming callbacks, etc.

https://kotlinlang.org/spec/asynchronous-programming-with-co...

However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.

Having lightweight thread and continuations support directly in the VM makes things very much simpler for programmers (and compiler writers!) since the VM can handle the details of suspending/resuming and code composes together effortlessly, even without compiler support, so it works across languages and codebases.

I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...

I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.

Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc have to do compile-time tricks to get basic tail call support, but it doesn't work across functions well.

Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.

https://kotlinlang.org/docs/functions.html#tail-recursive-fu...

https://www.scala-lang.org/api/3.x/scala/annotation/tailrec....

https://rd.nz/2009/04/tail-calls-tailrec-and-trampolines.htm...

As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?

"Project Loom is to intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:

* Virtual threads

* Delimited continuations

* Tail-call elimination"

https://wiki.openjdk.java.net/display/loom/Main

pron 4 years ago |

For more information about virtual threads see https://openjdk.java.net/jeps/425 (planned to preview in JDK 19, out this September).

What's remarkable about this experiment is that it uses simple 26-year-old (Java 1.0) networking APIs.

midislack 4 years ago |

I see a lot of these making the FP of HN. But it's very difficult to be impressed, or unimpressed because it's all about hardware. How much hardware is everybody throwing at all of this? 5M persistent connections on a Pi with mere GigE? Pretty frickin' amazing. 5M persistent connections on a Threadripper with 128 cores and a dozen trunked 4 port 10GE NICs? Yaaaaawwwnnn snooze.

We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.

jpollock 4 years ago | |

This isn't about the hardware, it's about thread count.

There are limits in the Linux kernel, and the 5M concurrent connections figure was chosen to exceed them.

From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the Linux kernel. By default this is limited to 64k. However, this can be increased by setting a flag in the kernel, to a maximum of 2^22 or 4M.

In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
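A quick check of that arithmetic, taking the 2^22 ceiling from the comment above as given (the class and method names are just for illustration):

```java
class PidBudget {
    // Ceiling on kernel thread IDs quoted above: 2^22.
    static final long MAX_PIDS = 1L << 22;

    // True if one kernel thread per connection would fit under the ceiling.
    static boolean fitsInPidSpace(long connections) {
        return connections <= MAX_PIDS;
    }

    public static void main(String[] args) {
        System.out.println(MAX_PIDS);
        System.out.println(fitsInPidSpace(5_000_000L));
    }
}
```

So 5M thread-per-connection handlers cannot each be a kernel thread, whatever the hardware.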

Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.

If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.

shadowpho 4 years ago | |

Raspberry pi 4 performance changes wildly based on cooling. Bare die vs heatsink vs heatsink + fan will give you wildly different results.

midislack 4 years ago | | |

Same is true with any computer these days. So let's go no heat sink, Pi 4 4GB anyway.

niederman 4 years ago | |

> Everybody can find one

LMAO I wish.

https://rpilocator.com/?cat=PI4

kmelva 4 years ago | |

Could a 128c Threadripper even do 5M kernel threads?

TYMorningCoffee 4 years ago |

I was only able to get to 840,000 open connections with my experiment. My machine only has 8GB of memory. https://josephmate.github.io/2022-04-14-max-connections/

Is there any way for the TCP connections to share memory in kernel space? My experiment only uses two 8-byte buffers in userspace.

mh- 4 years ago | |

no*, and as you've discovered, the sk_buffs allocated by the kernel will often be the limiting factor for a highly concurrent socket server on Linux.

* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.

edit: check out this sibling thread about userland TCP. I think this is a more interesting/likely direction to explore in. https://news.ycombinator.com/item?id=31215569

toast0 4 years ago | |

Does Linux actually allocate buffers for each socket or does it just link to sk_buff's (which I understand are similar to FreeBSD's mbuf's) and then limit how much storage can be linked? FreeBSD has a limit on the total ram used for mbufs as well, not sure about Linux.

Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
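The numbers in that paragraph can be reproduced with a few lines. This is only a sketch of the arithmetic as described (one FD per four pages, 4 KiB pages, client and server sharing the machine); it does not read any real kernel setting, and the names are illustrative.

```java
class FdBudget {
    // One file descriptor per four memory pages, per the rule quoted above.
    static long maxFds(long ramBytes, long pageBytes) {
        return ramBytes / pageBytes / 4;
    }

    // Client and server on the same host: each connection uses two FDs.
    static long maxLoopbackConnections(long ramBytes, long pageBytes) {
        return maxFds(ramBytes, pageBytes) / 2;
    }

    public static void main(String[] args) {
        long ram = 8L << 30;   // 8 GiB
        long page = 4096;      // assumed 4 KiB page size
        System.out.println(maxFds(ram, page));
        System.out.println(maxLoopbackConnections(ram, page));
    }
}
```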

When you're really not doing much with the connections, userland TCP, as suggested in a sibling comment, could help you squeeze in more connections, but if you're going to actually do work, you probably need more RAM.

Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.

Matthias247 4 years ago | | |

I think the socket buffers (sk_buff) are actually shared. They are all packet sized, and whatever socket needs to transmit some data or receives it gets the buffers attached. So my assumption is that the amount of required socket buffers scales more with the amount of data transmission than with the number of sockets.

But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.

sgtnoodle 4 years ago |

I'm not a Java programmer. I tried clicking 3 layers deep of links, but still have no idea what virtual threads are in this context. Is it a userspace thread implementation?

I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)

grishka 4 years ago | |

> Is it a userspace thread implementation?

Yes.

christophilus 4 years ago |

Loom looks like it’s nicely solved the function coloring problem. This plus Graal makes me excited to pick up Clojure again.

10000truths 4 years ago |

A bit of a digression, but I’d love to see how much further one could go with a memory-optimized userland TCP stack, and storing the send and receive buffers on disk.

A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.

A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
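The back-of-envelope numbers above work out if decimal units are assumed (4 TB = 4 x 10^12 bytes, 16 KB = 16,000 bytes). A small sketch, with illustrative names:

```java
class TcpStateBudget {
    // How many (send, receive) buffer pairs fit on a disk of the given size.
    static long bufferPairs(long diskBytes, long bufferBytes) {
        return diskBytes / (2 * bufferBytes);
    }

    // RAM needed for the per-connection state structs.
    static long stateBytes(long connections, long structBytes) {
        return connections * structBytes;
    }

    public static void main(String[] args) {
        long pairs = bufferPairs(4_000_000_000_000L, 16_000L);
        System.out.println(pairs);
        System.out.println(stateBytes(pairs, 256L));
    }
}
```

With binary units (16 KiB buffers) the count drops to roughly 122 million, which is still in line with the "~125 million" estimate.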

metabrew 4 years ago |

API for the server example looks... actually good, wow. Nice job!

Also tickled to see my erlang 1M comet blog post referenced. A lifetime ago now, pre-websockets.

nelsonic 4 years ago |

Reminds me of https://phoenixframework.org/blog/the-road-to-2-million-webs... Would love to see this extended to more languages/frameworks.

wiradikusuma 4 years ago |

The experiment is about a Java app, but the tweaks are at the O/S level. Does that mean any app (Java or not, Loom or not) can achieve the target given the correct tweaks?

Also, why are these not default for the O/S? What are we compromising by setting those values?

invalidname 4 years ago |

This is pretty fantastic!

I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.

isbvhodnvemrwvn 4 years ago | |

Spring Boot overhead would likely make that infeasible.

RhodesianHunter 4 years ago | | |

Spring Boot overhead is largely in startup time. It really doesn't have much overhead thereafter.

It's largely a collection of the same libraries you would use anyway, glued together with a custom DI system.

invalidname 4 years ago | | |

I'm not saying 5M. I just want to see to what scale it would get without threading issues. Spring Boot isn't THAT heavy.

the8472 4 years ago |

   net.netfilter.nf_conntrack_buckets = 1966050
   net.netfilter.nf_conntrack_max = 7864200

or avoid conntrack entirely

LinuxBender 4 years ago | |

For completeness' sake I would add that one must also set

  options nf_conntrack expect_hashsize=X hashsize=X

in /etc/modules.d/nf_conntrack.conf, X being 1/4 the size of conntrack_max

alberth 4 years ago |

Is this a test of just having 5M people knock on your door?

Or is this a test where something actually happens (data exchanges) with each connection?

I ask because those are two totally different workloads, and the latter is typically where Erlang shines.

bufferoverflow 4 years ago | |

It's an echo server. The client sends the data, the server responds with the same data.
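For anyone who hasn't opened the repo, a thread-per-connection echo server on virtual threads can be sketched in a few dozen lines. This is not the code from the linked project, just a minimal illustration assuming JDK 21 (or JDK 19/20 with preview features); the class name and the roundTrip self-check are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class VirtualEchoServer {
    // Accept loop: one virtual thread per connection, plain blocking I/O.
    static void serve(ServerSocket server) {
        while (true) {
            try {
                Socket socket = server.accept();
                Thread.startVirtualThread(() -> echo(socket));
            } catch (IOException e) {
                return; // accept failed, typically because the server socket was closed
            }
        }
    }

    // Copy bytes back to the sender until the peer closes the connection.
    static void echo(Socket socket) {
        try (socket;
             InputStream in = socket.getInputStream();
             OutputStream out = socket.getOutputStream()) {
            in.transferTo(out);
        } catch (IOException ignored) {
            // connection dropped; try-with-resources already closed the socket
        }
    }

    // Self-check: start a server on a free port, send one message, return the echo.
    static String roundTrip(String message) {
        try (ServerSocket server = new ServerSocket(0)) { // port 0 = any free port
            Thread.startVirtualThread(() -> serve(server));
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                byte[] sent = message.getBytes(StandardCharsets.UTF_8);
                client.getOutputStream().write(sent);
                byte[] received = client.getInputStream().readNBytes(sent.length);
                return new String(received, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

Everything here except Thread.startVirtualThread is old blocking java.net and java.io API, which is the point made elsewhere in the thread.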

imranhou 4 years ago |

It looks closer to goroutines, which to me begs the question - where are the channels that I could use to communicate between these virtual threads?

adra 4 years ago | |

Go's channels are, simplistically, a mutex in front of a queue. Java has many existing objects that can do the same; it's just not the idiomatic choice. Since green threads should wake up from Object.notify(), any threads blocking on the monitor should wake and consume. I'm curious how a green-thread ConcurrentDequeue would stand up to Go's channels in scalability and performance.
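As a rough sketch of the "queue as channel" idea, here is a virtual-thread producer feeding a consumer through a bounded java.util.concurrent.BlockingQueue. It assumes JDK 21 (or JDK 19/20 with preview features); the capacity of 1 is an arbitrary choice that loosely resembles a small buffered channel, and no claim is made about how it performs against Go.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueAsChannel {
    // Sends 1..count through the queue from a virtual thread and sums them on the caller.
    static int sumViaQueue(int count) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(1);

        Thread producer = Thread.startVirtualThread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    channel.put(i); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            int sum = 0;
            for (int i = 0; i < count; i++) {
                sum += channel.take(); // blocks while the queue is empty
            }
            producer.join();
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while receiving", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumViaQueue(100));
    }
}
```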

Matthias247 4 years ago | | |

You are right. But Go channels also come with the superpower of "select", which allows waiting for multiple objects to become ready and executing actions atomically. I don’t think this part can be retrofitted on top of simple BlockingQueues.

sdfgdfgbsdfg 4 years ago | |

In a library. Loom is more about adapting the JVM itself for continuations and virtual threads than adding to userspace.

Andrew_nenakhov 4 years ago |

Sounds like a job for Erlang.

speed_spread 4 years ago | |

Sounds like Erlang's out of a job.

Andrew_nenakhov 4 years ago | | |

No.

deepsun 4 years ago |

How does that compare to Kotlin suspend functions?

wiseowise 4 years ago |

And how is that any different from Kotlin coroutines if you still need to call Thread.startVirtualThread?

pron 4 years ago | |

1. These are actual threads from the Java runtime's perspective. You can step through them and profile them with existing debuggers and profilers. They maintain stacktraces and ThreadLocals just like platform threads.

2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.

Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
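A minimal sketch of that executor swap, assuming JDK 21 (or JDK 19/20 with preview features). The class name, the task body, and the 10 ms sleep are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskExecutorDemo {
    // Submits taskCount blocking tasks and counts how many ran on a virtual thread.
    static int runTasks(int taskCount) {
        // Before Loom this line might have been Executors.newFixedThreadPool(n).
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                results.add(executor.submit(() -> {
                    Thread.sleep(10); // ordinary blocking call inside the task
                    return Thread.currentThread().isVirtual() ? 1 : 0;
                }));
            }
            int virtualCount = 0;
            for (Future<Integer> result : results) {
                virtualCount += result.get();
            }
            return virtualCount;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runTasks(1_000));
    }
}
```

The code that submits and awaits tasks is the same ExecutorService code as before; only the factory call changes.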

ferdowsi 4 years ago | |

Kotlin coroutines are colored and infect your whole codebase. Virtual threads do not.

wiseowise 4 years ago | | |

You can mark everything suspend and there's no difference.

pjmlp 4 years ago | |

Native VM support instead of an additional library faking it and filling .class files with needless boilerplate.

KingOfCoders 4 years ago |

Something for everybody to learn: the article is mainly about Linux tuning.

jeroenhd 4 years ago | |

The Linux tuning part seems to have been inspired by these blog posts from 14 years ago: https://www.metabrew.com/article/a-million-user-comet-applic...

It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to the differences in runtime behaviour between Erlang and the JVM.

toast0 4 years ago | | |

I mean... is 5M very impressive? Not really. Does it show that Project Loom meets the goal of being able to do large client count thread per server workloads? I think so. Does the name remind me of a best selling point and click adventure game? Definitely yes.

torginus 4 years ago |

While impressive, I don't really see it as something practical - I think scaling across processes/VMs is a much more realistic approach.

zinxq 4 years ago |

Loom sets out to give you a sane programming paradigm similar to what threads do (i.e. as opposed to programming asynchronous I/O in Java with some type of callback) without the overhead of Operating System threads.

That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.

I could also attempt 5M connections at the Java level using Netty and asynchronous I/O - no threads or Loom. Again, it'd take more Linux configuration than anything else. Once that configuration is in place, though, you can also do it in C# async/await, JavaScript, I'm sure Erlang, and anything else that does asynchronous I/O, whether it's masked by something like Loom or async/await or not.

notorandit 4 years ago |

With a maximum of 64k TCP connections per single server IP, you need 77 different IPs on the server side. This is a fact.

Nullabillity 4 years ago |

Loom is missing the point.

Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
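For readers unfamiliar with the style being advocated, "to do X, I first need to know Y" maps onto future composition roughly as in the sketch below. It uses java.util.concurrent.CompletableFuture; the lookups are hypothetical placeholders that compute values locally instead of doing I/O.

```java
import java.util.concurrent.CompletableFuture;

class FutureChain {
    // Y: hypothetical lookup that would normally hit a database or service.
    static CompletableFuture<String> loadUserName(int id) {
        return CompletableFuture.supplyAsync(() -> "user" + id);
    }

    // Second hypothetical lookup that depends on the result of the first.
    static CompletableFuture<Integer> loadScore(String userName) {
        return CompletableFuture.supplyAsync(userName::length);
    }

    // X: expressed as a chain of dependencies, with no explicit locks or threads.
    static CompletableFuture<String> summary(int id) {
        return loadUserName(id)
                .thenCompose(name -> loadScore(name)
                        .thenApply(score -> name + " scored " + score));
    }

    public static void main(String[] args) {
        System.out.println(summary(7).join());
    }
}
```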

I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.

Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.