Go look at the source code. Look at how simple it is: anyone who has created a thread in Java knows what's happening. With only minor tweaks, your pre-existing code can take advantage of this with basically no effort. And it retains all the debuggability of a traditional Java thread (i.e. a stack trace that makes sense!).
If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.
You no longer have to worry about any of that.
If you decide somewhere deep in your program you want to use async operations, most languages allow you to keep the invoking function/closure synchronous and return some kind of Promise/Future-like value
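A minimal sketch of that pattern in Java, assuming `CompletableFuture` plays the role of the Promise/Future-like value. The class and method names (`AsyncLookup`, `lookupLength`, `describe`) are hypothetical and only there for illustration.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical example: the work deep inside runs asynchronously, while the
// methods that call it stay ordinary methods that hand a future back.
class AsyncLookup {
    // Starts the work on the common pool and returns immediately.
    static CompletableFuture<Integer> lookupLength(String key) {
        return CompletableFuture.supplyAsync(() -> key.length());
    }

    // An ordinary method with no special "async" marker: it attaches one
    // more transformation step and passes the future along.
    static CompletableFuture<String> describe(String key) {
        return lookupLength(key).thenApply(n -> key + " has " + n + " chars");
    }

    public static void main(String[] args) {
        // join() blocks the caller until the value is ready.
        System.out.println(describe("loom").join());
    }
}
```

The caller decides where to block (here with `join()`), so nothing above `describe` needs a different function signature style, only a different return type.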
The context switch (however small) will cause latency when this solution is at saturation.
I think they should write four tests: fiber, NIO and each with userspace networking (no kernel copying network memory) and compare them.
Why Oracle is stalling on removing the kernel from Java networking is surprising to me; they already have a VM.
https://kotlinlang.org/spec/asynchronous-programming-with-co...
However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.
Having lightweight thread and continuations support directly in the VM makes things very much simpler for programmers (and compiler writers!) since the VM can handle the details of suspending/resuming and code composes together effortlessly, even without compiler support, so it works across languages and codebases.
I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...
I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.
Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc have to do compile-time tricks to get basic tail call support, but it doesn't work across functions well.
Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.
https://kotlinlang.org/docs/functions.html#tail-recursive-fu...
https://www.scala-lang.org/api/3.x/scala/annotation/tailrec....
https://rd.nz/2009/04/tail-calls-tailrec-and-trampolines.htm...
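To make the compile-time trick concrete, here is a rough sketch in Java of what a tail-recursion annotation asks the compiler to do: turn a self-call in tail position into a loop. The class and method names are hypothetical, and this is hand-written Java rather than actual Kotlin or Scala compiler output.

```java
// Hypothetical illustration of the rewrite behind a tail-recursion annotation.
class TailRecDemo {
    // Tail-recursive form: the recursive call is the last operation.
    // On today's JVM each call still pushes a frame, so a large enough n
    // is likely to end in a StackOverflowError.
    static long sumRecursive(long n, long acc) {
        if (n == 0) return acc;
        return sumRecursive(n - 1, acc + n);
    }

    // What the rewrite amounts to: the parameters are reassigned and control
    // jumps back to the top, so the stack depth stays constant.
    static long sumLoop(long n, long acc) {
        while (true) {
            if (n == 0) return acc;
            acc = acc + n;
            n = n - 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumRecursive(1_000, 0));   // small input, safe to recurse
        System.out.println(sumLoop(10_000_000L, 0));  // large input, constant stack
    }
}
```

This only works when a function calls itself; a tail call into a different function cannot be turned into a local jump this way.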
As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?
"Project Loom is to intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:
* Virtual threads
* Delimited continuations
* Tail-call elimination"
What's remarkable about this experiment is that it uses simple 26-year-old (Java 1.0) networking APIs.
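For readers who have not seen that style, a minimal sketch of a blocking echo server built on `java.net.ServerSocket`, with one virtual thread per connection. This is not the experiment's actual code: the class name, buffer size and the `roundTrip` helper are assumptions for illustration, and it assumes JDK 21 or later, where `Thread.startVirtualThread` is a final API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class BlockingEchoServer {
    // Accept loop: plain blocking accept(), one virtual thread per connection.
    static void serve(ServerSocket server) {
        try {
            while (true) {
                Socket socket = server.accept();
                Thread.startVirtualThread(() -> echo(socket));
            }
        } catch (IOException e) {
            // Closing the ServerSocket makes accept() throw and ends the loop.
        }
    }

    // Plain blocking read/write; the virtual thread parks while it waits.
    static void echo(Socket socket) {
        try (Socket s = socket;
             InputStream in = s.getInputStream();
             OutputStream out = s.getOutputStream()) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                out.flush();
            }
        } catch (IOException e) {
            // Connection dropped; nothing to clean up beyond try-with-resources.
        }
    }

    // Hypothetical self-check: start the server on an ephemeral loopback port,
    // send one message from a client socket and return the echoed reply.
    static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            Thread.startVirtualThread(() -> serve(server));
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                byte[] payload = message.getBytes(StandardCharsets.UTF_8);
                client.getOutputStream().write(payload);
                client.getOutputStream().flush();
                byte[] reply = client.getInputStream().readNBytes(payload.length);
                return new String(reply, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello loom"));
    }
}
```

Apart from `Thread.startVirtualThread`, every call here is an old blocking API; swapping in `new Thread(...).start()` would give the classic thread-per-connection server.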
We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.
There are limits in the Linux kernel, and the 5M concurrent connections figure was chosen to exceed them.
From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the Linux kernel. By default this is limited to 64k. However, this can be increased by setting a flag in the kernel, to a maximum of 2^22, or about 4M.
In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.
If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.
LMAO I wish.
Is there any way for the TCP connections to share memory in kernel space? My experiment only uses two 8-byte buffers in userspace.
* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.
edit: check out this sibling thread about userland TCP. I think this is a more interesting/likely direction to explore in. https://news.ycombinator.com/item?id=31215569
Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
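The arithmetic behind those numbers, as a small sketch. It assumes 4 KiB pages and reads "8GB" as 8 GiB; the class and method names are made up for the example, and the one-FD-per-four-pages rule is taken from the comment above rather than checked against the kernel source.

```java
class FdLimitEstimate {
    static final long PAGE_SIZE = 4096; // assumed 4 KiB pages

    // One file descriptor per four pages of RAM, as described above.
    static long maxFds(long ramBytes) {
        return ramBytes / PAGE_SIZE / 4;
    }

    public static void main(String[] args) {
        long ram = 8L * 1024 * 1024 * 1024; // 8 GiB
        long fds = maxFds(ram);
        System.out.println("pages:   " + ram / PAGE_SIZE); // 2097152 (~2M)
        System.out.println("max FDs: " + fds);             // 524288 (512k)
        // Client and server on one host use two FDs per connection.
        System.out.println("connections on one host: " + fds / 2); // 262144 (256k)
    }
}
```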
When you're really not doing much with the connections, userland TCP, as suggested in a sibling comment, could help you squeeze in more connections, but if you're going to actually do work, you probably need more RAM.
Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.
But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.
I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)
Yes.
A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.
A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
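The same estimate written out, as a sketch that uses decimal units throughout (1 KB = 1,000 bytes, 1 TB = 10^12 bytes); with binary units the counts come out slightly lower. The names are hypothetical.

```java
class ConnectionStateEstimate {
    // How many send/receive buffer pairs fit into the given storage.
    static long bufferPairsThatFit(long storageBytes, long bufferBytes) {
        return storageBytes / (2 * bufferBytes);
    }

    // Memory needed for the per-connection state structs.
    static long stateBytes(long connections, long bytesPerConnection) {
        return connections * bytesPerConnection;
    }

    public static void main(String[] args) {
        long ssd = 4_000_000_000_000L;                 // 4 TB
        long pairs = bufferPairsThatFit(ssd, 16_000);  // 16 KB per buffer
        System.out.println("buffer pairs: " + pairs);  // 125000000
        System.out.println("state bytes:  " + stateBytes(pairs, 256)); // 32000000000 (32 GB)
    }
}
```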
Also tickled to see my erlang 1M comet blog post referenced. A lifetime ago now, pre-websockets.
Also, why are these not default for the O/S? What are we compromising by setting those values?
I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.
It's largely a collection of the same libraries you would use anyways glued together with a custom di system.
net.netfilter.nf_conntrack_buckets = 1966050
net.netfilter.nf_conntrack_max = 7864200
or avoid conntrack entirely options nf_conntrack expect_hashsize=X hashsize=X
in /etc/modules.d/nf_conntrack.conf, X being 1/4 the size of conntrack_max
Or is this a test where something actually happens (data exchanges) with each connection?
I ask because those are two totally different workloads, and the latter is typically where Erlang shines.
2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.
Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
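A small sketch of that swap, assuming JDK 21 or later (in JDK 19 and 20 the same API needs `--enable-preview`). The class name, the task body and the `runBlockingTasks` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class VirtualExecutorDemo {
    // Submits `tasks` blocking tasks, each on its own virtual thread,
    // and returns how many of them completed.
    static int runBlockingTasks(int tasks) {
        List<Future<Integer>> results = new ArrayList<>();
        // Before: something like Executors.newFixedThreadPool(200).
        // After: one new virtual thread per submitted task.
        // ExecutorService is AutoCloseable since JDK 19; close() waits for the tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                results.add(executor.submit(() -> {
                    Thread.sleep(10); // stands in for a blocking I/O call
                    return 1;
                }));
            }
        }
        int completed = 0;
        for (Future<Integer> f : results) {
            try {
                completed += f.get();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(runBlockingTasks(10_000));
    }
}
```

The code that submits work is unchanged; only the factory method that creates the executor differs.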
It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to the differences in runtime behaviour between Erlang and the JVM.
That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.
I could also attempt 5M connections at the Java level using Netty and asynchronous IO - no threads or Loom. Again, it'd take more Linux configuration than anything else. Once that configuration is done, though, you can also do it in C# async/await, JavaScript, and I'm sure Erlang and anything else that does asynchronous I/O, whether it's masked by something like Loom or async/await or not.
Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.
Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.
It's interesting to think about though, I agree. What are the next scaling bottlenecks now that threading (for JVM-compatible languages) is nearly solved?
There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:
1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.
2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.
3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.
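On point 2, the kind of client-side jitter being discussed might look like the sketch below. It is one common scheme (exponential backoff with "full" jitter), chosen here as an assumption; the comment above does not say which scheme it has in mind, and the names and constants are hypothetical. As point 2 notes, this spreads the reconnect load out but does not remove the wait.

```java
import java.util.Random;

class ReconnectBackoff {
    // Delay before reconnect attempt number `attempt` (0-based): a uniformly
    // random value below min(capMillis, baseMillis * 2^attempt).
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random rng) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        long ceiling = Math.min(capMillis, baseMillis << shift);
        return (long) (rng.nextDouble() * ceiling);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + delayMillis(attempt, 500, 60_000, rng) + " ms");
        }
    }
}
```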
I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).
One nice thing about TCP though is it's trivial to determine if packets are establishing or connected; you can easily drop incoming SYNs when CPU is saturated to put back pressure on clients. That will work well enough when crypto setup is the issue as well. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwhelmed and can't keep state for all of the half-established connections, although not without tradeoffs.
In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.
For extremely IO-wait-bound workloads though, there was always a LOT of hoops to jump through to make performance strong, since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.
Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.
IMHO it's only JVM+Graal that can bring this to other languages. Loom relies very heavily on some fairly unique aspects of the Java ecosystem (Go has these things too though). One is that lots of important bits of code are implemented in pure Java, like the IO and SSL stacks. Most languages rely heavily on FFI to C libraries. That's especially true of dynamic scripting languages but is also true of things like Rust. The Java world has more of a culture of writing their own implementations of things.
For the Loom approach to work you need:
a. Very tight and difficult integration between the compiler, threading subsystem and garbage collector.
b. The compiler/runtime to control all code being used. The moment you cross the FFI into code generated by another compiler (i.e. a native library) you have to pin the thread and the scalability degrades or is lost completely.
But! Graal has a trick up its sleeve. It can JIT compile lots of languages, and those languages can call into each other without a classical FFI. Instead the compiler sees both call site and destination site, and can inline them together to optimize as one. Moreover those languages include binary languages like LLVM bitcode and WASM. In turn that means that e.g. Python calling into a C extension can still work, because the C extension will be compiled to LLVM bitcode and then the JVM will take over from there. So there's one compiler for the entire process, even when mixing code from multiple languages. That's what Loom needs.
At least in theory. Perhaps pron will contradict me here because I have a feeling Loom also needs the invariant that there are no pointers into the stack. True for most languages but not once C gets involved. I don't know to what extent you could "fix" C programs at the compiler level to respect that invariant, even if you have LLVM bitcode. But at least the one-compiler aspect is not getting in the way.
For application level, it's going to depend on how you handle concurrency. This post is interesting, because it's a benchmark of a different way to do it in Java. You could probably do 5M connections in regular Java through some explicit event loop structure; but with the Loom preview, you can do it connection per Thread. You would be unlikely to do it with connection per Thread without Loom, since Linux threads are very unlikely to scale so high (but I'd be happy to read a report showing 5M Linux threads)
Some back of the envelope maths: https://www.wolframalpha.com/input?i=100+Gbps+%2F+5+million
If the server had a 100 Gbps Ethernet NIC, this would leave just 20 kbps for each TCP connection.
I could imagine some IoT scenarios where this might be a useful thing, but outside of that? I doubt there's anyone that wants 20 kbps throughput in this day and age...
It's a good stress test however to squeeze out inefficiencies, super-linear scaling issues, etc...
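The back-of-the-envelope figure above, written out as a tiny sketch (hypothetical names, decimal units, and an even split across connections assumed):

```java
class BandwidthPerConnection {
    // Even share of a link, in bits per second per connection.
    static long bitsPerSecondEach(long linkBitsPerSecond, long connections) {
        return linkBitsPerSecond / connections;
    }

    public static void main(String[] args) {
        long link = 100_000_000_000L; // 100 Gbps
        long each = bitsPerSecondEach(link, 5_000_000);
        System.out.println(each + " bit/s per connection");      // 20000, i.e. 20 kbps
        System.out.println(each / 8 + " bytes/s per connection"); // 2500
    }
}
```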
- Green threads scale somewhat better, but both scale ridiculously well, meaning probably you won't run into scaling issues.
- async/await generators use way less memory than a dedicated green thread, this affects both memory consumption and startup time, since the process has to run around asking the OS for more memory
- green threads are faster to execute
Here's the link:
https://alexyakunin.medium.com/go-vs-c-part-1-goroutines-vs-...
For those who don't understand this, Kotlin's co-routine framework is designed to be platform neutral and already works on top of the major platforms that have Kotlin compilers (native, JavaScript, JVM, and soon WASM). So, it doesn't really compete with the "native" way of doing concurrent, asynchronous, or parallel computing on any of those platforms but simply abstracts the underlying functionality.
It's actually a multi platform library that implements all the platform specific aspects in the platform appropriate way. It's also very easy to adapt existing frameworks in this space via Kotlin extension functions and the JVM implementation actually ships out of the box with such functions for most common solutions on the JVM for this (Java's threads, futures, threadpools, etc., Spring Flux, RxJava, Vert.x, etc.). Loom will be just another solution in this long list.
If you use Spring Boot with Kotlin for example, rather than dealing with Spring's Flux, you simply define your asynchronous resources as suspend functions. Spring does the rest.
With Kotlin-js in a browser you can call Promise.toCoroutine() and async { ... }.asPromise(). That makes it really easy to write asynchronous event handling in a web application, for example, or to work with JavaScript APIs that expect promises from Kotlin. And if you use web-compose, fritz2, or even React with kotlin-js, you'd likely be dealing with anything asynchronous via some kind of co-routine and suspend functions.
Once Loom ships, it basically will enable some nice, low level optimization to happen in the JVM implementation for co-routines and there will likely be some new extension functions to adapt the various new Java APIs for this. Not a big deal but it will probably be nice for situations with extremely large amounts of co-routines and IO. Not that it's particularly struggling there of course but all little bits help. It's not likely to require any code updates either. When the time comes, simply update your jvm and co-routine library and you should be good to go.
I won't repeat it all, but the main point is that having runtime support is much better than relying on compiler support, even if compiler support is pretty fantastic.
Note that the two aren't mutually exclusive; you should still be able to use coroutines after Project Loom ships, and it still might make sense in many places.
So while you could achieve 5M in other ways, those ways would not only be more complex, but also not really observable/debuggable by Java platform tools.
In the sort of applications that I get involved with, it's frequently the case that, whilst one OS thread per Java thread was a theoretical scalability limitation, in practice we were never likely to hit it (and there was always the 'get a bigger computer' option).
But: the complexity mavens inside our company and projects we rely upon get bitten by an obsessive need to chase 'scalability' /at all costs/. Which is fine, but the downside is that the negative consequences of coloured functions come into play. We end up suffering having to deal with vert.x or Kotlin or whatever the flavour-of-the-month solution is that is /inherently/ harder to reason about than a linear piece of code. If you're in a C# project, then you get a library that's async, and boom, game over.
If Loom gets even within performance shouting distance of those other models, it ought to kill (for all but the edgiest of edge-cases) reactive programming in the Java space dead. You might be able to make a case - obviously depending on your use cases which are not mine - that extracting, say, 50% more scalability is worth the downsides. If that number is, say, 5%, then for the vast majority of projects the answer is going to be 'no'.
I say 'ought to', as I fear the adage that "developers love complexity the way moths love flames - and often with the same results". I see both engineers and projects (Hibernate and Keycloak, IIRC) have a great deal of themselves invested in their Rx position, and I already sense that they're not going to give it up without a fight.
So: the headline number is less important than "for virtually everyone you will no longer have to trade simplicity for scalability". I can't wait!
I still attest though - The 5M connections in this example is still a red herring.
Can we get to 6M? Can we get to 10M? Is that a question for Loom or Java's asynchronous IO system? No - it's a question for the operating system.
Loom and Java NIO can handle probably a billion connections as programmed. Java Threads cannot - although that too is a broken statement. "Linux Threads cannot" is the real statement. You can't have that many for resource reasons. Java Threads are just a thin abstraction on top of that.
Linux out of the box can't do 5M connections (last I checked). It takes Linux tuning artistry to get it there.
Don't get me wrong - I think Loom is cool. It attempts to do the same thing async/await tried to do, just better. But it is most definitely not the only way to achieve 5MM connections with Java or anything else. Possibly, however, it's the most friendly and intuitive way to do it.
*We typically vilify Java threads for the RAM they consume: something like 1 MB per thread (tunable). Loom must still use "some" RAM per connection, although surely far, far less (and of course Linux must use some amount of kernel RAM per connection too).
Having run production services that had over 250,000 sockets connecting to a single server port, I'm calling "nope" on that.
Are you thinking of the ephemeral port limit? That's on the client side; not the server side. Each TCP socket pair is a four-tuple of [server IP, server port, client IP, client port]; the uniqueness comes from the client IP/port part in the server case.
The real problem with such a setup is that you're not left with a whole lot of bandwidth per connection, even if you ignore things like packet loss and retransmits mucking up the connections. Most VPS servers have a 1 Gbps connection; with 5 million clients that leaves 200 bits (25 bytes) per second of concurrent bandwidth per connection for TCP signaling and data to flow through. You'll need a ridiculous network card for a single server to deal with such a load, in the terabits per second range.
Cloudflare has some interesting blog posts on this topic:
- https://blog.cloudflare.com/how-we-built-spectrum/
- https://blog.cloudflare.com/how-to-stop-running-out-of-ephem...
If you suppose just one open server port, you’ll probably need 77 client IPs to do this test to get unique socket pairs.
But it’s a client problem, not a server one.
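Where the 77 comes from, as a quick sketch. It assumes every one of the 65,535 possible source ports on a client IP can be used, which is an upper bound; real ephemeral port ranges are smaller, so a real test would need more addresses. The names are hypothetical.

```java
class ClientIpEstimate {
    // Upper bound on usable source ports per client IP (ports 1..65535).
    static final long MAX_PORTS_PER_IP = 65_535;

    // With a single server IP and port, each connection needs a unique
    // (client IP, client port) pair, so round the division up.
    static long clientIpsNeeded(long connections, long portsPerIp) {
        return (connections + portsPerIp - 1) / portsPerIp;
    }

    public static void main(String[] args) {
        System.out.println(clientIpsNeeded(5_000_000, MAX_PORTS_PER_IP)); // 77
    }
}
```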
Clients can connect to the server on the same server port, so connection limit is more like 64k*2 for every Client IP-Server IP pair.
Especially when that future scheduler already exists and works, and the preemptive one is a multi-year research project away.
Go is just yet another implementation of green threads that is slightly less broken than prior implementations, because it had the benefit of being implemented on day 1 (so the whole ecosystem is green thread-aware). It's certainly nowhere near "best-in-class".
Threads don't require locks and condvars. You can use channels and scoped joins etc. if you want.
Give me some async code and I'll show you an easier threaded version.
I don't find myself missing out on futures in Go.
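As a rough illustration of the "channels instead of locks" point in Java terms: Java has no built-in channel type, so this sketch uses a bounded `BlockingQueue` as a stand-in (an assumption of the example, not something the comment specifies). The application code never touches a lock or condition variable directly; the queue does that internally. Names and the sentinel value are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ChannelDemo {
    private static final int DONE = -1; // sentinel marking the end of the stream

    // A producer thread sends 1..count over the "channel"; the caller
    // receives and sums the values, then joins the producer.
    static int sumViaChannel(int count) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    channel.put(i); // blocks while the channel is full
                }
                channel.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int sum = 0;
        try {
            while (true) {
                int value = channel.take(); // blocks while the channel is empty
                if (value == DONE) break;
                sum += value;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumViaChannel(100)); // 5050
    }
}
```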
Saying "Linux cannot handle 5M connections with one thread per connection" isn't a reasonable statement because no operating system can do that, they can't even get close. The resource usage of a kernel thread is defined by pretty fundamental limits in operating system architecture, namely, that the kernel doesn't know anything about the software using the thread. Any general purpose kernel will be unable to provision userspace with that many threads without consuming infeasible quantities of RAM.
The reason JVM virtual threads can do this is because the JVM has deep control and understanding of the stack and the heap (it compiled all the code). The reason Loom scalability gets worse if you call into native code is that then you're back to not controlling the stack.
Getting to 10M is therefore very much a question for the JVM as well as the operating system. It'll be heavily affected by GC performance with huge heaps, which luckily modern G1 excels at, it'll be affected by the performance of the JVM's userspace schedulers (ForkJoinPool etc), it'll be affected by the JVM's internal book-keeping logic and many other things. It stresses every level of the stack.
It is the only way to achieve that many connections with Java in a way that's debuggable and observable by the platform and its tools, regardless of its intuitiveness or friendliness to human programmers. It's important to understand that this is an objective technical difference, and one of the cornerstones of the project. Computations that are composed in the asynchronous style are invisible to the runtime. Your server could be overloaded with I/O, and yet your profile will show idle thread pools.
Virtual threads don't just allow you to write something you could do anyway in some other way. They actually do work that has simply been impossible so far at that scale: they allow the runtime and its tools to understand how your program is composed and observe it at runtime in a meaningful and helpful way.
One of the main reasons so many companies turn to Java for their most important server-side applications is that it offers unmatched observability into what the program is doing (at least among other languages/platforms with similar performance). But that ability was missing for high-scale concurrency. Virtual threads add it to the platform.
Supporting tooling has been one of the most important aspects of this project, because even those who were willing to write asynchronous code, and even the few who actually enjoyed it, constantly complained — and rightly so — that they cannot easily observe, debug and profile such programs. When it comes to "serious" applications, observability is one of the most important aspects and requirements of a system.
Instead of introducing a new kind of sequential code unit through all layers of tooling, which would have been a huge project anyway, we abstracted the existing thread concept.
Erlang is maximal shared mutable state!
Processes are mutable state and they’re shared between other processes.
From the article, it seems that Loom (in preview) enables the threaded model for Java to scale. IMHO, this is great because you can write simple straightforward code in a threaded model. You can certainly write complex code in a threaded model too. Maybe there's an argument that promises can be simple and straightforward too, but my experience with them hasn't been very straightforward.
Memory contention is also playing into this.
The benchmark they made asks the question in a way that leans into the answer they need; like 99% of all human activity, it's biased.
with NIO you are still managing the stack, just yourself instead of letting the operating system do it for you
it is still a "context switch", just done in your code instead of the OS
and that's not free (and likely more expensive than saving and restoring a set of registers)
I'm sure I'm skipping over tons of complexity here (HTTP keepalives binding clients to a single attachment host for example) because I'm no chat app developer, but the theoretical complexity is still relatively low.
But in the real world it is common to need information from the authentication stage to use in the authorization stage. For example, you may have a user log in with an email address/password which you then pass to an LDAP server in order to get a userId. This userId is then used in a database to determine which objects/groups they have access to.
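The login example above, sketched both ways in Java: once as plain sequential code and once as a future chain. `lookupUserId` and `loadGroups` are hypothetical in-memory stand-ins for the LDAP and database calls, not real client APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class LoginFlow {
    // Hypothetical stand-in for the LDAP lookup: credentials in, userId out.
    static String lookupUserId(String email, String password) {
        if (!"secret".equals(password)) {
            throw new IllegalArgumentException("bad credentials");
        }
        return "uid:" + email;
    }

    // Hypothetical stand-in for the database query that needs the userId.
    static List<String> loadGroups(String userId) {
        return Arrays.asList("readers", userId + ":private");
    }

    // Thread style: the second step simply uses the first step's result.
    static List<String> loginBlocking(String email, String password) {
        String userId = lookupUserId(email, password);
        return loadGroups(userId);
    }

    // Future style: the same dependency expressed with thenCompose.
    static CompletableFuture<List<String>> loginAsync(String email, String password) {
        return CompletableFuture
                .supplyAsync(() -> lookupUserId(email, password))
                .thenCompose(userId ->
                        CompletableFuture.supplyAsync(() -> loadGroups(userId)));
    }

    public static void main(String[] args) {
        System.out.println(loginBlocking("a@example.com", "secret"));
        System.out.println(loginAsync("a@example.com", "secret").join());
    }
}
```

Both versions carry the userId from the first step into the second; the difference is whether that hand-off is a local variable or a callback.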
So - C code running on the JVM via Sulong keeps C/C++ semantics. That probably means you can build pointers into the stack, and then I don't know what Loom would do. Right now they aren't integrated so I guess that's a research question.
My way out of depth idea with Sulong is that it uses small heap-allocated regions for every manual memory usage (it even has a Managed mode in Enterprise).
Sulong uses a standard C-style heap in the open source version. In EE they (can) trap malloc/free and re-point it towards the GCd heap. They also do bounds checking on pointer de-references. It's actually amazingly cool but unfortunately, EE is expensive enough in dollar terms that it gets ignored. I don't know of anything that uses it for real.
But having "only" tens of thousands of connections per client is rarely a problem in practice, apart from some load testing scenarios (such as the experiment here, where they opened a number of ports so they could test a large number of connections with a single client machine).
Very few people know it, but Oracle is developing an alternative to Loom in parallel. https://github.com/oracle/graal/pull/4114
BTW I expect Kotlin coroutines to leverage Loom eventually.
As for the tailrecursive keyword, it is not a constraint but a feature since it guarantees at the type level that this function cannot stack overflow. Few people know there is an alternative to tailrecursive that can make any function stack-overflow safe by leveraging the heap via continuations https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deep-re...
As for Java, there is universal support for tail recursion at the bytecode level https://github.com/Sipkab/jvm-tail-recursion
I've been using an IntelliJ extension that can do magic by rewriting recursive functions to stateful stack-based code for performance, but it spits out very ugly code:
https://github.com/andreisilviudragnea/remove-recursion-insp...
> "This inspection detects methods containing recursive calls (not just tail recursive calls) and removes the recursion from the method body, while preserving the original semantics of the code. However, the resulting code becomes rather obfuscated if the control flow in the recursive method is complex."
It was this guy's whole Bachelor thesis, I guess: https://github.com/andreisilviudragnea/remove-recursion-insp...
Only because the compiler does its magic behind the scenes and transforms it into bytecode that takes a lambda with a continuation. Try calling a suspend function from Java or starting a job and, surprise, it's continuations all the way down.
I think another commenter pointed out that they are still coloured though. Still, they're very cool - and you can use them for more than just lightweight threading.
> As for the tailrecursive keyword, it is not a constraint but a feature since it guarantees at the type level that this function cannot stack overflow
I'd say tailrecursive is a compiler feature (codegen the recursion into a loop) to work around a runtime constraint (no tail call optimisation).
The lack of tail call optimisation on the JRE means recursion is a lot less safe than in functional language runtimes which guarantee stacks don't overflow when you make tail calls.
> As for Java, there is universal support for tail recursion at the bytecode level.
Just a note here for other readers that there are several terms in play here.
I was talking about "tail calls" - when a function calls a function as its last operation - and I mentioned some annotations to do "tail recursion", which is a special case - when a function calls _itself_ as its last operation.
SemanticStrength is talking about "tail recursion" only here. The JVM bytecode can support tail recursion (tail calls on the same method), since we can use the same bytecode that is used for while loops, etc.
However, we cannot do safe tail recursion between different functions (yet), in the same way that we cannot have a loop spanning more than one function. Tail call optimisation is something that will hopefully come in Project Loom.
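A minimal Java sketch of the distinction, with made-up method names: the first method is tail recursive (it calls itself as its last operation), and the second is the loop a compiler could turn it into. The JVM won't do this for you, so the recursive form still consumes one stack frame per step.

```java
class TailRecursionDemo {
    // Tail-recursive form: the self-call is the last operation.
    // The JVM does not eliminate the call, so each step uses a stack frame.
    static long sumRecursive(long n, long acc) {
        if (n == 0) return acc;
        return sumRecursive(n - 1, acc + n);
    }

    // Loop form: the self-call is replaced by updating the parameters
    // and jumping back to the top, so stack usage is constant.
    static long sumLoop(long n, long acc) {
        while (true) {
            if (n == 0) return acc;
            acc = acc + n;
            n = n - 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumRecursive(1_000, 0));  // 500500
        System.out.println(sumLoop(10_000_000, 0));  // 50000005000000
    }
}
```

The rewrite only works because the call targets the same method. A tail call from one method to a different one (say, a mutually recursive isEven/isOdd pair) has no single loop body to jump back into, which is why general tail calls need runtime support.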
See the hard system call wrapping. This is just one option.
1. Demanding scalability for inappropriate projects and at any cost is something I've seen too, and on investigation it was usually related to former battle scars. A software system that stops scaling at the wrong time can be horrific for the business. Some of them never recover, the canonical example being MySpace, but I've heard of other examples that were less public. In finance entire multi-year IT projects by huge teams have failed and had to be scrapped because they didn't scale to even current business needs, let alone future needs. Emergency projects to make something "scale" because new customers have been on-boarded, or business requirements changed, are the sort of thing nobody wants to get caught up in. Over time these people graduate into senior management where they become architects who react to those bad experiences by insisting on making scalability a checkbox to tick.
Of course there's also trying to make easy projects more challenging, resume-driven development etc too. It's not just that. But that's one way it can happen.
2. Rx type models aren't just about the cost of threads. An abstraction over a stream of events is useful in many contexts, for example, single-threaded GUIs.
And sure, if you are living in a single-threaded environment, your choices are somewhat limited. I, personally, dislike front-end programming for exactly that reason - things like RxJS feel hideously overcomplicated to me. My guess is that most, though not all, will much prefer the Loom-style threading over async/await given free choice.
JEP 425 has been proposed to target JDK 19, out September 20. It will first be a "Preview" feature, which means supported but subject to change, and if all goes well would normally be out of Preview two releases, i.e. one year, after that.
> I'm not using one request per Java thread anyway
You don't have to, but note that only the thread-per-request model offers you world-class observability/debuggability.
> other than "ugh, this again".
OK, although in 2022 the Java platform is still among the most technologically advanced, state-of-the-art software platforms out there. It stands shoulder to shoulder with clang and V8 on compilation, and beats everything else on GC and low-overhead observability (yes, even eBPF).
The point is with Loom you can, and you can stop putting everything into a continuation and go back to straight-line code.
It's simpler and nicer, actually — and definitely offers better tooling and observability — especially with structured concurrency: https://download.java.net/java/early_access/loom/docs/api/jd...
Thanks for the response and the amazing work!
The point I was making is that Loom isn't released, stable, production ready, supported, etc., and there's still no date when it's supposed to be, so what you can do with Loom in no way affects what I can do with a production codebase, either new or legacy. I'm not sure how you missed that from my post.
I'm not defending reactive programming on the JVM. I'm also not defending threads as units of concurrency. I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM, and I can't reasonably pick Project Loom if I want something stable and supported by its creators.
September 20 (in Preview)
> I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM
Only sort of. The only languages offering something similar in terms of programming model are Erlang (/Elixir) and Go, both of which inspired virtual threads. But Erlang doesn't offer similar performance, and Go doesn't offer similar observability. Neither offers the same popularity.
Meanwhile, as you say, Erlang/Elixir gives me this model with 35+ years of history behind it (and no libraries/frameworks in use trying to provide me a leaky abstraction of something 'better'), better observability than the JVM, a safer memory model for concurrent code, a better model for reliability, with the main issue being the CPU hit (less of a concern for IO bound workloads, which is where this kind of concurrency is generally impactful anyway). Go has less observability than Java, sure, but a number of other tradeoffs I personally prefer (not least of all because in most of the Java shops I was in, I was the one most familiar with profiling and debugging Java. The tools are there, the experience amongst the average Java developer isn't), and will also be releasing twice between now and next year.
Again, I'm not saying virtual threads from Loom aren't cool (in fact, I said they were; the technical achievement of making it a drop-in replacement is itself incredible), or that it wouldn't be useful when it releases for those choosing Java, stuck with Java due to legacy reasons, or using a JVM language that is now able to migrate to take advantage of this to remove some of the impedance mismatch between their concurrency model(s) and Java's threading and the resulting caveats. Just that I don't care until it does (because I've been hearing about it for the past 4 years), it still doesn't put it on par with the models other languages have adopted (memory model matters to me quite a bit since I tend to care about correct behavior under load more than raw performance numbers; that said, of course, nothing is preventing people from adopting safer practices there...just like nothing has been in years previous. They just...haven't), nor do I care about the claims people make about it displacing X, Y, or Z. It probably will for new code! Whenever it gets fully supported in production. But there's still all that legacy code written over the past two decades using libraries and frameworks built to work around Java's initial 1:1 threading model, and which simply due to calling conventions and architecture (i.e., reactive, etc.) would have to be rewritten, which probably won't happen due to the reality of production projects, even if there were clear gains in doing so (which, as the great-grandparent mentions, is not nearly so clear-cut).
As to legacy code, Java programs have been using the thread-per-request model for over 25 years (there's been a lot of talk of reactive, but actual adoption is relatively low), and Java's threads were designed to be abstracted from day one (in fact, early versions of Java implemented them in user mode). So the right fit has been there all along. Migrating applications to use virtual threads requires relatively few changes because of those reasons, and because we designed them with easy adoption in mind. This particular experiment is about simple, "legacy" Java 1.0 code enjoying terrific scalability.
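As a rough illustration of why the migration can be small, here is a minimal thread-per-request sketch. Everything in it (the class, handle, serve) is hypothetical and only stands in for real request handling. The handler is plain blocking code, and the only thing that decides what kind of thread runs it is the ThreadFactory passed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadFactory;

class ThreadPerRequest {
    // Hypothetical request handler: straight-line, blocking code.
    static String handle(int requestId) throws InterruptedException {
        Thread.sleep(10); // stands in for blocking I/O
        return "response-" + requestId;
    }

    // One thread per request; the factory decides what kind of thread that is.
    static List<String> serve(int requests, ThreadFactory factory) {
        String[] results = new String[requests];
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            final int id = i;
            Thread t = factory.newThread(() -> {
                try {
                    results[id] = handle(id);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // join also makes each thread's write to results visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return Arrays.asList(results);
    }

    public static void main(String[] args) {
        // Platform (OS) threads; this runs on any JDK.
        List<String> responses = serve(100, Thread::new);
        System.out.println(responses.get(0));   // response-0
        System.out.println(responses.size());   // 100
        // Assumption: on a JDK with virtual threads enabled, passing
        // Thread.ofVirtual().factory() instead should be the only change needed.
    }
}
```

The handler and the stack traces it produces stay the same either way; only the factory argument changes.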
BTW, Java's observability has come a long way in recent years (largely thanks to JFR — Java Flight Recorder), and even Erlang's is no match for it, although Java still lags behind Erlang's hot-swapping capabilities.
[1]: BTW, I always find talk about the "average Java programmer" a bit out of touch. The top 1% of Java programmers, the experts, outnumber all Rust (or Haskell, or Erlang) programmers several times over, and there are many more reliable Java programs than reliable Erlang programs. The average Java (or Python, or JavaScript, the two other dominant languages these days) programmer is just the average programmer, period.