Debugging a memory leak in a Clojure service(charanvasu.com) |
If you have an actual memory leak in a JVM app, what you want is an exception called java.lang.OutOfMemoryError. This means the heap is full and has no space for new objects even after a GC run.
An OOMKilled means the JVM attempted to allocate memory from the OS but the OS doesn't have any memory available. The kernel then immediately kills the process. The problem is that the JVM at the time thinks that _it should be able to allocate memory_ - i.e. it's not trying to garbage collect old objects - it's just calling malloc for some unrelated reason. It never gets a chance to say "man I should clear up some space cause I'm running out". The JVM doesn't know the cgroup memory limit.
So how do you convince the JVM that it really shouldn't be using that much memory? It's...complicated. The big answer is -Xmx but there's a ton more flags that matter (-Xss, -XX:MaxMetaspaceSize, etc). Folks think that -XX:+UseContainerSupport fixes this whole thing, but it doesn't; there's no magic bullet. See https://ihor-mutel.medium.com/tracking-jvm-memory-issues-on-... for a good discussion.
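To make the point about -Xmx and friends concrete, here is a minimal sketch (class and method names are made up for illustration) that prints what the running JVM itself believes its limits are. Comparing this output with the container's memory limit is one way to spot a mismatch. It only covers the heap and the JVM-managed memory pools; thread stacks and other native allocations do not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not from the article: reports the JVM's own view of its memory limits.
public class JvmMemoryView {

    /** Maximum heap the JVM will try to use, in bytes (reflects -Xmx or the ergonomic default). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("max heap: %d MiB%n", maxHeapBytes() / (1024 * 1024));

        // Each pool (eden, old gen, Metaspace, code cache, ...) has its own cap.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            long max = usage.getMax(); // -1 means no defined maximum
            String cap = (max < 0) ? "unbounded" : (max / (1024 * 1024)) + " MiB";
            System.out.printf("%-40s type=%s max=%s%n", pool.getName(), pool.getType(), cap);
        }
    }
}
```

Pools that report "unbounded" (Metaspace often does unless -XX:MaxMetaspaceSize is set) are the ones that can grow past what the heap setting alone would suggest.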
To add to everything you said, depending on the type of framework you are using sometimes you don't even want it to do that. The JVM will try increasingly desperate measures, looped GC scans, ref processing, and sleeps with backoffs. With a huge heap, that can easily take hundreds to thousands of ms.
At scale, it's often better to just kill the JVM right away if the heap fills up. That way your open connections don't have all that extra latency added before the clients figure out something went wrong. Even if the JVM could recover this time, usually it will keep limping along and repeating this cycle. Obviously monitor, collect data, and determine the root cause immediately when that happens.
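HotSpot has flags for exactly this fail-fast behavior: -XX:+ExitOnOutOfMemoryError and -XX:+CrashOnOutOfMemoryError. As a rough sketch (the class and method names are hypothetical), a service could check at startup whether it was launched with one of them and log a warning if not. This only sees arguments reported by the runtime MXBean, so treat it as a sanity check rather than proof.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical startup check, not from the article.
public class OomPolicyCheck {

    /** True if the given JVM arguments ask the JVM to terminate on the first OutOfMemoryError. */
    static boolean exitsOnOom(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+ExitOnOutOfMemoryError")
                || jvmArgs.contains("-XX:+CrashOnOutOfMemoryError");
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (exitsOnOom(jvmArgs)) {
            System.out.println("JVM will terminate on OutOfMemoryError");
        } else {
            System.out.println("warning: JVM will try to keep running after OutOfMemoryError");
        }
    }
}
```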
That said trying to enforce overhead limits with RSS limits also won’t end well. Java doesn’t make guarantees around max allocated but unused heap space. You need something like this: https://github.com/bazelbuild/bazel/blob/10060cd638027975480... - but I have rarely seen something like that in production.
Folks insisting on using Java 11 or worse, Java 8, for containers are in for a surprise.
This is on OpenJDK; as a sibling comment points out, there are other JVMs as well.
1) Digging into Clojure library source code is unsettlingly easy. Clojure's core implementation has 2 layers: a pure Clojure layer (which is remarkably terse, readable and interesting) and a Java layer (which is more verbose). RT (Runtime) happens to be one of the main parts of the Java layer. The experience of looking into a clojure.core function and finding a 2-10 line implementation is normal.
2) Code maintenance is generally pretty easy. In this case the answer was "don't use eval" and I've had a lot of good experiences where the answer to a performance problem is similarly basic. The language tends to be responsible about using resources.
> Once we moved to other tasks, we started seeing the pods go OOMKilled. We took turns looking into the issue, but we couldn’t determine the exact cause.
As a particular “yay clojure” kind of moment.
This was an obscure bug/“feature” in the clojure standard library. That’s not normal, and having to dig into the clojure standard library, even if it is only a line or two, is certainly not something I’d be particularly calling out as standard practice or “easy” maintenance.
The standard library is for the most part enormously reliable.
You should almost never have to do this.
The best insight into the operation of the JVM is now obtained via a single mechanism, JFR (https://dev.java/learn/jvm/jfr/), the JDK's observability and monitoring engine. It records a whole lot of event types: https://sap.github.io/SapMachine/jfrevents/
See here for examples related to tracking memory: https://www.morling.dev/blog/tracking-java-native-memory-wit...
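For anyone who has not used JFR from code, a small sketch with the jdk.jfr API (JDK 11+) looks roughly like this. The class name, the output location, and the choice of the two GC events are my own assumptions for illustration; in production you would more likely start a recording with -XX:StartFlightRecording or jcmd.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Recording;

// Hypothetical example: record a few GC-related JFR events and dump them to a file.
public class JfrSketch {

    /** Runs a short recording around some throwaway allocations and returns the .jfr file. */
    static Path recordBriefly() {
        try {
            Path out = Files.createTempDirectory("jfr-sketch").resolve("sketch.jfr");
            try (Recording recording = new Recording()) {
                recording.enable("jdk.GarbageCollection");
                recording.enable("jdk.GCHeapSummary");
                recording.start();

                // Produce some garbage so there is something to observe.
                byte[] junk = null;
                for (int i = 0; i < 64; i++) {
                    junk = new byte[1024 * 1024];
                }
                System.gc(); // only a hint to the JVM

                recording.stop();
                recording.dump(out);
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("recording written to " + recordBriefly());
    }
}
```

The resulting file can be opened in JDK Mission Control or printed with the `jfr print` tool that ships with the JDK.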
1. I’m having a bit of trouble parsing this paragraph:
> The reason eval loads a new classloader every time is justified as dynamically generated classes cannot be garbage collected as long as the classloader is referencing to them. In this case, single classloader evaluating all the forms and generating new classes can lead to the generated class not being garbage collected.
To avoid this, a new classloader is being created every time, this way once the evaluation is done. The classloader will no longer be reachable and all it’s dynamically loaded class.
It sounds like the solution they adopted was to instantiate a brand new classloader each time a dynamic class is evaluated, rather than use a singleton classloader for the app’s lifetime.
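The class unloading rule in the quoted paragraph can be sketched in plain Java. This is not Clojure's actual code (Clojure has its own DynamicClassLoader); it is a toy with made-up names that builds the bytes of an empty class by hand and defines it in a fresh loader each time. A class can only be unloaded once its defining loader is unreachable, so one loader per evaluation lets each generated class go away with its loader, while one long-lived loader pins every class it ever defined.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.ref.WeakReference;

// Toy illustration only; all names here are hypothetical.
public class LoaderPerEval {

    /** Hand-built class file for an empty public class extending java.lang.Object. */
    static byte[] emptyClassBytes(String name) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                            // magic
            out.writeShort(0);                                   // minor version
            out.writeShort(52);                                  // major version (Java 8)
            out.writeShort(5);                                   // constant pool count (entries 1..4)
            out.writeByte(7); out.writeShort(2);                 // #1 Class -> #2
            out.writeByte(1); out.writeUTF(name.replace('.', '/')); // #2 Utf8 class name
            out.writeByte(7); out.writeShort(4);                 // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #4 Utf8 superclass name
            out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                                   // this_class
            out.writeShort(3);                                   // super_class
            out.writeShort(0);                                   // interfaces
            out.writeShort(0);                                   // fields
            out.writeShort(0);                                   // methods
            out.writeShort(0);                                   // attributes
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A loader meant to be used for one "evaluation" and then dropped. */
    static class OneShotLoader extends ClassLoader {
        Class<?> define(String name) {
            byte[] bytes = emptyClassBytes(name);
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Two fresh loaders can define the same class name and yield two unrelated classes. */
    static boolean freshLoadersGiveDistinctClasses() {
        Class<?> first = new OneShotLoader().define("Gen");
        Class<?> second = new OneShotLoader().define("Gen");
        return first != second && first.getName().equals(second.getName());
    }

    public static void main(String[] args) {
        System.out.println("distinct classes per loader: " + freshLoadersGiveDistinctClasses());

        // Nothing holds the loader or its class strongly after this line.
        WeakReference<ClassLoader> ref =
                new WeakReference<>(new OneShotLoader().define("Gen").getClassLoader());
        System.gc(); // only a hint, so the result below is not guaranteed
        System.out.println("loader collected after GC hint: " + (ref.get() == null));
    }
}
```

The flip side is what the article ran into: every call pays for a new loader and a new class, so calling eval in a hot path keeps generating classes and metadata that have to be loaded and later unloaded.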
That said, their little eval misadventure has alerted me to the details of how Clojure's eval works. I learned something today, thanks OP.
The only way to eliminate its massive cost is to code the way game programmers in managed languages do and not generate any garbage, in which case GC doesn't even help you very much.
What should be hard about app scalability and performance is scaling up the database and dealing with fundamental difficulties of distributed systems. What is actually hard in practice is dealing with the infinite tower of janky bullshit the Clean Code Uncle Bob people have forced us to build, which creates things like massive GC overhead that is impossible to eliminate without totally rewriting or redesigning the app.
Related discussion on SO: https://stackoverflow.com/questions/37109924/if-getter-sette...
Memory allocation/deallocation overhead is always present, just look at different allocators, fragmentation issues and so on. Using a GC is not intrinsically much different performance wise.
The initial Clojure implementation checked for an already created classloader and tried to reuse it. The code that did this has since been commented out.
Link to the code in the compiler: https://github.com/clojure/clojure/blob/clojure-1.11.0/src/j...
Not sure why they have that if statement that always evaluates to true:
if(true)//!LOADER.isBound())
I usually prefer using the GitHub permalink [1], as it is easy for the line number to go out of sync.
[1] https://github.com/clojure/clojure/blob/f376cf62bb0c30f72b0d...
I was confused too (and I may still be) but that's not how I understood their solution.
Their solution, IIUC from reading TFA, is that they simply didn't use eval at all anymore. So the whole "eval loads a new classloader" thinggy (so that it can be GC'ed later on) is totally moot.