Debugging a memory leak in a Clojure service

Debugging a memory leak in a Clojure service (charanvasu.com)

73 points by whiteros_e 1 year ago | 21 comments

If your Clojure pods are getting OOMKilled, you have a misconfigured JVM. The code (e.g. eval or not) mostly doesn't matter.

If you have an actual memory leak in a JVM app, what you want is an exception called java.lang.OutOfMemoryError. This means the heap is full and has no space for new objects even after a GC run.

OOMKilled means the JVM attempted to allocate memory from the OS but the OS (or, in a pod, the container's cgroup) has no memory left to give. The kernel then immediately kills the process. The problem is that the JVM at the time thinks that _it should be able to allocate memory_ - i.e. it's not trying to garbage collect old objects - it's just calling malloc for some unrelated reason. It never gets a chance to say "man I should clear up some space cause I'm running out". The JVM doesn't know the cgroup memory limit.

So how do you convince the JVM that it really shouldn't be using that much memory? It's...complicated. The big answer is -Xmx but there's a ton more flags that matter (-Xss, -XX:MaxMetaspaceSize, etc). Folks think that -XX:+UseContainerSupport fixes this whole thing, but it doesn't; there's no magic bullet. See https://ihor-mutel.medium.com/tracking-jvm-memory-issues-on-... for a good discussion.
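To see which limits the running JVM actually believes it has, a minimal sketch like the one below can help. It uses only the standard java.lang.management API; the class name JvmMemoryBudget is made up for illustration. It prints the heap ceiling and every memory pool the JVM reports, which makes it easier to see that -Xmx covers only part of the footprint.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of any library.
class JvmMemoryBudget {

    /** Largest heap the JVM will try to use, in bytes (reflects -Xmx or the default). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One line per memory pool (heap and non-heap) with its current use and limit. */
    static String describePools() {
        StringBuilder out = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the pool has no defined upper bound.
            String max = usage.getMax() < 0 ? "unbounded" : (usage.getMax() >> 20) + " MiB";
            out.append(String.format("%-16s %-34s used=%d MiB max=%s%n",
                    pool.getType(), pool.getName(), usage.getUsed() >> 20, max));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + (maxHeapBytes() >> 20) + " MiB");
        System.out.print(describePools());
    }
}
```

Thread stacks and other native allocations do not show up as pools, so the pod's RSS can still be higher than the sum shown here. On HotSpot, Native Memory Tracking (-XX:NativeMemoryTracking=summary plus jcmd <pid> VM.native_memory summary) gives a fuller picture.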

positr0n 1 year ago | |

> It never gets a chance to say "man I should clear up some space cause I'm running out".

To add to everything you said, depending on the type of framework you are using sometimes you don't even want it to do that. The JVM will try increasingly desperate measures, looped GC scans, ref processing, and sleeps with backoffs. With a huge heap, that can easily take hundreds to thousands of ms.

At scale, it's often better to just kill the JVM right away if the heap fills up. That way your open connections don't have all that extra latency added before the clients figure out something went wrong. Even if the JVM could recover this time, usually it will keep limping along and repeating this cycle. Obviously monitor, collect data, and determine the root cause immediately when that happens.
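On HotSpot the simplest way to get this behavior is a flag: -XX:+ExitOnOutOfMemoryError (or -XX:+CrashOnOutOfMemoryError if you want a crash dump). Where changing flags isn't an option, a rough in-process approximation looks like the sketch below. The class and method names are made up, and it only catches errors that reach the default uncaught-exception handler, so a framework that catches Throwable will hide the error from it.

```java
// Hypothetical helper, not part of any library.
class FailFastOnOom {

    /** True if the throwable or anything in its cause chain is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Installs a default handler that stops the JVM as soon as an OOM goes uncaught. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOutOfMemory(error)) {
                // halt() skips shutdown hooks, which may themselves need memory.
                Runtime.getRuntime().halt(3);
            }
            error.printStackTrace();
        });
    }

    public static void main(String[] args) {
        install();
        System.out.println("fail-fast OOM handler installed");
    }
}
```

The exit code 3 is arbitrary. The flag is the more reliable option because it acts inside the JVM no matter who catches the error.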

NightMKoder 1 year ago | | |

Of course you’re right and you really want to avoid getting to GC thrashing. IMO people still miss the old -XX:+UseGCOverheadLimit on the new GCs.

That said, trying to enforce overhead limits with RSS limits also won’t end well. Java doesn’t make guarantees around max allocated but unused heap space. You need something like this: https://github.com/bazelbuild/bazel/blob/10060cd638027975480... - but I have rarely seen something like that in production.
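One way to get an "is the heap still nearly full right after a collection" signal with the standard java.lang.management API is a collection usage threshold. This is a sketch under assumptions, not a reproduction of the linked code: the class name RetainedHeapWatch is made up, and picking the largest bounded heap pool as a stand-in for the old generation is a heuristic.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;

// Hypothetical helper, not part of any library.
class RetainedHeapWatch {

    /**
     * Runs onExceeded when the chosen pool is still at least `fraction` full
     * after a GC. Returns the watched pool's name, or null if none qualifies.
     */
    static String arm(double fraction, Runnable onExceeded) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        MemoryPoolMXBean target = null;
        long targetMax = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null || pool.getType() != MemoryType.HEAP
                    || !pool.isCollectionUsageThresholdSupported()) {
                continue;
            }
            // Assumption: the largest bounded heap pool is the old generation.
            if (usage.getMax() > targetMax) {
                target = pool;
                targetMax = usage.getMax();
            }
        }
        if (target == null) {
            return null;
        }
        target.setCollectionUsageThreshold((long) (targetMax * fraction));

        // The platform MemoryMXBean emits the threshold notifications.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                    .equals(notification.getType())) {
                onExceeded.run();
            }
        }, null, null);
        return target.getName();
    }

    public static void main(String[] args) {
        String pool = arm(0.9, () -> System.err.println("heap still >= 90% full after GC"));
        System.out.println("watching pool: " + pool);
    }
}
```

In production the callback would more likely log, dump diagnostics, and exit rather than just print.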

pwagland 1 year ago | |

This is one of the areas where OpenJ9 does things a lot better than HotSpot. OpenJ9 uses one memory pool for _everything_, HotSpot has a dozen different memory pools for different purposes. This makes it much harder to tune HotSpot in containers.

pjmlp 1 year ago | |

It also depends on which JVM version is being used. As a key guideline, use the latest version, or at least the latest LTS.

Folks insisting on using Java 11 or, worse, Java 8 for containers are in for a surprise.

This applies to OpenJDK; as a sibling comment points out, there are other JVMs as well.

roenxi 1 year ago |

This article showcases 2 harder-to-articulate features of Clojure:

1) Digging into Clojure library source code is unsettlingly easy. Clojure's core implementation has 2 layers - a pure Clojure layer (which is remarkably terse, readable and interesting) and a Java layer (which is more verbose). RT (Runtime) happens to be one of the main parts of the Java layer. The experience of looking into a clojure.core function and finding a 2-10 line implementation is normal.

2) Code maintenance is generally pretty easy. In this case the answer was "don't use eval" and I've had a lot of good experiences where the answer to a performance problem is similarly basic. The language tends to be responsible about using resources.

wokwokwok 1 year ago | |

While both of those things are true, I’d be hesitant to call out this:

> Once we moved to other tasks, we started seeing the pods go OOMKilled. We took turns looking into the issue, but we couldn’t determine the exact cause.

As a particular “yay clojure” kind of moment.

This was an obscure bug/“feature” in the Clojure standard library. That’s not normal, and having to dig into the Clojure standard library, even if it is only a line or two, is certainly not something I’d be particularly calling out as standard practice or “easy” maintenance.

The standard library is for the most part enormously reliable.

You should almost never have to do this.

pron 1 year ago |

> -XX:+TraceClassLoading -XX:+TraceClassUnloading

The best insight into the operation of the JVM is now obtained via a single mechanism, JFR (https://dev.java/learn/jvm/jfr/), the JDK's observability and monitoring engine. It records a whole lot of event types: https://sap.github.io/SapMachine/jfrevents/

See here for examples related to tracking memory: https://www.morling.dev/blog/tracking-java-native-memory-wit...
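For class loading specifically, JFR can also be driven from code. The sketch below is a minimal example, assuming a JDK that ships the jdk.jfr module; the class name ClassLoadRecorder is made up. It enables the jdk.ClassLoad and jdk.ClassUnload events around a piece of work and reads the recorded events back.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper, not part of any library.
class ClassLoadRecorder {

    /** Runs the workload under a JFR recording and returns the recorded events. */
    static List<RecordedEvent> record(Runnable workload) {
        try {
            Path file = Files.createTempFile("classload", ".jfr");
            try (Recording recording = new Recording()) {
                // These events are off by default, so enable them explicitly.
                recording.enable("jdk.ClassLoad");
                recording.enable("jdk.ClassUnload");
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
                return RecordingFile.readAllEvents(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Touching a class that is probably not loaded yet gives JFR something to record.
        List<RecordedEvent> events = record(() -> new java.util.concurrent.Phaser());
        long loads = events.stream()
                .filter(e -> e.getEventType().getName().equals("jdk.ClassLoad"))
                .count();
        System.out.println("class load events recorded: " + loads);
    }
}
```

A steadily growing count of load events with no matching unloads, request after request, is the kind of pattern the old trace flags were used to spot.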

ayewo 1 year ago |

Interesting article.

1. I’m having a bit of trouble parsing this paragraph:

> The reason eval loads a new classloader every time is justified as dynamically generated classes cannot be garbage collected as long as the classloader is referencing to them. In this case, single classloader evaluating all the forms and generating new classes can lead to the generated class not being garbage collected.

> To avoid this, a new classloader is being created every time, this way once the evaluation is done. The classloader will no longer be reachable and all it’s dynamically loaded class.

It sounds like the solution they adopted was to instantiate a brand new classloader each time a dynamic class is evaluated, rather than use a singleton classloader for the app’s lifetime.
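The underlying JVM rule is that a class lives as long as the loader that defined it, so a class defined in a throwaway loader can be unloaded once nothing references that loader or the class. The sketch below illustrates only that rule. It is not Clojure's implementation: it uses java.lang.reflect.Proxy as a stand-in for eval-generated classes, and the class name FreshLoaderPerEval is made up.

```java
import java.lang.ref.WeakReference;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical illustration, not Clojure's eval.
class FreshLoaderPerEval {

    /** Defines a generated class inside a brand-new class loader and returns it. */
    static Class<?> defineInFreshLoader() {
        ClassLoader loader = new ClassLoader(FreshLoaderPerEval.class.getClassLoader()) {};
        InvocationHandler noop = (proxy, method, args) -> null;
        // Proxy generates a class at runtime and defines it in the loader we pass in.
        return Proxy.newProxyInstance(loader, new Class<?>[] {Runnable.class}, noop).getClass();
    }

    public static void main(String[] args) {
        Class<?> first = defineInFreshLoader();
        Class<?> second = defineInFreshLoader();
        System.out.println("distinct classes: " + (first != second));

        // Keep only a weak reference, then drop the strong ones.
        WeakReference<ClassLoader> loaderRef = new WeakReference<>(first.getClassLoader());
        first = null;
        second = null;
        System.gc(); // only a hint; collection is not guaranteed
        System.out.println("first loader collected: " + (loaderRef.get() == null));
    }
}
```

With a single long-lived loader, every generated class would stay reachable through it for the life of the process, which is the situation the quoted paragraph describes.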

Sarkie 1 year ago |

9 times out of 10, it's the classloader and a newInstance call on every request.

MBlume 1 year ago |

The article makes it sound like the system was using eval (probably on a per-request basis, not just on start-up), and also like ceasing to use eval was pretty trivial once they realized eval was the problem. I'd be curious why they were using eval and what they were able to do instead.

adityaathalye 1 year ago | |

My thoughts exactly... off-label Eval usage.

That said, their little eval misadventure has alerted me to the details of how Clojure's eval works. I learned something today, thanks OP.

henning 1 year ago |

If you can go from ~60ms p99 response times to ~45 from reduced garbage collection, that means GC has a major impact on user-perceptible performance in your application and proves that it is an extremely expensive operation that should be carefully managed. If you have a modern microservices Kubernetes blah blah bullshit setup, this fraud detection service is probably only one part of a chain of service calls that occurs during common user operations at this company. How much of the time users wait for a few hundred bytes of actual text to load on screen is spent waiting for multiple cloud instances to GC?

The only way to eliminate its massive cost is to code the way game programmers in managed languages do and not generate any garbage, in which case GC doesn't even help you very much.

What should be hard about app scalability and performance is scaling up the database and dealing with fundamental difficulties of distributed systems. What is actually hard in practice is dealing with the infinite tower of janky bullshit the Clean Code Uncle Bob people have forced us to build, which creates things like massive GC overhead that is impossible to eliminate without totally rewriting or redesigning the app.

whiteros_e 1 year ago | |

I've read somewhere that because of these getter/setter patterns, JVM authors had to optimise their JIT to detect and inline them.

mikmoila 1 year ago | | |

There is nothing special about getters and setters; the runtime sees them as methods and may optimize them as it would any other method.

ysleepy 1 year ago | |

They are only guessing that the performance difference is due to GC; generating code on the fly and compiling it to classes in your hot path is probably not cheap either.

Memory allocation/deallocation overhead is always present; just look at different allocators, fragmentation issues and so on. Using a GC is not intrinsically much different performance-wise.