Analyzing the codebase of Caffeine, a high performance caching library (adriacabeza.github.io)

268 points by synthc 1 year ago | 47 comments

jedberg 1 year ago |

It would be interesting to see this on reddit's workload. The entire system was designed around the cache getting a 95%+ hit rate, because basically anything on the front page of the top 1000 subreddits will get the overwhelming majority of traffic, so the cache is mostly filled with that.

In other words, this solves the problem of "one hit wonders" getting out of the cache quickly, but that basically already happened with the reddit workload.

The exception to that was Google, which would scrape old pages, and which is why we shunted them to their own infrastructure and didn't cache their requests. Maybe with this algo, we wouldn't have had to do that.

masklinn 1 year ago | |

Wouldn’t one hit wonders still be an issue? They might get evicted relatively fast anyway, but assuming an LRU, each one will still occupy a cache entry until it has moved through the entire list and finally gets evicted.

Although if that’s your concern you can probably just add a smaller admission cache in front of the main cache, possibly with a promotion memory.

JanecekPetr 1 year ago | | |

That's kind of the idea of Caffeine: it has admission buffers, and it adapts automatically between LRU and LFU. The original algorithm is called Window TinyLFU (design https://github.com/ben-manes/caffeine/wiki/Design), see it in action e.g. here: https://github.com/ben-manes/caffeine/wiki/Efficiency
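A minimal sketch of the admission idea, not Caffeine's actual implementation: a small count-min sketch estimates how often each key has been requested, and when the cache is full a new key only displaces the LRU victim if its estimated frequency is higher. The class names, the sketch dimensions, and the hashing scheme are assumptions made for illustration; the window cache and counter aging that the real design describes are left out.

```python
from collections import OrderedDict
import hashlib


class FrequencySketch:
    """Tiny count-min sketch: approximate per-key counts in fixed space."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]  # depth must be <= 8 here

    def _slots(self, key):
        # Slice one SHA-256 digest into an independent 4-byte hash per row.
        digest = hashlib.sha256(repr(key).encode()).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.width
            for i in range(len(self.rows))
        ]

    def increment(self, key):
        for row, slot in enumerate(self._slots(key)):
            self.rows[row][slot] += 1

    def estimate(self, key):
        return min(self.rows[row][slot] for row, slot in enumerate(self._slots(key)))


class AdmissionLRU:
    """LRU cache whose evictions are gated by a frequency comparison."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used first
        self.sketch = FrequencySketch()

    def get(self, key):
        self.sketch.increment(key)  # count every request, hit or miss
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

    def put(self, key, value):
        """Store the value; return False if the candidate was rejected."""
        if key in self.entries:
            self.entries[key] = value
            self.entries.move_to_end(key)
            return True
        if len(self.entries) < self.capacity:
            self.entries[key] = value
            return True
        victim = next(iter(self.entries))
        if self.sketch.estimate(key) > self.sketch.estimate(victim):
            del self.entries[victim]
            self.entries[key] = value
            return True
        return False  # one hit wonder loses to a more popular victim
```

With this gate in place, a crawler-style scan of keys that are each requested once should leave frequently requested entries where they are, instead of pushing them out one by one.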

adbachman 1 year ago | |

what are/were Reddit's top two or three cached structures / things?

guessing post bodies and link previews feels too easy.

comment threads? post listings?

was there a lot of nesting?

it sounds like you're describing a whole post--use message, comments, and all--for presentation to a browser or crawler.

(sorry, saw the handle and have so many questions :D)

NovaX 1 year ago | |

Doesn’t reddit use Cassandra, Solr, and Kafka, all of which use Caffeine?

jbellis 1 year ago |

Caffeine is a gem. Does what it claims, no drama, no scope creep, just works. I've used it in anger multiple times, most notably in Apache Cassandra and DataStax Astra, where it handles massive workloads invisibly, just like you'd want.

Shoutout to author Ben Manes if he sees this -- thanks for the great work!

plandis 1 year ago | |

Plus Ben made it extremely easy to migrate from Google Guava’s cache. It’s mostly the same API, and Caffeine is way more performant.

NovaX 1 year ago | |

Thanks Jonathan!

hinkley 1 year ago |

Years ago I encountered a caching system that I misremembered as being a plugin for nginx and thus was never able to track down again.

It had a clever caching algorithm that favored latency over bandwidth. It weighted hit count versus size, so that given limited space, it would rather keep two small records that had more hits than a large record, so that it could serve more records from cache overall.
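A rough sketch of that kind of policy, assuming the score is simply hits divided by size in bytes (the system I'm remembering may well have used a different formula). The class name and the byte-based capacity are made up for illustration.

```python
class HitsPerByteCache:
    """Evicts the entry with the fewest hits per byte when space runs out."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = {}  # key -> [value, size, hits]

    def _score(self, key):
        _, size, hits = self.entries[key]
        return hits / max(size, 1)  # guard against empty payloads

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[2] += 1
        return entry[0]

    def put(self, key, value):
        size = len(value)
        if size > self.capacity_bytes:
            return  # can never fit, so don't evict anything for it
        if key in self.entries:
            # Replacing a value restarts its hit count in this sketch.
            self.used_bytes -= self.entries.pop(key)[1]
        while self.used_bytes + size > self.capacity_bytes:
            victim = min(self.entries, key=self._score)
            self.used_bytes -= self.entries.pop(victim)[1]
        self.entries[key] = [value, size, 1]
        self.used_bytes += size
```

Given an 80-byte record and two 10-byte records with the same hit count, the 80-byte one has the lowest score and goes first, which is the "two small records over one large record" behaviour.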

For some workloads the payload size is relatively proportional to the cost of the request - for the system of record. But latency and request setup costs do tend to shift that a bit.

But the bigger problem with LRU is that some workloads eventually resemble table scans, and the moment the data set no longer fits into cache, performance falls off a very tall cliff. And not just for that query but now for all subsequent ones as it causes cache misses for everyone else by evicting large quantities of recently used records. So you need to count frequency not just recency.

YZF 1 year ago | |

For every caching algorithm you can design an adversarial workload that will perform poorly with the cache. Your choice of caching algorithm/strategy needs to match your predicted workload. As you're alluding to, there's also the question of which resource you are trying to optimize for: minimizing processing time might be a little different from optimizing for bandwidth.

hinkley 1 year ago | | |

If you have to refetch on a cache miss you're going to be doing both. But all optimizations are always playing with the trigraph of cpu time, memory, and IO (with the hidden fourth dimension of legibility), so I don't think you're saying anything that can't be assumed as given. Even among people who tend to pick incorrectly, or just lose track of when the situation has changed.

thomastay 1 year ago |

> However, diving into a new caching approach without a deep understanding of our current system seemed premature

Love love love this - I really enjoy reading articles where people analyze existing high performance systems instead of just going for the new and shiny thing

dan-robertson 1 year ago |

Near the beginning, the author writes:

> Caching is all about maximizing the hit ratio

A thing I worry about a lot is discontinuities in cache behaviour (simple example: let’s say a client polls a list of entries, and downloads each entry from the list one at a time to see if it is different. Obviously this feels like a bit of a silly way for a client to behave. If you have a small lru cache (eg maybe it is partitioned such that partitions are small and all the requests from this client go to the same partition) then there is some threshold size where the client transitions from ~all requests hitting the cache to ~none hitting the cache.)
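To make the threshold concrete, here is a small simulation under the assumptions above: a plain LRU cache and a client that fetches entries 0..n-1 in order, over and over. The function name and the round count are arbitrary.

```python
from collections import OrderedDict


def lru_hit_ratio(capacity, num_entries, rounds=20):
    """Hit ratio of an LRU cache under repeated in-order polling."""
    cache = OrderedDict()  # least recently used first
    hits = total = 0
    for _ in range(rounds):
        for key in range(num_entries):
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


for n in (90, 100, 101, 110):
    print(n, lru_hit_ratio(100, n))
```

With a capacity of 100, the ratio is 0.95 for 90 or 100 entries (only the first round misses) and 0.0 for 101 or 110 entries, because each entry is evicted just before the client comes back to it. One extra list entry is the whole difference.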

This is a bit different from some behaviours always being bad for cache (eg a search crawler fetches lots of entries once).

Am I wrong to worry about these kinds of ‘phase transitions’? Should the focus just be on optimising hit rate in the average case?

quotemstr 1 year ago |

Huh. Their segmented LRU setup is similar to the Linux kernel's active and inactive lists for pages. Convergent evolution in action.

NovaX 1 year ago | |

I tried to reimplement Linux’s algorithm in [1], but I cannot be sure about correctness. They adjust the fixed sizes at construction based on the device’s total memory, so it varies between, say, a phone and a server. This fast trace simulation in the CI [2] may be informative (see DClock). Segmentation is very common, where algorithms differ by how they promote and how/if they adapt the sizes.
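For readers who have not seen segmentation before, a generic two-segment LRU can be sketched as below. This is a textbook-style illustration with fixed sizes and promote-on-first-hit, not the code in the simulator or the kernel; the class and method names are made up.

```python
from collections import OrderedDict


class SegmentedLRU:
    """New entries start on probation; a hit promotes them to protected."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()  # least recently used first
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self.protected[key] = value  # promote on first re-use
            if len(self.protected) > self.protected_size:
                # Demote the coldest protected entry instead of evicting it.
                old_key, old_value = self.protected.popitem(last=False)
                self._add_probation(old_key, old_value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation[key] = value
            self.probation.move_to_end(key)
        else:
            self._add_probation(key, value)

    def _add_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # the only real eviction point
```

Entries that are only ever inserted churn through the probation segment, so a scan cannot touch anything that has been promoted. The promotion rule and whether the segment sizes adapt are exactly where real implementations diverge.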

[1] https://github.com/ben-manes/caffeine/blob/master/simulator/...

[2] https://github.com/ben-manes/caffeine/actions/runs/130865965...

nighthawk454 1 year ago |

Seems to be hugged, so here's a cached view

https://archive.is/w8yFG

https://web.archive.org/web/20250202094451/https://adriacabe... (images are cached better here)

dstroot 1 year ago |

Codebase has >16k stars on GitHub and only 1 open issue, and 3 open PRs. Never seen that before on a highly used codebase. Kudos to the maintainer(s).

bean-weevil 1 year ago | |

I went through some of the issues to see how aggressively they close them and found this gem: https://github.com/ben-manes/caffeine/issues/1824#issuecomme...

ketzo 1 year ago | | |

Damn, I need that framed over my desk.

Lord_Zero 1 year ago | |

I haven't looked, but stalebot can make repos look squeaky clean when in reality issues are ignored and then closed without being addressed.

calpaterson 1 year ago | | |

Sparing everyone else a browse of the bugtracker: the maintainer does not seem to use a bot to autoclose issues. The closed issues appeared to be actually closed, and it seemed from a quick glance that he actually investigated each filing.

FridgeSeal 1 year ago | | |

Google-Adjacent OSS repos are the worst for this!

Passive-Aggressive Stalebot marches into the middle of an _active_ thread and announces that the issue is now stale/dead and everyone should go away.

I get that on large projects there’s a non-trivial percentage of issues which amount to “I’m holding it wrong, didn’t actually read the log messages, or the manual, fix it for me pls” which are just unhelpful noise. However, more often than not the bots take every other thread, including important ones, down with them.

homebrewer 1 year ago | |

kitty is very close, which is impressive when you remember that the vast majority of the work is done by one guy (Kovid Goyal).

https://github.com/kovidgoyal/kitty/issues — 0.239% vs 0.137%

https://github.com/kovidgoyal/kitty/issues — 0.729% vs 0.317%

https://github.com/kovidgoyal/kitty/graphs/contributors

jupiterroom 1 year ago |

really random question - but what is used to create the images in this blog post? I see this style quite often but never been able to track down what is used.

itishappy 1 year ago | |

https://excalidraw.com/

https://d2lang.com/

https://www.drawio.com/

For something a bit lower level, try:

https://roughjs.com/

It's what powers the sketch-like look from many of the sites above.

atombender 1 year ago | |

I suspect they used Excalidraw [1]. It's a nice, quick tool for this kind of sketching, and supports collaborative drawing.

[1] https://excalidraw.com/

jupiterroom 1 year ago | |

i.e these - https://adriacabeza.github.io/img/tinylfu.png

syct 1 year ago | | |

Try this https://excalidraw.com/

homarp 1 year ago | |

if you look at the metadata of each png, you will find the application/vnd.excalidraw+json field that contains the image in excalidraw format.
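One way to check this without extra tools is to list the PNG's tEXt chunks with the standard library. This is a minimal sketch: the function name is made up, CRCs are not verified, and it only reports what is stored (the payload itself may be encoded or compressed).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data):
    """Return {keyword: text} for every tEXt chunk in the given PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = {}
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        body = data[offset + 8:offset + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        offset += 8 + length + 4
        if chunk_type == b"IEND":
            break
    return found


# Example (the file name is hypothetical):
# with open("tinylfu.png", "rb") as f:
#     print(list(png_text_chunks(f.read())))
```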

synthc 1 year ago |

Interesting deep dive on the internals of Caffeine, a widely used JVM caching library.

urbandw311er 1 year ago |

Caffeine is also the name of a macOS utility to stop the screen going to sleep. Be great if whichever came second could consider a name change.

unification_fan 1 year ago | |

Actually, Caffeine is the name of a hello world script I wrote back in the 80s, so...