Cache Eviction: When Are Randomized Algorithms Better Than LRU? (2014)

Cache Eviction: When Are Randomized Algorithms Better Than LRU? (2014)(danluu.com)

235 points by zwliew 7 years ago | 42 comments

kazinator 7 years ago |

> When Are Randomized Algorithms Better Than LRU?

When the access pattern is random: recently used items are not more likely to be accessed than other items.

Or worse: the access pattern is "anti-recent": items recently accessed are less likely to be accessed again than other items.

This is undergraduate stuff I once had to know for exams in machine architectures and operating systems.

The choice of LRU replacement is favorable when there exists "locality of reference"; that's what it's predicated upon.

All of the replacement policy choices are only substitutes for a "magic oracle": a replacement strategy which can peer into the future and know which items will be accessed soonest. With a magic oracle, we can not only retain those items in the cache in preference to other items, to great advantage, but we can automatically prefetch the correct items that are not yet in the cache but are about to be accessed.

In general, randomized algorithms provide a defense against situations in which things don't work out according to the best case assumptions. For instance, they help defend against tickling worst cases in algorithms (e.g. sorting) that have good average-case behavior.

chacham15 7 years ago | |

This comment is dangerously trivializing the problem. Even in cases where the access pattern is contiguous, not "anti-recent", etc. LRU can still fail to perform. For example, if you simply iterate over a contiguous list of 5 items with a cache size of 4, you will have a 0% hit rate with LRU. With 2 random, it would be > 0% because there is a chance you wouldnt evict the item in exactly the wrong time. The main takeaway from the article is that it is this mismatch between cache sizes which can cause problems and is evident by the data. In L1+L2 caches, LRU outperforms, but in L3 2-random outperforms because at that point, you are more likely to hit this exact case.

kazinator 7 years ago | | |

Repeatedly iterating over a sequence items is in fact "anti-recent". Round-robin cycling is probably the simplest possible algorithm for accessing a set of items such that, at any time, the item that is accessed next is the one that was least recently visited. Proof: since that item was last visited, every other item in the set was visited.

stephen422 7 years ago | | |

Your loop example is more akin to the "anti-recent" pattern: the most recently accessed item will not be accessed again sooner than any of the four other items. Contiguous patterns may well be anti-recent, whether an access is "recent" or not should be considered in relation to the size of the cache.

carlmr 7 years ago | |

>>When Are Randomized Algorithms Better Than LRU?

>When the access pattern is random: recently used items are not more likely to be accessed than other items.

Wouldn't they perform equally well in this case? There's no advantage in LRU, but also no disadvantage if they're really unpredictably random.

adrianN 7 years ago | | |

A random strategy has less overhead than LRU because there is no bookkeeping.

taeric 7 years ago | |

Even when the access pattern is not random, random can provide some benefits.

It can prevent any otherwise harmless sweep/scan of the system from completely busting your entire cache for all of the other uses.

okl 7 years ago |

> If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.

That's the key take-away from my experience with optimizing control loops for microcontrollers with small caches. As soon as your loop is too bit, your code suddenly runs at a hundredth of the speed.

eternalban 7 years ago | |

I am not sure I understand what you mean by the size of the "loop".

Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches, you will see the performance hit you mention (given that accessing the associated LL node and reordering the list will most likely result in two cache misses).

Try libclc [1] for a compact segmented cache of n 64B containers that exactly fit a cache line, with each container having its own eviction policy. The overhead of LRU per container (of 7 data lines) is 9.14 bits/entry and the same cache line access will give you both the data lines and the LRU order.

[1]: https://github.com/alphazero/libclc

blattimwind 7 years ago | | |

> Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches

He refers to the LRU-like policy of the I/D cache itself.

porpoisely 7 years ago | |

It goes beyond microcontrollers to every step up the chain. You want to optimize you system to hit the cache at every level as much as possible. The processor and its cache. The database and its cache. Web servers and their cache. Because going outside of the cache at any level usually incurs a performance hit orders of magnitude worse.

vesinisa 7 years ago |

Interesting read, but I wish he'd explain how 2-random and LRU differ better. In a nutshell, LRU tracks the time when a cache item was last accessed and always discards the oldest item. 2-random also tracks the age of all cache items, but when discarding picks two random items and discards the oldest of those two.

wizeman 7 years ago | |

It doesn't discard the oldest item, it discards the item that was least recently used/accessed (LRU, like the name says). To see the difference more clearly, note that the oldest item can in fact be the most recently used one.

boyter 7 years ago |

Love reading this. It has always been one of those interesting things I kept in the back of my mind in my day to day.

I was very excited when I actually got to implement it on a real world project.

I was writing a scale out job which used ffmpeg to make clips of video files. To speed it up I kept the downloaded files (which could be 150 GB in size) as a local cache. Quite often a clip is made of the same file. When the disk was full (there was a separate disk for download and clip output) selected two of the downloaded files randomly and deleted the older one. Loop till there was enough disk space, or no files.

It's something I thought I would never actually get to implement in the real word, and thus far is working very well, the caching speeds things up and the eviction seems to avoid too many cache misses.

zodvik 7 years ago | |

2-random was also used in this HDFS change - https://issues.apache.org/jira/browse/HDFS-8131.

Default block placement policy chose datanodes randomly for new blocks. This change selects 2 datanodes randomly and picks the one with lower used disk space.

pvorb 7 years ago |

Caffeine, a caching library for Java, has a nice write-up on more advanced cache eviction strategies. https://github.com/ben-manes/caffeine/wiki/Efficiency

harryf 7 years ago |

Worth reading http://open-zfs.org/wiki/Performance_tuning about the Adaptive Replacement Cache algorithm ( https://en.m.wikipedia.org/wiki/Adaptive_replacement_cache ) which is less vulnerable to cache flushes than LRU

shereadsthenews 7 years ago | |

The only big gain of ARC is the adaptive sizing. If you can live with static sizing you will do just as well with 2Q, which is not patented.

I haven't used a true LRU in quite a while because the coordinated bookkeeping required is as disaster on multithreaded programs. ARC and 2Q can be largely lock-free.

NovaX 7 years ago | | |

Can you explain how ARC can be more easily lock-free than LRU? ARC has to coordinate 4 doubly-linked lists and a pivot on every access.

npstr 7 years ago |

An algorithm that's more optimal than LRU on most workloads is Window TinyLFU. Some nice simulations and graphs can be found here: https://github.com/ben-manes/caffeine/wiki/Efficiency

Scaevolus 7 years ago | |

LHD is more optimal than LFU on most workloads. It's harder to implement, but more accurately computes the actual expected value of each cache item. https://www.cs.cmu.edu/~beckmann/publications/papers/2018.ns...

6keZbCECT2uB 7 years ago | | |

I glanced through the paper and didn't see a comparison to TinyLfu. Unfortunately, it seems to be a misleading name which would make one think it was simply an LFU algorithm. The description in caffeine seems to indicate that it's more of an LRU algorithm with an LFU admission heuristic.

If the overhead of the cache itself is significant, as in kernel virtual memory management. A clock based algorithm is usually preferable which uses an array to approximate node based structures. ARC and LIRS have CAR and CLOCK-Pro respectively to simulate them. CAR is vulnerable to re-use distances near the boundary of the cache size. [1] CLOCK-Pro is widely used, notably in Linux VM. [1] https://www.slideshare.net/huliang64/clockpro

NovaX 7 years ago | | |

The LHD paper lacks a robust analysis against different workloads. The few traces they use are either private or not considered representative workloads. For MSR, the provider (SNIA) specifically states: "WARNING: These traces are over 10 years old! They should not be used for modern research!".

It would be much more interesting if they used the traces from ARC and LIRS, perhaps also UMass and Wikipedia. These traces are publically available and used by multiple papers from a variety of groups.

I have not seen a sampling-based policy that outperforms a policy that has tracks history outside of its working set. Since the examples specifically avoided that comparison, I suspect that it under performs in comparison. That isn't to say it isn't a useful advancement, but that a more robust analysis is needed to validate their claims.

kccqzy 7 years ago |

Randomization can really do wonders at times. There are some algorithms for which the most efficient algorithm is essentially using randomness in a clever way. The Miller–Rabin primality test comes to mind. Or the randomized minimum cut problem. And even when deterministic algorithms have the same big-O asymptotic complexity as randomized algorithms, the randomized ones often perform better due to simpler code. Treap vs red-black tree is a classic example.

iampims 7 years ago |

It’s also working very well for load-balancing.

https://brooker.co.za/blog/2012/01/17/two-random.html

antirez 7 years ago |

Note that LFU does not have the same limits and will adapt to access patterns that are pathological for LRU like circular accesses. LRU is an approximation of LFU based on the idea that last access time is related to the future access frequency of an object. The pathological cases where LRU fails and random eviction is better is related to the fact that such access patterns violate such premise. LFU does not have such limits because it attempts to measure the actual access frequency (smoothed by time) of an object, and uses that as prediction of the future frequency.

hinkley 7 years ago | |

I have just started fiddling with treaps recently. Not applicable I suspect for a hardware cache but this thread has morphed a bit into conversations about software caches as well.

A treap is an unbalanced binary tree where the tree depth is proportional to the frequency of use to get fewer indirections per access (assuming an uneven distribution of lookups). Essentially it has a MFU or LFU queue built into it and now I'm curious about how it would behave as a cache data structure.

p1necone 7 years ago |

The article never explains what the difference between 2-random and pseudo 2-random is, does anyone know?

j16sdiz 7 years ago | |

Pseudo 2-Random use Pseudo LRU. Check the Wikipedia for Pseudo LRU.

It is a tree algorithm for tracking the approximate last use time.

drallison 7 years ago |

It would be interesting to compare the behavior of real workloads with different cache protocols. In a "real workload" the sequence of reads and writes to memory (and the cache) come from a multiplicity of independent processes functioning co-sequentially with some distribution of lifetimes of cache ownership. In many cases, the caches will be in a non-equilibrium state and behaving badly.

phyzome 7 years ago |

Note that Redis's LRU algorithm is randomized: https://redis.io/topics/lru-cache

> The reason why Redis does not use a true LRU implementation is because it costs more memory.

...but it sounds like it wouldn't necessarily be totally desirable to use true LRU anyhow. :-)

Labo333 7 years ago |

The experiments are interesting but they never explained what the strategies are. It's OK for LRU and FIFO but certainly not for 2-random!

known 7 years ago |

Random boosting could have saved Mars pathfinder https://en.wikipedia.org/wiki/Priority_inversion