Perks – Effectively compute quantiles for unbounded data streams(groups.google.com) |
Perks – Effectively compute quantiles for unbounded data streams(groups.google.com) |
Is there any way to get a similar thing for a sliding window of a stream? For example, to be able to report (estimated) 90th percentile latencies for server requests in the last 5 minutes, hour, and day.
(Intuitively, a sliding window seems very hard/impossible -- how do you discard old events without keeping a complete record?)
If anyone is interested this was my write up for algorithm selection: http://blog.tempo-db.com/post/42318820124/estimating-percent...
Distributing the computation was quite easy. I emailed the authors of the paper and they gave me a quick answer, which is what I implemented and tested.
I'm interested in adding more implementations of the problem to perks, like Q-Digest, along with other streaming data problems.
I use this library for Q-Digest, it's worth taking a look at for an implementation reference. https://github.com/clearspring/stream-lib
Basically, when you have more data than memory and time to sort them in order to find the percentile you're looking for, you need to employ an algorithm that trades rank selection accuracy for lower memory and CPU costs. This package does that.
But for all but the most extreme cases, it's sufficient to just keep all the values in memory until they fall out of your window. Even if you're getting 1000 requests/second, that's still only 300,000 values that you have to store.
Edit:
I do agree that if you're working with datasets that fit in memory, you're probably better off keeping all the samples to find your percentile and not using this package. In fact, perks will not compress for datasets under 500 values.