Alice is impatient (brooker.co.za)
By focusing on the tail and optimizing worst cases you help users more than by improving your median latency.
I'm pretty sure what the author is saying is:
E(X) := \sum_t(t * P(X = t)) is the definition.
Another important note is P(X^2 = t^2) = P(X = t): latencies are non-negative, so squaring maps each value t one-to-one onto t^2 and the probabilities carry over unchanged.
E_a(X) is a bit sloppy, but consider X_a aka Alice's latency "experience" distribution. The argument is:
P(X_a = t) = t * P(X = t) / \sum_u(u * P(X = u)) - i.e. scale the probability up by t but make it sum to 1.
Then
E(X_a) = \sum_t(t * P(X_a = t)) = \sum_t(t * t * P(X = t)) / \sum_u(u * P(X = u))
aka
E(X^2) / E(X)
Then (from wikipedia)
Var(X) = E(X^2) - (E(X))^2
And we get
E(X_a) = (Var(X) + (E(X))^2) / E(X) = E(X) + Var(X) / E(X)
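To sanity-check the derivation above, here is a minimal Python sketch that builds the reweighted distribution X_a and compares its mean against E(X) + Var(X) / E(X). The latency values and probabilities are made up for illustration, and the helper names (`mean`, `variance`, `size_biased`) are mine, not from the article.

```python
def mean(dist):
    """E(X) for a discrete distribution given as {value: probability}."""
    return sum(t * p for t, p in dist.items())


def variance(dist):
    """Var(X) = E((X - E(X))^2)."""
    m = mean(dist)
    return sum((t - m) ** 2 * p for t, p in dist.items())


def size_biased(dist):
    """P(X_a = t) = t * P(X = t) / E(X): scale each probability by t, renormalize."""
    m = mean(dist)
    return {t: t * p / m for t, p in dist.items()}


# Hypothetical latency distribution in seconds (illustrative numbers only).
latency = {0.1: 0.90, 0.5: 0.09, 5.0: 0.01}

experienced = size_biased(latency)
lhs = mean(experienced)                                   # E(X_a) computed directly
rhs = mean(latency) + variance(latency) / mean(latency)   # E(X) + Var(X) / E(X)

print(f"E(X)   = {mean(latency):.4f}")
print(f"E(X_a) = {lhs:.4f}")
print(f"E(X) + Var(X)/E(X) = {rhs:.4f}")
```

The two values should agree up to floating-point error, and E(X_a) comes out well above E(X) whenever the distribution has a heavy tail.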
I find this framing much more thought-provoking and visceral, to the extent that p99 now boggles my mind. Two nines would be dreadful as an availability figure, yet for UX it's treated very differently. My measurements corroborate exactly that: good UX requires the same many-nines reliability as e.g. data centers, not one or two nines.
I wonder if it's p90 and p99 to blame for the shoddy services we have, in a way. It's pretty hard to argue for fixing something when it's presented as only going wrong 0.5% or less of the time after all. Even if at scale that means most of your users are experiencing it weekly.
Is the difference more about measuring a request "across services"? That is, the total cumulative p99 across services must be small i.e. linking all requests to a user and then measuring that? Or is the difference elsewhere?
If the former: are you taking traces and graphing that? What's your methodology?
I visit HN, that's one request. But I visit HN multiple times a day. So for the operation that serves the homepage, if you took e.g. a past 24hr latency p99 chart, the number of requests analyzed would not be the same as the number of unique users involved in making those requests, potentially drastically so.
So you might see a p99 you're comfortable with, and conclude that since only 1% of requests were worse than that, it's fine. In practice though, depending on how "well-trodden" that operation is, you might very well be in a situation where all users experienced at least one such beyond-SLO event that day. It's a nonlinear relationship.
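A rough way to see the nonlinearity: if each request independently has a 1% chance of landing beyond the p99, the chance that a user hits at least one such request is 1 - 0.99^n for n requests. The independence assumption is mine and is optimistic or pessimistic depending on how slow requests cluster in practice; the function name is hypothetical.

```python
def p_at_least_one_slow(requests_per_user, slow_fraction=0.01):
    """Probability that a user sees >= 1 slow request.

    Assumes every request is independently slow with probability
    `slow_fraction` (0.01 corresponds to "worse than p99").
    """
    return 1.0 - (1.0 - slow_fraction) ** requests_per_user


# Illustrative request counts per user per day.
for n in (1, 10, 100, 500):
    print(f"{n:4d} requests/day -> {p_at_least_one_slow(n):.1%} of users affected")
```

Under this assumption, a user making on the order of a hundred requests a day is more likely than not to hit a beyond-p99 request.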
The cross-operation version of this is important as well: users can also hit snags spread across the steps of a common flow, and the same reasoning applies.
Regarding methodology, it's nothing special, I just rely on user IDs and correlation IDs. It really is just a perspective shift, the underlying data is the same. You can calculate back the number of nines you'd need to get an acceptable UX using this, as long as the general usage habits are stable. It's just gonna be a lot more nines than two in my experience.
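One possible way to "calculate back" the nines, as a sketch and not necessarily the method described above: pick the fraction of users you can tolerate being affected per day, assume a stable request count per user and independent requests, and solve 1 - (1 - f)^n = q for the per-request slow fraction f. All names and numbers here are hypothetical.

```python
import math


def required_slow_fraction(requests_per_user, max_affected_users):
    """Largest per-request slow fraction f such that at most
    `max_affected_users` of users see >= 1 slow request,
    assuming independent requests: 1 - (1 - f)**n = q."""
    return 1.0 - (1.0 - max_affected_users) ** (1.0 / requests_per_user)


def nines(slow_fraction):
    """Express a failure fraction as a number of nines (0.01 -> 2, 0.001 -> 3)."""
    return -math.log10(slow_fraction)


# Illustrative target: at most 1% of users affected, 100 requests per user per day.
f = required_slow_fraction(requests_per_user=100, max_affected_users=0.01)
print(f"per-request slow fraction: {f:.6f}")
print(f"that is about {nines(f):.1f} nines")
```

With these made-up inputs the per-request budget lands around four nines, which matches the "a lot more nines than two" intuition.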
Say that there are two different waiting times, 1s and 3s, and they happen with probability 50% each. The average waiting time (1/2 * 1 + 1/2 * 3) is 2s. However, 75% of the time we are waiting on a 3s event and only 25% on a 1s event, so the time-weighted average is 1/4 * 1 + 3/4 * 3 = 2.5s. E[X^2] = 1/2 * 1 + 1/2 * 9 = 5 (s^2) is not the right answer by itself; it still has to be divided by E[X] = 2 (s) to get the correct 2.5s.
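The same 1s/3s example worked with exact fractions, as a quick check of the arithmetic. The variable names are mine.

```python
from fractions import Fraction

# Two waiting times (seconds), each occurring for half of the events.
waits = {1: Fraction(1, 2), 3: Fraction(1, 2)}

# Per-event average: 1/2 * 1 + 1/2 * 3
mean_wait = sum(t * p for t, p in waits.items())

# Share of total waiting time spent inside each kind of event.
time_share = {t: t * p / mean_wait for t, p in waits.items()}

# Time-weighted average wait, i.e. what the person waiting experiences.
experienced = sum(t * share for t, share in time_share.items())

# Second moment E[X^2], which must still be divided by E[X].
second_moment = sum(t * t * p for t, p in waits.items())

print("E[X]            =", mean_wait)
print("time shares     =", time_share)
print("experienced     =", experienced)
print("E[X^2] / E[X]   =", second_moment / mean_wait)
```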