View Counting at Reddit

489 points by strzalek 9 years ago | 112 comments

haburka 9 years ago |

I love the article on hyperloglog! It is really quite good to read even if you're not interested in algorithms. I always liked number theory and I think that it's very interesting that you can guess how many uniques there are by counting how long your longest run of zeroes in a hash is.

I suppose this could be broken by injecting in a unique visitor id that would hash to something with an absurd amount of zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.

lucasschm 9 years ago | |

You are correct, but HyperLogLog has many buckets counting the longest run of zeros in order to avoid the problem of outliers. I recently studied these probabilistic algorithms and did a notebook with code and plots to show their performance: https://github.com/lucasschmidtc/Probabilistic-Algorithms/bl...
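
For anyone who wants to see the bucket idea in code, here is a minimal sketch, not the notebook's code and not what Reddit or Redis runs. The function name `hll_estimate`, the register count `p=10`, and the use of BLAKE2b as the hash are my own assumptions for illustration. The first `p` bits of each hash pick a bucket, and each bucket remembers the longest run of leading zeros it has seen (stored as the position of the first 1-bit), so a single lucky outlier only affects one bucket out of many.

```python
import hashlib
import math

def hll_estimate(items, p=10):
    """Estimate the number of distinct items using 2**p registers (illustrative sketch)."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item; BLAKE2b is a stand-in for any well-mixed hash
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - p)                  # first p bits choose the bucket
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # rank = position of the first 1-bit in the remaining bits (leading zeros + 1)
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias-correction constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:             # small-range correction (linear counting)
        return m * math.log(m / zeros)
    return raw
```

Duplicates never change a register, because each register only keeps a maximum, which is why the same visitor refreshing a page does not move the estimate.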

snowcrshd 9 years ago | | |

Thanks for sharing that!

Just skimmed through it and seems pretty interesting. I'll read it more in depth later.

anhldbk 9 years ago | | |

Thanks for sharing guy! Interesting repo.

twotwotwo 9 years ago | |

For outliers by random chance, lucasschm's reply explains.

The usual trick for preventing folks _maliciously_ sending in outliers is to use a hash with a secret key such as SipHash so that folks on the outside can't trivially figure out what inputs will lead to hashes with a lot of leading zeroes.
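
A rough sketch of the keyed-hash idea, with one substitution stated plainly: I'm using keyed BLAKE2b from Python's `hashlib` as a stand-in for SipHash, since it plays the same role here (a hash whose output depends on a secret). The names `SECRET_KEY` and `keyed_hash64` are hypothetical.

```python
import hashlib
import secrets

# Hypothetical server-side secret; it is generated once and never sent to clients.
SECRET_KEY = secrets.token_bytes(16)

def keyed_hash64(visitor_id: str, key: bytes = SECRET_KEY) -> int:
    """Return a 64-bit keyed hash of visitor_id (BLAKE2b stand-in for SipHash)."""
    digest = hashlib.blake2b(visitor_id.encode(), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Without the key, someone on the outside cannot search offline for an id whose hash has many leading zeroes, because they cannot compute the hash at all.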

Lukassus 9 years ago | |

The HyperLogLog was a very nice article, but I wanted to ask, is this related to the estimation of Nazi tanks during WW2 by the Allies? https://www.wired.com/2010/10/how-the-allies-used-math-again...

andreareina 9 years ago | | |

The German Tank Problem guesses the size of a set, given a limited sample and successive serial numbers. If they had randomized the serial numbers it wouldn't have worked.

HyperLogLog is different because you have the entire population (not just a sample), and it's a multiset (the same element can appear more than once). Getting the size of a (non-multi) set is easy, you just keep a counter and increment it for each element; it only takes enough memory to maintain the counter. Counting the distinct members of a multiset takes a lot more memory because you have to remember whether you've already seen a particular element or not.

The tl;dr is that the German Tank Problem is about making an estimate of size when you have imperfect information, and HyperLogLog gives you an estimate when you have perfect information, but it's too expensive to make an exact calculation.
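
To make the contrast concrete, here is a small sketch of the standard frequentist estimator for the German Tank Problem: with a sample of k serial numbers whose maximum is m, the estimate is m + m/k - 1. The function name and the sample serials are made up for illustration.

```python
def german_tank_estimate(serials):
    """Estimate the population size from sequential serial numbers starting at 1."""
    k = len(serials)   # sample size
    m = max(serials)   # largest serial number observed
    return m + m / k - 1

# Four captured tanks with a largest serial of 60: 60 + 60/4 - 1
print(german_tank_estimate([19, 40, 42, 60]))  # 74.0
```

Note that it only needs the sample maximum and the sample size, which is exactly why randomized serial numbers would have defeated it.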

nyar 9 years ago |

"We want to better communicate the scale of Reddit to our users."

If that's true why did they hide vote numbers on comments and posts? It used to say "xxx upvotes xxx downvotes" now it just gives a number and hides that.

jonknee 9 years ago | |

It's to deter bots. The numbers weren't previously accurate, they were fuzzed (also to deter bots).

https://www.reddit.com/wiki/faq#wiki_how_is_a_submission.27s...

ma2rten 9 years ago | | |

I don't quite see the connection. How exactly does this deter bots?

stevenh 9 years ago | |

Suppose there's a comment with 12 upvotes and 207 downvotes. Now suppose you're reddit and you want to make this comment seem more popular than it actually is.

You could slowly remove the downvotes, but attentive people refreshing the page will see the number shrinking and become suspicious, because real users never suddenly do a mass retraction of their votes.

You could slowly add a bunch of upvotes, but then people will wonder why this comment with a consistent 12/207 popularity ratio for the past hour suddenly overcame it and became the most popular comment in the thread. People will suspect a coordinated raid took place.

Both approaches raise too much suspicion. The safest approach is to turn off the ability to view the separate upvote/downvote values altogether and use a simple easing function to artificially increase the comment's total score over time. When no one can see the upvote/downvote ratio or the volume of vote activity over time, they lose the ability to judge whether manipulation is taking place.

You need an excuse for the change, so don't forget to also come up with a spurious narrative about how it was supposedly done to fight bots.

joshuamorton 9 years ago | | |

Except that vote fuzzing was always a thing, so when you'd see 1000 upvotes and 100 downvotes, that could be off by an enormous amount (I think a reddit admin said by a factor of 5x or 10x in extreme cases, i.e. frontpage posts).

awalton 9 years ago | |

...because it's almost certainly a lie. It's because they want to better communicate the scale of Reddit to their customers - the companies on the other side of the links who they are driving content to.

It's way easier to say "Hey, we're giving you XYZ traffic, give us ABC dollars," when you have the figures in front of you rather than just upvote/downvote numbers.

mxmxm 9 years ago |

Counting views/impressions in combination with Apache Kafka sounds like the ideal use case for a stream processor like Apache Flink. It supports very large state which can be managed off-heap. This should enable you to count the exact number of unique views in real time with exactly-once semantics. Here is a blog post on large-scale counting with more details. It also includes a comparison with other streaming technologies like Samza and Spark: https://data-artisans.com/blog/counting-in-streams-a-hierarc...

Also check out this blog post by a Twitter engineer on counting ad impressions: https://data-artisans.com/blog/extending-the-yahoo-streaming...

noamhacker 9 years ago |

How do you test a system like this for accuracy? Is this done by simulating millions of unique requests?

andreareina 9 years ago | |

The algorithm's accuracy is known. From the wiki[1]:

    The HyperLogLog algorithm is able to estimate 
    cardinalities of > 10^9 with a typical error rate of 2%

[1] https://en.wikipedia.org/wiki/HyperLogLog
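
As a rough guide to where a figure like that comes from: the standard error of a HyperLogLog estimate is about 1.04 / sqrt(m), where m is the number of registers. The snippet below just evaluates that formula; the function name is mine, and the 16384-register example matches the register count Redis uses for its HLL type.

```python
import math

def hll_standard_error(m: int) -> float:
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(m)

print(hll_standard_error(2048))   # about 0.023, i.e. roughly 2%
print(hll_standard_error(16384))  # 0.008125, i.e. about 0.81%
```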

federicoponzi 9 years ago | | |

But what about the implementation accuracy? :)

GhostVII 9 years ago | |

Reddit probably has enough analytics to be able to show mathematically that it will be accurate without simulating any requests.

icelancer 9 years ago | |

Can't you just use Apache Benchmark and some proxies?

alzaeem 9 years ago |

So how do they determine whether a user has viewed a post already? I would think that unique counting is accomplished using the hyperloglog counter, but the article says that this decision is made by the Nazar system, which doesn't use the hyperloglog counter in Redis.

hrshtr 9 years ago | |

That's true. I am thinking that Nazar is more like a spam filter that monitors user behavior.

kchandra 9 years ago | | |

Pretty much, yeah.

lucasschm 9 years ago | |

Bloom filters? They have false positives but no false negatives.
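
A minimal sketch of the idea, not a claim about what Nazar actually does. The class name, the bit-array size, and the number of hash functions are arbitrary choices for illustration, and salted BLAKE2b stands in for a family of independent hashes.

```python
import hashlib

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash with the index
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

Usage would look like `seen = BloomFilter(); seen.add("user42:post7"); "user42:post7" in seen`. For view counting, a false positive means a genuinely new view is occasionally skipped, so the filter can undercount slightly but never double-counts.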

jimmaswell 9 years ago | |

Why can't they just associate a list of viewed posts with each user, or list of users that viewed a post with each post, and check that? I don't get why this needs any consideration.

sethammons 9 years ago | | |

They addressed your second point in the article. On a popular post, you would be storing several megabytes of data to capture/relate each unique user that visited. That gets expensive at scale. HLL takes that down to a few kilobytes, less than 1% of the original size.

For your first suggestion, you would have to do a very expensive look up. You couldn't cache it effectively due to the requirement of near real time stats. You could improve look up time using columnar storage, but the performance and memory usage will be nowhere near as nice as with HLL.

Problems are harder at scale.
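
Back-of-the-envelope version of the size comparison, under assumptions I'm picking for illustration: one million unique viewers on a post, 8-byte user ids, and an HLL with 16384 six-bit registers (the layout Redis uses).

```python
# Illustrative assumptions, not figures taken from this thread
viewers = 1_000_000
bytes_per_id = 8

exact_bytes = viewers * bytes_per_id   # storing every id: 8,000,000 bytes (8 MB)
hll_bytes = 16384 * 6 // 8             # 16384 registers x 6 bits: 12,288 bytes (12 KiB)
ratio = hll_bytes / exact_bytes        # about 0.0015, i.e. roughly 0.15%

print(exact_bytes, hll_bytes, ratio)
```

The HLL size stays fixed no matter how many viewers there are, while the exact set keeps growing with every new viewer.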

eropple 9 years ago | | |

Have you stopped to think how many users that is and how many posts?

Viewing a single thread could require five hundred associations.

stoicking 9 years ago |

Given how much simpler it is to count total views than unique user views, why is it more valuable to count unique user views?

jonathanbull 9 years ago | |

From a Reddit engineer:

"This was a product decision. Currently view counts are purely cosmetic, but we did not want to rule out the possibility of them being used in ranking in the future. As such, building in some degree of abuse protection made sense (e.g. someone can't just sit on a page refreshing to make the view number go up). I am fully expecting us to tweak this time window (and the duplication heuristics in general) in future, especially as the way that users interact with content will change as Reddit evolves."

https://www.reddit.com/r/programming/comments/6da6n9/comment...

danso 9 years ago | |

Because it's more valuable of a data point to those who care about overall audience and reach. Someone visiting repeatedly might be evidence of an engaged user, but things like ads would have diminishing returns.

Namrog84 9 years ago | |

Possibly to combat bots or artificially inflated view statistics?

Splendor 9 years ago | |

Because it's a metric advertisers care about.

tudorconstantin 9 years ago |

Wouldn't it have been easier to simply increment a counter for each visit and then set a short-lived cookie in the browser for that post? And put the spam detection system before the counter increment.

tsukaisute 9 years ago |

Weird thing I have been seeing on Reddit is comment upvotes being off-by-one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Again, you get 3. Seems like a replication issue?

kelnage 9 years ago | |

> Weird thing I have been seeing on Reddit is comment upvotes being off-by-one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Reload, you get 3. Seems like a replication issue?

This is done on purpose [1], to prevent bots from calculating exact post/comment scores.

1. https://www.reddit.com/wiki/faq#wiki_how_is_a_comment.27s_sc...

Xeoncross 9 years ago | | |

I still don't understand what purpose this feature has - can you explain more?

samtho 9 years ago | |

That's vote fuzzing you're seeing. It's to prevent people (read: bots) from being able to tell if they are shadowbanned.

kondor6c 9 years ago | |

I believe they are using Cassandra to store the upvotes.

sverhagen 9 years ago | | |

Just curious if this is a stab at Cassandra, or whether use of Cassandra would automatically imply eventual consistency or something else that would appear in this way?

ketralnis 9 years ago | | |

That one is in postgres

theomega 9 years ago |

Very interesting article, thanks for publishing.

I have two related questions: 1. I assume the process which reads from Cassandra and puts it back into Redis is parallelized, if not distributed. How do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do you lock in Redis? 2. What database is used to actually store the view counts? Cassandra's counters are afaik not very reliable...

kchandra 9 years ago | |

1. Redis is atomic, so we use the SETNX operation to ensure that only one write succeeds.

2. We have HLLs in Redis, so we just issue a PFCOUNT and store the result of that in Cassandra as an integer value. We don't use counters in Cassandra.
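
For readers following along, one way to read those two answers as code. This is a sketch under assumptions, not Reddit's implementation: the key format `views:<post_id>` and all three function names are hypothetical, and `client` is expected to be something like a redis-py `redis.Redis()` instance (only its `setnx`, `pfadd`, and `pfcount` methods are used).

```python
def restore_if_absent(client, post_id: str, serialized_hll: bytes) -> bool:
    """Write a saved HLL back to Redis only if the key does not exist yet.

    SETNX is atomic, so if several workers race to restore the same post,
    exactly one write succeeds and the rest are no-ops.
    """
    return bool(client.setnx(f"views:{post_id}", serialized_hll))

def record_view(client, post_id: str, user_id: str) -> None:
    """Add a viewer to the post's HyperLogLog (PFADD ignores repeat viewers)."""
    client.pfadd(f"views:{post_id}", user_id)

def snapshot_count(client, post_id: str) -> int:
    """Read the current estimate via PFCOUNT.

    The returned plain integer is what would be stored in the database,
    rather than relying on Cassandra counters.
    """
    return int(client.pfcount(f"views:{post_id}"))
```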

ronalbarbaren 9 years ago |

Thanks, Reddit guys. I hope a YouTube engineer will post a similar article. I'm still curious how YouTube counts views.

hellbanner 9 years ago |

Slightly OT, but I wish reddit would use traditional forum-style replies to push threads up, instead of the positive feedback loop where opinions that agree with the majority get upvotes, which bring views, which bring proportionally more upvotes.

federicoponzi 9 years ago |

Probably noob question, but:

>> Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.

Why don't they just discard it instead of putting the event back into Kafka?

bashtoni 9 years ago | |

I suspect they archive events into S3 or similar for later analysis/training.

golergka 9 years ago |

A beautiful example of how a feature that seems so easy to an end user can be complex at scale.

fiatjaf 9 years ago |

At https://trackingco.de/ we store events on Redis and compile them daily into a reduced string format, storing these on CouchDB.

ugh123 9 years ago |

Forgive my ignorance, but isn't this what Google Analytics is for?

PetahNZ 9 years ago | |

Google Analytics is not accurate (it's sampled) or real-time (48-hour turnaround).

raquo 9 years ago | | |

^ For big sites like reddit, which is why you don't typically run into this when using GA on your personal blog

659087 9 years ago | |

Google Analytics is for giving Google the ability to track your users.

ckarmann 9 years ago | |

That would not help to prevent illegitimate views like those generated by spambots.

hashhar 9 years ago | |

It uses sampling to generate reports AFAIK.

qrbLPHiKpiux 9 years ago |

Not applied to /r/the_donald however.

hexane360 9 years ago | |

Are you talking about the "impressions"/subscribers incident? Because that was a mislabeled field that affected almost every other sub more than T_d.

https://www.reddit.com/r/help/comments/62naj4/can_someone_ex...

https://www.reddit.com/r/SubredditDrama/comments/62nw33/rthe...

hellbanner 9 years ago | |

I didn't see this in the article