Counting at scale

12 Apr, 2016

How engaged are users for a certain segment of the population? How many users are actively using a new feature? One way to answer that question is to compute the engagement ratio (ER) for that segment, which is defined as daily active users (DAU) over monthly active users (MAU), i.e.

$$ ER_{segment} = \frac{DAU_{segment}}{MAU_{segment}} $$

Intuitively the closer the ratio is to 1, the higher the number of returning users. A segment can be an arbitrary combination of values across a set of dimensions, like operating system and activity date.

Ideally, given a set of $ D $ dimensions, pre-computing the ER for all possible combinations of values of said dimensions would ensure that user queries run as fast as possible. Clearly that’s not very efficient though; if we assume for simplicity that every dimension has $ k $ possible values, then there are $ \sum_{d=1}^{D}\binom{D}{d} k^d $ ratios.

Is there a way around computing all of those ratios while still having an acceptable query latency? One could build a set of users for each value of each dimension and then at query time simply use set union and set intersection to compute any desired segment. Unfortunately that doesn’t work when you have millions of users as the storage complexity is proportional to the number of items stored in the sets. This is where probabilistic counting and the HyperLogLog sketch comes into play.

HyperLogLog

The HyperLogLog sketch can estimate cardinalities well beyond $ 10^9 $ with a standard error of 2% while only using a couple of KB of memory.

The intuition behind is simple. Imagine a hash function that maps user identifiers to a string of N bits. A good hash function ensures that each bit has the same probability of being flipped. If that’s the case then the following is true:

50 % of hashes have the prefix 1
25 % of hashes have the prefix 01
12.5% of hashes the prefix 001
…

Intuitively, if 4 distinct values are hashed we would expect to see on average one single hash with a prefix of 01 while for 8 distinct values we would expect to see one hash with a prefix of 001 and so on. In other words, the cardinality of the original set can be estimated from the longest prefix of zeros of the hashed values. To reduce the variability of this single estimator, the average of K estimators can be used as the approximated cardinality and it can be shown that the standard error of a HLL sketch is $ \frac{1.04}{sqrt(K)} $.

The detailed algorithm is documented in the original paper, and its practical implementation and variants are covered in depth by a 2013 paper from Google.

Set operations

One of the nice properties of HLL is that the union of two HLL sketches is very simple to compute as the union of a single estimator with another estimator is just the maximum of the two estimators, i.e. the longest prefix of zeros. This property makes the algorithm trivially parallelizable which makes it well suited for map-reduce style computations.

What about set intersection? There are two ways to compute that for e.g. two sets:

Using the inclusion-exclusion principle: $ |A \cap B| = |A| + |B| - |A \cup B| $.
Using the MinHash (MH) sketch, which estimates the Jaccard index that measures how similar two sets are: $ \frac{|A \cap B|}{|A \cup B|} $. Given the MH sketch one could estimate the intersection with $ |A \cap B| \approx MH(A, B) \times |A \cup B| $.

It turns out that both approaches yield bad approximations when the overlap is small, which makes set intersection not particularly attractive for many use-cases.

Implementation

At Mozilla we use both Spark and Presto for analytics and even though both support HLL their implementation is not compatible, i.e. Presto can’t read HLL sketches serialized from Spark. To that end we created a Spark package, spark-hyperloglog, and a Presto plugin, presto-hyperloglog, to extend both systems with the same HLL implementation.

#Observability #Statistics