How engaged are our users for a certain segment of the population? How many users are actively using a new feature? One way to answer that question is to compute the engagement ratio (ER) for that segment, which is defined as daily active users (DAU) over monthly active users (MAU), i.e.
Intuitively the closer the ratio is to 1, the higher the number of returning users. A segment can be an arbitrary combination of values across a set of dimensions, like operating system and activity date.
Ideally, given a set of dimensions, pre-computing the ER for all possible combinations of values of said dimensions would ensure that user queries run as fast as possible. Clearly that’s not very efficient though; if we assume for simplicity that every dimension has possible values, then there are ratios.
Is there a way around computing all of those ratios while still having an acceptable query latency? One could build a set of users for each value of each dimension and then at query time simply use set union and set intersection to compute any desired segment. Unfortunately that doesn’t work when you have millions of users as the storage complexity is proportional to the number of items stored in the sets. This is where probabilistic counting and the HyperLogLog sketch comes into play.
The HyperLogLog sketch can estimate cardinalities well beyond with a standard error of 2% while only using a couple of KB of memory.
The intuition behind is simple. Imagine a hash function that maps user identifiers to a string of N bits. A good hash function ensures that each bit has the same probability of being flipped. If that’s the case then the following is true:
- 50 % of hashes have the prefix 1
- 25 % of hashes have the prefix 01
- 12.5% of hashes the prefix 001
Intuitively, if 4 distinct values are hashed we would expect to see on average one single hash with a prefix of 01 while for 8 distinct values we would expect to see one hash with a prefix of 001 and so on. In other words, the cardinality of the original set can be estimated from the longest prefix of zeros of the hashed values. To reduce the variability of this single estimator, the average of K estimators can be used as the approximated cardinality and it can be shown that the standard error of a HLL sketch is .
One of the nice properties of HLL is that the union of two HLL sketches is very simple to compute as the union of a single estimator with another estimator is just the maximum of the two estimators, i.e. the longest prefix of zeros. This property makes the algorithm trivially parallelizable which makes it well suited for map-reduce style computations.
What about set intersection? There are two ways to compute that for e.g. two sets:
- Using the inclusion-exclusion principle: .
- Using the MinHash (MH) sketch, which estimates the Jaccard index that measures how similar two sets are: . Given the MH sketch one could estimate the intersection with .
It turns out that both approaches yield bad approximations when the overlap is small, which makes set intersection not particularly attractive for many use-cases, which is why we decided to use only the union operation for HLL sketches for our datasets.
Going back to the original problem of estimating the ER for an arbitrary segment, by computing the HLL sketches for all combinations of values across all dimensions the HLL sketch for any segment can be derived using set union.
At Mozilla we use both Spark and Presto for analytics and even though both support HLL their implementation is not compatible, i.e. Presto can’t read HLL sketches serialized from Spark. To that end we created a Spark package, spark-hyperloglog, and a Presto plugin, presto-hyperloglog, to extend both systems with the same HLL implementation.
Mozilla employees can access the client_count table from re:dash and run Presto queries against it to compute DAU, MAU or the ER for a certain segment. For example, this is how DAU faceted by channel and activity date can be computed:
SELECT normalizedchannel, activitydate, cardinality(merge(cast(hll AS HLL))) AS hll FROM client_count GROUP BY activitydate, normalizedchannel ORDER BY normalizedchannel, activitydate DESC
The merge aggregate function computes the set union while the cardinality function returns the cardinality of a sketch. A complete example with DAU, MAU and ER for the electrolysis engagement ratio can be seen here.
By default the HLL sketches in the client_count table have a standard error of about 1.6%.