# On Monoids For Analytics

This is a short post on the elegance of using abstract algebra for analytics in Scala.

A monoid is a set $T$ that is closed under an associative binary operation $append$ with an identity element $zero$ such that $append(a, zero) = a$. In other words, the following 3 properties apply:

• Closure – the result of combining two elements of the set is also an elment of the set: $\forall a, b \in T: append(a, b) \in T$

• Associativity – when combining more than two elements of a set the order of the pairwise combinations doesn’t matter, which makes a monoid well suited for parallel computations: $\forall a, b, c \in T: append(append(a, b), c) = append(a, append(b, c))$

• Identity – there is a special element of the set that when combined with any other element of the set yields the same element: $\exists zero \in T: \forall a \in T: append(zero, a) = append(a, zero) = a$

One way to define a trait for a monoid in Scala is the following:

trait Monoid[T] {
def zero: T
def append(a: T, b: T): T
}


Monoids are everywhere; think of the set of natural numbers and addition or the set of strings and concatenation. Also note that the same set can have multiple “monoidal forms”; for example the set of natural numbers can have both an additive and a multiplicative monoid.

Monoids compose well; for example a tuple of monoids is itself a monoid, as such it’s simple to define a monoid for a complex type once monoids for its constituents types exists. Scalaz and Algebird are two Scala libraries that provide monoids for data types such as List, Set, Option, Map and others. Algebird in particular, which is targeted at building aggregation systems, comes with a set of monoids useful for counting such as DecayedValue for exponential decay, AveragedValue for averaging and HyperLogLog for approximate cardinality counting.

As a concrete example on how monoids provided by Scalaz are used in practice, suppose we have a server that receives sparse histograms from its clients:

import scalaz._
import Scalaz._

case class Histogram(var values: Map[Long, Long], sum: Int, count: Int)


We would like to aggregate histograms over a time window. It’s simple enough to write some code to do that, yet monoids offer an elegant solution. Since Scalaz provides by default an additive monoid for Map and Long, we can easily define one for Histogram as well:

implicit def histogramMonoid: Monoid[Histogram] = new Monoid[Histogram] {
def zero = Histogram(Map(), 0, 0)
def append(a: Histogram, b: => Histogram) = Histogram(a.values |+| b.values,
a.sum    |+| b.sum,
a.count  |+| b.count)
}


Where |+| invokes the binary operator of the monoid defined over the type of its operators.  We can aggregate histograms now that we have a monoid, e.g.:

val h1 = Histogram(Map(10 -> 12, 100 -> 41), 53, 3)
val h2 = Histogram(Map(10 -> 21, 80 -> 14), 35, 17)
h1 |+| h2


yields:

Histogram(Map(10 -> 33, 80 -> 14, 100 -> 41), 88, 20)


Let’s say that we have different histograms for different measurements which are stored in a mapping from strings to histograms. Since there is a monoid defined over Histogram then we can easily aggregate multiple Map[String, Histogram] as well:

val m1 = Map("metric1" -> h1, "metric2" -> h2)
val m2 = Map("metric1" -> h2, "metric3" -> h1)
m1 |+| m2


yields:

Map(metric1 -> Histogram(Map(10 -> 33, 80 -> 14, 100 -> 41), 88, 20),
metric2 -> Histogram(Map(10 -> 21, 80 -> 14), 35, 17),
metric3 -> Histogram(Map(10 -> 12, 100 -> 41), 53, 3))


This is just the tip of the iceberg of the elegance provided by abstract algebra. If this short article caught your interest grab a copy of Functional Programming in Scala, which is without any doubt the best functional programming book I have ever read. The book introduces a variety of abstract algebra concepts with their relative implementations.

# Detecting Talos regressions

This post is about modelling Talos data with a probabilistic model which can be applied to different use-cases, like detecting regressions and/or improvements over time.

Talos is Mozilla’s multiplatform performance testing framework written in python that we use to run and collect statistics of different performance tests after a push.

As a concrete example, this is how the performance data of a test might look like over time: Even though there is some noise, which is exacerbated in this graph as the vertical axis doesn’t start from 0, we clearly see a shift of the distribution over time. We would like to detect such shifts as soon as possible after they happened.

Talos data has been known for a while to generate in some cases bi-modal data points that can break our current alerting engine. Possible reasons for bi-modality are documented in Bug 908888. As past efforts to remove the bi-modal behavior at the source have failed we have to deal with it in our model.

The following are some notes originated from conversations with Joel, Kyle, Mauro and Saptarshi.

Mixture of Gaussians

The data can be modelled as a mixture of $K$ Gaussians, where the parameter $K$ could be determined by fitting $K$ models and selecting the best one according to some criteria.

The first obstacle is to estimate the parameters of the mixture from a set of data points. Let’s state this problem formally; if you are not interested in the mathematical derivation it suffices to know that scikit-learn has an efficient implementation of it.

EM Algorithm

We want to find the probability density $f(x)$, where $f$ is a mixture of $K$ Gaussians, that is most likely to have generated a given data point $x$: $f(x; \theta) = \sum_{k=1}^{K} p_k g(x; \mu_k, \sigma_k)$

where $p_k$ is the mixing coefficient of cluster $K$, i.e. the probability that a generic point belongs to cluster $K$ so that $\sum_{k=1}^{K}p_k = 1$, and $g$ is the probability density function of the normal distribution: $g(x; \mu_k, \theta_k) = \frac{1}{(\sqrt{2\pi\sigma_k})}e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}}$

Now, given a set of $N$ data points that are independent and identically distributed, we would like to determine the values of $p_k$, $\mu_k$ and $\sigma_k$ that maximize the log-likelihood function. Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function. $\log\mathcal{L}(\theta | x_1, ..., n_n) = \log\prod_{i=1}^{N} f(x_i; \theta)$

To find a maximum of $\mathcal{L}$, let’s compute the partial derivative of it wrt $\mu_k$, $\sigma_k$ and $p_k$. Since $\frac{\partial{g(x_i; \mu_k, \theta_k)}}{\partial{\mu_k}} = g(x_i; \mu_k, \theta_k) \frac{\partial}{\partial{\mu_k}} [-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}] = \frac{g(x_i; \mu_k, \theta_k) (x_i - \mu_k)}{\sigma_k^2}$

then $\frac{\partial\log\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{1}{\sigma_k^2}\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}(x_i - \mu_k)$

But, by Bayes’ Theorem, $\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}$ is the conditional probability of selecting cluster $k$ given that the data point $x_i$ was observed, i.e. $p(k|x_i)$, so that: $\frac{\log\partial\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{p(k|x_i)}{\sigma_k^2}(x_i - \mu_k)$

By applying a similar procedure to compute the partial derivative with respect to $\sigma_k$ and $p_k$ and finally setting the derivatives we just found to zero, we obtain: $\mu_k = \frac{\sum_{i=1}^{N}p(k|x_i)x_i}{\sum_{i=1}^{N}p(k|x_i)}$ $\sigma_k = \sqrt{\frac{\sum_{i=1}^{N}p(k|x_i)(x_i - \mu_k)^2}{\sum_{n=1}^{N}p(k|x_i)}}$ $p_k = \frac{1}{N}\sum_{i=1}^{N}p(k|x_i)$

The first two equations turn out to be simply the sample mean and standard deviation of the data weighted by the conditional probability that component $k$ generated the data point $x_i$.

Since the terms $p(k|x_i)$ depend on all the terms on the left-hand side of the expressions above, the equations are hard to solve directly and this is where the EM algorithm comes to rescue. It can be proven that the EM algorithm convergences to a local maximum of the likelihood function when the following computations are iterated:

###### E Step $p^{(n)}(k|x_i) = \frac{p_k^{(n)}g(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}{\sum_{j=1}^{N}p_jg(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}$

###### M Step $\mu_k^{(n+1)} = \frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)x_i}{\sum_{i=1}^{N}p^{(n)}(k|x_i)}$ $\sigma_k^{(n+1)} = \sqrt{\frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)(x_i - \mu^{(n+1)})^2}{\sum_{n=1}^{N}p^{(n)}(k|x_i)}}$ $p_k^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N}p^{(n)}(k|x_i)$

Intuitively, in the E-step the parameters of the components are assumed to be given and the data points are soft-assigned to the clusters. In the M-step we compute the updated parameters for our clusters given the new assignment.

Determine K

Now that we have a way to fit a mixture of $K$ gaussians to our data, how do we determine $K$? One way to deal with it is to generate $K$ models and select the best one according to their BIC score. Adding more components to a model will fit the data better but doing so may result in overfitting. BIC prevents this problem by introducing a penalty term for the number of parameters in the model.

Regression Detection

A simple approach to detect changes in the series is to use a rolling window and compare the distribution of the first half of the window to the distribution of the second half. Since we are dealing with Gaussians, we can use the z-statistic to compare the mean of each component in the left window to mean of its corresponding component in the right window: $z = \frac{\mu_l - \mu_r}{\sqrt{\frac{\sigma_l^2}{n_l} + \frac{\sigma_r^2}{n_r}}}$

In the following plots the red dots are points at which the regression detection would have fired. Ideally the system would generate a single alert per cluster for the first point after the distribution shift.   Talos generates hundreds of different time series, some with dominating and peculiar noise patterns. As such it’s hard to come up with a generic model that solves the problem for good and represents the data perfectly.

Since the API to access this data is public, it provides an exciting opportunity for a contributor to come up with better ways of representing it. Feel free to join us on #perf if you are interested. Oh and, did I mentions we are hiring a Senior Data Engineer?

# Telemetry meets Parquet

In Mozilla’s Telemetry land, raw JSON pings are stored on S3 within files containing framed Heka records, which form our immutable, append-only master dataset. Reading the raw data with Spark can be slow for analyses that read only a handful of fields per record from the thousands available; not to mention the cost of parsing the JSON blobs in the first place.

We are slowly moving away from JSON to a more OLAP-friendly serialization format for the master dataset, which requires defining a proper schema. Given that we have been collecting data in big JSON payloads since the beginning of time, different subsystems have been using various, in some cases schema un-friendly, data layouts. That and the fact that we have thousands of fields embedded in a complex nested structure makes it hard to retrospectively enforce a schema that matches our current data, so a change is likely not going to happen overnight.

Until recently the only way to process Telemetry data with Spark was to read the raw data directly, which isn’t efficient but gets the job done.

Batch views

Some analyses require filtering out a considerable amount of data before the actual workload can start. To improve the efficiency of such jobs, we started defining derived streams, or so called pre-computed batch views in lambda’s architecture lingo.

A batch view is regenerated or updated daily and, in a mathematical sense, it’s simply a function over the entire master dataset. The view usually contains a subset of the master dataset, possibly transformed, with the objective to make analyses that depend on it more efficient. As a concrete example, we have run an A/B test on Aurora 43 in which we disabled or enabled Electrolysis (E10s) based on a coin flip (note that E10s is enabled by default on Aurora). In this particular experiment we were aiming to study the performance of E10s. The experiment had a lifespan of one week per user. As some of our users access Firefox sporadically, we couldn’t just sample a few days worth of data and be done with it as ignoring the long tail of submissions would have biased our results.

We clearly needed a batch view of our master dataset that contained only submissions for users that were currently enrolled in the experiment, in order to avoid an expensive filtering step.

Parquet

We decided to serialize the data for our batch views back to S3 as Parquet files. Parquet is a columnar file storage that is slowly becoming the lingua franca of Hadoop’s ecosystem as it can be read and written from e.g. Hive, Pig & Spark.

Conceptually it’s important to clarify that Parquet is just a storage format, i.e. a binary representation of the data, and it relies on object models, like the one provided by Avro or Thrift, to represent the data in memory. A set of object model converters are provided to map between the in-memory representation and the storage format.

Parquet supports efficient compression and encoding schemes, which are applied on a per-column level where the data tends to be homogeneous. Furthermore, as Spark can load parquet files in a Dataframe, a Python analysis can potentially experience the same performance as a Scala one thanks to a unified physical execution layer.

A Parquet file is composed of:

• row groups, which contain a subset of rows – a row group is composed of a set of column chunks;
• column chunks, which contain values for a specific column – a column chunk is partitioned in a set of pages;
• pages, which are the smallest level of granularity for reads – compression and encoding are applied at this level of abstraction
• a footer, which contains the schema and some other metadata. As batch views typically use only a subset of the fields provided in our Telemetry payloads, the problem of defining a schema for such subset becomes a non-issue.

Benefits

Unsurpsingly, we have seen speed-ups of up to a couple of orders of magnitude in some analyses and a reduction of file size by up to 8x compared to files having the same compression scheme and content (no JSON, just framed Heka records).

Each view is generated with a Spark job. In the future Heka is likely going to produce the views directly, once it supports Parquet natively.