# Detecting Talos regressions

This post is about modelling Talos data with a probabilistic model which can be applied to different use-cases, like detecting regressions and/or improvements over time.

Talos is Mozilla’s multiplatform performance testing framework written in python that we use to run and collect statistics of different performance tests after a push.

As a concrete example, this is how the performance data of a test might look like over time:

Even though there is some noise, which is exacerbated in this graph as the vertical axis doesn’t start from 0, we clearly see a shift of the distribution over time. We would like to detect such shifts as soon as possible after they happened.

Talos data has been known for a while to generate in some cases bi-modal data points that can break our current alerting engine.

Possible reasons for bi-modality are documented in Bug 908888. As past efforts to remove the bi-modal behavior at the source have failed we have to deal with it in our model.

The following are some notes originated from conversations with Joel, Kyle, Mauro and Saptarshi.

Mixture of Gaussians

The data can be modelled as a mixture of $K$ Gaussians, where the parameter $K$ could be determined by fitting $K$ models and selecting the best one according to some criteria.

The first obstacle is to estimate the parameters of the mixture from a set of data points. Let’s state this problem formally; if you are not interested in the mathematical derivation it suffices to know that scikit-learn has an efficient implementation of it.

EM Algorithm

We want to find the probability density $f(x)$, where $f$ is a mixture of $K$ Gaussians, that is most likely to have generated a given data point $x$:

$f(x; \theta) = \sum_{k=1}^{K} p_k g(x; \mu_k, \sigma_k)$

where $p_k$ is the mixing coefficient of cluster $K$, i.e. the probability that a generic point belongs to cluster $K$ so that $\sum_{k=1}^{K}p_k = 1$, and $g$ is the probability density function of the normal distribution:

$g(x; \mu_k, \theta_k) = \frac{1}{(\sqrt{2\pi\sigma_k})}e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}}$

Now, given a set of $N$ data points that are independent and identically distributed, we would like to determine the values of $p_k$, $\mu_k$ and $\sigma_k$ that maximize the log-likelihood function. Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.

$\log\mathcal{L}(\theta | x_1, ..., n_n) = \log\prod_{i=1}^{N} f(x_i; \theta)$

To find a maximum of  $\mathcal{L}$, let’s compute the partial derivative of it wrt $\mu_k$, $\sigma_k$ and $p_k$. Since

$\frac{\partial{g(x_i; \mu_k, \theta_k)}}{\partial{\mu_k}} = g(x_i; \mu_k, \theta_k) \frac{\partial}{\partial{\mu_k}} [-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}] = \frac{g(x_i; \mu_k, \theta_k) (x_i - \mu_k)}{\sigma_k^2}$

then

$\frac{\partial\log\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{1}{\sigma_k^2}\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}(x_i - \mu_k)$

But, by Bayes’ Theorem, $\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}$ is the conditional probability of selecting cluster $k$ given that the data point $x_i$ was observed, i.e. $p(k|x_i)$, so that:

$\frac{\log\partial\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{p(k|x_i)}{\sigma_k^2}(x_i - \mu_k)$

By applying a similar procedure to compute the partial derivative with respect to $\sigma_k$ and $p_k$ and finally setting the derivatives we just found to zero, we obtain:

$\mu_k = \frac{\sum_{i=1}^{N}p(k|x_i)x_i}{\sum_{i=1}^{N}p(k|x_i)}$

$\sigma_k = \sqrt{\frac{\sum_{i=1}^{N}p(k|x_i)(x_i - \mu_k)^2}{\sum_{n=1}^{N}p(k|x_i)}}$

$p_k = \frac{1}{N}\sum_{i=1}^{N}p(k|x_i)$

The first two equations turn out to be simply the sample mean and standard deviation of the data weighted by the conditional probability that component $k$ generated the data point $x_i$.

Since the terms $p(k|x_i)$ depend on all the terms on the left-hand side of the expressions above, the equations are hard to solve directly and this is where the EM algorithm comes to rescue. It can be proven that the EM algorithm convergences to a local maximum of the likelihood function when the following computations are iterated:

###### E Step

$p^{(n)}(k|x_i) = \frac{p_k^{(n)}g(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}{\sum_{j=1}^{N}p_jg(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}$

###### M Step

$\mu_k^{(n+1)} = \frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)x_i}{\sum_{i=1}^{N}p^{(n)}(k|x_i)}$

$\sigma_k^{(n+1)} = \sqrt{\frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)(x_i - \mu^{(n+1)})^2}{\sum_{n=1}^{N}p^{(n)}(k|x_i)}}$

$p_k^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N}p^{(n)}(k|x_i)$

Intuitively, in the E-step the parameters of the components are assumed to be given and the data points are soft-assigned to the clusters. In the M-step we compute the updated parameters for our clusters given the new assignment.

Determine K

Now that we have a way to fit a mixture of $K$ gaussians to our data, how do we determine $K$? One way to deal with it is to generate $K$ models and select the best one according to their BIC score. Adding more components to a model will fit the data better but doing so may result in overfitting. BIC prevents this problem by introducing a penalty term for the number of parameters in the model.

Regression Detection

A simple approach to detect changes in the series is to use a rolling window and compare the distribution of the first half of the window to the distribution of the second half. Since we are dealing with Gaussians, we can use the z-statistic to compare the mean of each component in the left window to mean of its corresponding component in the right window:

$z = \frac{\mu_l - \mu_r}{\sqrt{\frac{\sigma_l^2}{n_l} + \frac{\sigma_r^2}{n_r}}}$

In the following plots the red dots are points at which the regression detection would have fired. Ideally the system would generate a single alert per cluster for the first point after the distribution shift.

Talos generates hundreds of different time series, some with dominating and peculiar noise patterns. As such it’s hard to come up with a generic model that solves the problem for good and represents the data perfectly.

Since the API to access this data is public, it provides an exciting opportunity for a contributor to come up with better ways of representing it. Feel free to join us on #perf if you are interested. Oh and, did I mentions we are hiring a Senior Data Engineer?