This post is about modelling Talos data with a probabilistic model which can be applied to different use-cases, like detecting regressions and/or improvements over time.

Talos is Mozilla’s multiplatform performance testing framework, written in Python, that we use to run performance tests and collect their statistics after a push.

As a concrete example, this is what the performance data of a test might look like over time:

Even though there is some noise, which is exaggerated in this graph as the vertical axis doesn’t start from 0, we clearly see a shift of the distribution over time. We would like to detect such shifts as soon as possible after they happen.

Talos has long been known to generate bi-modal data points in some cases, which can break our current alerting engine.

Possible reasons for bi-modality are documented in Bug 908888. As past efforts to remove the bi-modal behavior at the source have failed, we have to deal with it in our model.

The following are some notes that originated from conversations with Joel, Kyle, Mauro and Saptarshi.

**Mixture of Gaussians**

The data can be modelled as a mixture of Gaussians, where the number of components $K$ could be determined by fitting several models and selecting the best one according to some criterion.

The first obstacle is to estimate the parameters of the mixture from a set of data points. Let’s state this problem formally; if you are not interested in the mathematical derivation it suffices to know that scikit-learn has an efficient implementation of it.
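For readers who want to skip the derivation, here is a minimal sketch of fitting a two-component mixture with scikit-learn. It assumes a recent scikit-learn release that provides `GaussianMixture`; the synthetic series and its parameters are made up for illustration and are not real Talos data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a bi-modal series (made-up numbers, not Talos data).
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(100.0, 1.0, 300),  # first mode
    rng.normal(110.0, 1.0, 200),  # second mode
])

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
gmm = GaussianMixture(n_components=2, random_state=0).fit(samples.reshape(-1, 1))

# Sort the components by mean so that the output order is stable.
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
stds = np.sqrt(gmm.covariances_.ravel()[order])
weights = gmm.weights_[order]
print(means, stds, weights)
```

The fitted `weights`, `means` and `stds` are estimates of the mixing coefficients, means and standard deviations of the components.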

**EM Algorithm**

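We want to find the probability density $p(x)$, where $p$ is a mixture of $K$ Gaussians, that is most likely to have generated a given data point $x$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k)$$

where $\pi_k$ is the mixing coefficient of cluster $k$, i.e. the probability that a generic point belongs to cluster $k$ so that $\sum_{k=1}^{K} \pi_k = 1$, and $\mathcal{N}$ is the probability density function of the normal distribution:

$$\mathcal{N}(x \mid \mu_k, \sigma_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$

Now, given a set of $N$ data points $x_1, \dots, x_N$ that are independent and identically distributed, we would like to determine the values of $\pi_k$, $\mu_k$ and $\sigma_k$ that maximize the log-likelihood function:

$$L = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k)$$

Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.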
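To make the definitions concrete, the sketch below evaluates a mixture density and its log-likelihood with NumPy. The function names and the parameter values are illustrative assumptions, not part of any Talos code.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, pi, mu, sigma):
    """Mixture density: the pi-weighted sum of the component densities."""
    # The trailing axis of length 1 broadcasts against the K components.
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(pi * normal_pdf(x, mu, sigma), axis=-1)

def log_likelihood(x, pi, mu, sigma):
    """Log-likelihood of iid points: the sum of the log mixture densities."""
    return float(np.sum(np.log(mixture_pdf(x, pi, mu, sigma))))

# Made-up parameters for a two-component mixture.
pi = np.array([0.6, 0.4])
mu = np.array([100.0, 110.0])
sigma = np.array([1.0, 1.5])

# Sanity check: the density should integrate to roughly 1.
grid = np.linspace(80.0, 130.0, 5001)
area = float(np.sum(mixture_pdf(grid, pi, mu, sigma)) * (grid[1] - grid[0]))
print(area)
```

Parameters that explain the data better give a higher value of `log_likelihood`, which is the quantity the derivation below maximizes.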

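To find a maximum of $L$, let’s compute the partial derivative of it wrt $\mu_k$, $\sigma_k$ and $\pi_k$. Since

$$\frac{\partial \mathcal{N}(x_n \mid \mu_k, \sigma_k)}{\partial \mu_k} = \mathcal{N}(x_n \mid \mu_k, \sigma_k) \, \frac{x_n - \mu_k}{\sigma_k^2}$$

then

$$\frac{\partial L}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j)} \, \frac{x_n - \mu_k}{\sigma_k^2}$$

But, by Bayes’ Theorem, $\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j)}$ is the conditional probability of selecting cluster $k$ given that the data point $x_n$ was observed, i.e. $p(k \mid x_n)$, so that:

$$\frac{\partial L}{\partial \mu_k} = \sum_{n=1}^{N} p(k \mid x_n) \, \frac{x_n - \mu_k}{\sigma_k^2}$$

By applying a similar procedure to compute the partial derivative with respect to $\sigma_k$ and $\pi_k$ (using a Lagrange multiplier to enforce the constraint $\sum_{k=1}^{K} \pi_k = 1$) and finally setting the derivatives we just found to zero, we obtain:

$$\mu_k = \frac{\sum_{n=1}^{N} p(k \mid x_n) \, x_n}{\sum_{n=1}^{N} p(k \mid x_n)}$$

$$\sigma_k^2 = \frac{\sum_{n=1}^{N} p(k \mid x_n) \, (x_n - \mu_k)^2}{\sum_{n=1}^{N} p(k \mid x_n)}$$

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k \mid x_n)$$

The first two equations turn out to be simply the sample mean and variance of the data weighted by the conditional probability that component $k$ generated the data point $x_n$.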

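Since the terms $p(k \mid x_n)$ depend on all the terms on the left-hand side of the expressions above, the equations are hard to solve directly and this is where the EM algorithm comes to the rescue. It can be proven that the EM algorithm converges to a local maximum of the likelihood function when the following computations are iterated:

**E Step**

$$p(k \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j)}$$

**M Step**

$$\mu_k = \frac{\sum_{n=1}^{N} p(k \mid x_n) \, x_n}{\sum_{n=1}^{N} p(k \mid x_n)}, \qquad \sigma_k^2 = \frac{\sum_{n=1}^{N} p(k \mid x_n) \, (x_n - \mu_k)^2}{\sum_{n=1}^{N} p(k \mid x_n)}, \qquad \pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k \mid x_n)$$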

Intuitively, in the E-step the parameters of the components are assumed to be given and the data points are soft-assigned to the clusters. In the M-step we compute the updated parameters for our clusters given the new assignment.
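The two steps can be sketched in a few lines of NumPy for one-dimensional data. This is a bare-bones illustration under stated assumptions: the quantile-based initialisation and the stopping rule are choices made here, and there is no safeguard against a component collapsing onto a single point, which a production implementation such as scikit-learn's handles.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, k, n_iter=200, tol=1e-8):
    """Fit a 1-D mixture of k Gaussians to x with plain EM."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Assumed initialisation: spread the means over the quantiles of the data.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: soft-assign each point to the components, shape (n, k).
        weighted = pi * normal_pdf(x[:, None], mu, sigma)
        total = weighted.sum(axis=1, keepdims=True)
        resp = weighted / total
        # M step: weighted mean, standard deviation and mixing coefficient.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / n
        # Log-likelihood of the parameters used in this E step.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma

# Made-up bi-modal data to exercise the function.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(100.0, 1.0, 300), rng.normal(110.0, 1.0, 200)])
pi_hat, mu_hat, sigma_hat = em_gmm_1d(data, k=2)
print(pi_hat, mu_hat, sigma_hat)
```

Because EM only reaches a local maximum, the result depends on the initialisation; running it from several starting points and keeping the best log-likelihood is a common remedy.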

**Determine K**

Now that we have a way to fit a mixture of Gaussians to our data, how do we determine $K$? One way to deal with it is to fit models with different values of $K$ and select the best one according to their Bayesian information criterion (BIC) score. Adding more components to a model will fit the data better but doing so may result in overfitting. BIC mitigates this problem by introducing a penalty term for the number of parameters in the model.
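A minimal sketch of this selection with scikit-learn follows. It assumes `GaussianMixture` and its `bic` method, for which lower scores are better; the helper name `select_gmm_by_bic`, the upper bound of four components and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(samples, max_components=4, seed=0):
    """Fit mixtures with 1..max_components components and keep the lowest BIC."""
    X = np.asarray(samples, dtype=float).reshape(-1, 1)
    scores = {}
    models = {}
    for k in range(1, max_components + 1):
        model = GaussianMixture(n_components=k, n_init=3, random_state=seed).fit(X)
        models[k] = model
        scores[k] = model.bic(X)  # lower is better
    best_k = min(scores, key=scores.get)
    return models[best_k], scores

# Made-up bi-modal data.
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(100.0, 1.0, 300), rng.normal(110.0, 1.0, 200)])
best, scores = select_gmm_by_bic(data)
print(best.n_components, scores)
```

On clearly bi-modal data like this, the single-component model should score noticeably worse than the two-component one, while extra components are discouraged by the penalty term.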

**Regression Detection**

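A simple approach to detect changes in the series is to use a rolling window and compare the distribution of the first half of the window to the distribution of the second half. Since we are dealing with Gaussians, we can use the z-statistic, here in its standard two-sample form, to compare the mean of each component in the left window to the mean of its corresponding component in the right window:

$$z = \frac{\mu_l - \mu_r}{\sqrt{\sigma_l^2 / n_l + \sigma_r^2 / n_r}}$$

where $\mu_l$, $\sigma_l$ and $n_l$ are the mean, standard deviation and number of points of the component in the left window, and $\mu_r$, $\sigma_r$ and $n_r$ are the corresponding quantities in the right window.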
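The rolling comparison can be sketched as follows for a series with a single component; with several components the same test would be repeated per component after assigning the points of each half window to components. The window size, the threshold and the function names are assumptions made for illustration, not the settings of the actual alerting engine.

```python
import numpy as np

def z_statistic(left, right):
    """Two-sample z-statistic for the difference between the means of two windows."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    se = np.sqrt(left.var(ddof=1) / left.size + right.var(ddof=1) / right.size)
    return (left.mean() - right.mean()) / se

def detect_shifts(series, half_window=30, threshold=5.0):
    """Return the indices at which the two halves of the rolling window differ."""
    series = np.asarray(series, dtype=float)
    hits = []
    for i in range(half_window, series.size - half_window + 1):
        z = z_statistic(series[i - half_window:i], series[i:i + half_window])
        if abs(z) > threshold:
            hits.append(i)
    return hits

# Made-up series whose mean shifts from 100 to 103 at index 100.
rng = np.random.default_rng(4)
series = np.concatenate([rng.normal(100.0, 1.0, 100), rng.normal(103.0, 1.0, 100)])
hits = detect_shifts(series)
print(hits)
```

Note that this fires for a run of consecutive indices around the shift; reducing each run to a single alert would need an extra de-duplication step.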

In the following plots the red dots are points at which the regression detection would have fired. Ideally the system would generate a single alert per cluster for the first point after the distribution shift.

Talos generates hundreds of different time series, some with dominating and peculiar noise patterns. As such it’s hard to come up with a generic model that solves the problem for good and represents the data perfectly.

Since the API to access this data is public, it provides an exciting opportunity for a contributor to come up with better ways of representing it. Feel free to join us on #perf if you are interested. Oh, and did I mention we are hiring a Senior Data Engineer?

While better change detection would be useful, I don’t actually think it’s our biggest problem. The t-test method that we currently use works 95% or so of the time; for the few cases where it doesn’t, it’s easy enough to mark a performance alert as “invalid”. Still, since it’s just a matter of plugging in some new code, it should be easy to test whether this technique (or some other) is better. Do you have the source code for this somewhere?

I think the most pressing need is some way of measuring the overall noisiness/bimodality of a series for the perfherder compare view, which is unfortunately rather useless due to the number of false positive “regressions” and “improvements” it shows. See this for example:

https://treeherder.allizom.org/perf.html#/compare?originalProject=fx-team&originalRevision=eedcfc7cefd3&newProject=fx-team&newRevision=fbb5323919c3&showOnlyConfident=1

Many of the changes are actually invalid due to bimodal data (e.g. all the tscrollx regressions/improvements).

Once there is a model that reflects the data, it can be applied to different use-cases. Compare view uses only a few data-points, which makes it extremely hard to detect regressions, even more so in the presence of bimodality.

The approach outlined here could be applied by using test results from past and future revisions to increase the sample sizes. Alternatively one could use the GMM to model data from multiple historical revisions and use that as the “base” and finally compute the joint probability that the new data-points have been generated by the base model.
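The second idea can be sketched with scikit-learn, assuming `GaussianMixture.score_samples`, which returns the log density of each point under the fitted model. The history, the new data points and the helper name are made up for illustration; how low a score must be to count as a regression is left open.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical history pooled from several past revisions (made-up numbers).
rng = np.random.default_rng(3)
history = np.concatenate([rng.normal(100.0, 1.0, 300), rng.normal(110.0, 1.0, 200)])
base = GaussianMixture(n_components=2, random_state=0).fit(history.reshape(-1, 1))

def joint_log_probability(model, points):
    """Sum of per-point log densities, i.e. the log joint density of iid points."""
    X = np.asarray(points, dtype=float).reshape(-1, 1)
    return float(model.score_samples(X).sum())

# New data that matches the base model versus data that sits between the modes.
unchanged = [99.6, 100.4, 110.2, 109.7, 100.1]
shifted = [104.8, 105.3, 105.1, 104.6, 105.2]
score_unchanged = joint_log_probability(base, unchanged)
score_shifted = joint_log_probability(base, shifted)
print(score_unchanged, score_shifted)
```

Points that land on either mode of the base model keep a high score even though they are far apart, which is what makes this comparison tolerant of bimodality.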

You can find the code here:

http://nbviewer.ipython.org/gist/vitillo/81b152144fe72c2d7161