Regression detection for Telemetry histograms.

Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.

Average frame interval during any tab open/close animation (excluding tabstrip scroll).
Average frame interval during any tab open/close animation (excluding tabstrip scroll).

Until lately the only way to monitor those histogram was to sit down and literally stare the screen while something interesting was spotted. Clearly there was the need for an automated system which is able to discern between noise and real regressions.

Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable mean of detecting regressions, improvements and changes in a measurement’s distribution is fundamental as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.

I have looked at various methods to detect changes in histogram, like

  • Correlation Coefficient
  • Chi-Square statistic
  • U statistic (Mann-Whitney)
  • Kolmogorov-Smirnov statistic of the estimated densities
  • One Class Support Vector Machine
  • Bhattacharyya Distance

Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.

For instance a one class SVM wouldn’t be a bad idea if some distributions wouldn’t change dramatically over the course of time due to regressions and/or improvements in our code; so in other words, how do you define how a distribution should look like? You could just take the daily distributions of the past week as training set but that wouldn’t be enough data to get anything meaningful from a SVM. A Chi-Square statistic instead is not always applicable as it doesn’t allow cells with an expected count of 0. We could go on for quite a while and there are ways to get around those issues but the reader is probably more interested in the final solution. I evaluated how well those methods are actually at pinpointing some past known regressions and the Bhattacharyya distance proved to be able to detect the kind of pattern changes we are looking for, like distributions shifts or bin swaps, while minimizing the number of false positives.

Having a relevant distance metric is only part of the deal since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or the one from a week ago? It turns out that trying to mimic what an human would do yields a good algorithm: if

  • the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough and
  • the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K yielding a significant difference and
  • a significant difference is also present in the next K build-ids, then a distribution change is reported.

Furthermore, Histograms that don’t have enough data are filtered out and the cut-off values and parameters are determined empirically from past known regressions.

I am pretty satisfied with the detected regressions so far, for instance the system was able to correctly detect a regression caused by the OMTC patch that landed the 20st of May which caused a significant change in the the average frame interval during tab open animation:

Average frame interval during tab open animation of about:newtab.
Average frame interval during tab open animation of about:newtab.

We will soon roll-out a feature to allow histogram authors to be notified through e-mail when an histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.

Most popular GPUs on Nightly and Release

We have a long-standing bug about Firefox not scrolling smoothly with Intel GPUs. It turns out that recently, thanks to a driver update, the performance has drastically improved for some of our users. While trying to confirm this hypothesis I found some interesting data about Firefox users on Nightly 33 and Release 30 that might come in handy in deciding on what hardware to run our benchmarks in the future.

First things first, let’s have a look at the popularity of the various GPU vendors:

That doesn’t really come as a surprise considering how ubiquitous Intel chips are nowadays. This also means we should concentrate our optimization efforts towards Intel GPUs since it’s there where we can have the biggest impact.

But how well does Firefox perform with GPUs from the above mentioned vendors? It turns out that one of our telemetry metrics, that measures the average time to render a frame during a tab animation, comes in pretty handy to answer this question:

Ideally we would like to reach 60 frames per second for any vendor. Considering that practically every second user has an Intel GPU, the fact that our performance on those chips is not splendid weights even more on our shoulders. Though, as one might suspect, Intel chips are usually not as performant as Nvidia’s or AMD’s ones. There is also a suspicious difference in performance between Nightly and Release; could this be related to Bug 1013262?

The next question is what specific models do our users possess. In order to answer it, let’s have a look at the most popular GPUs that account for 25% of our user base on Nightly:

nightly_pop

and on Release:

As you can notice there are only Intel GPUs in the top 25% of the population. The HD 4000 and 3000 are dominating on Nightly while the distribution is much more spread out on Release with older models, like the GMA 4500, being quite popular. Unsurprisingly though the most popular chips are mobile ones.

Let’s take now as reference the most popular chips on Release and see how their performance compares to Nightly:

We can immediately spot that newer chips perform generally better than older ones, unsurprisingly. But more importantly, there is a marked performance difference between the same models on Nightly and Release which will require further investigation. Also noteworthy is that older desktop models outperform newer mobile ones.

And finally we can try to answer the original question by correlating the tab animation performance with the various driver versions. In order to do that, let’s take the Intel HD 4000 and plot the performance for its most popular drivers on the release channel and compare it to the nightly one:


We can notice that there is a clear difference in performance between the older 9.X drivers and the newer 10.X which answers our original question. Unfortunately though only about 25% of our users on the release channel have updated their driver to a recent 10.X version. Also, the difference between the older and newer drivers is more marked on the nightly channel than on the release one.

We are currently working on an alerting system for Telemetry that will notify us when a relevant change appears in our metrics. This should allow us to catch pre-emptively regressions, like Bug 1032185, or improvements, like the one mentioned in this blog post, without accidentally stumbling on it.