Correlating Firefox add-ons to slow shutdown times

This is a follow-up to my earlier findings about add-ons and startup times. In this post I am going to dig deeper between the relations of add-ons and shutdown times. A slow shutdown doesn’t seem to be a big deal. If one considers though that a new instance of Firefox can’t be launched if the old one is still shutting down, the issue becomes more serious.

It turns out that for shutdown times a simple linear model is not good enough while a log-linear instead has a reasonably good performance. Log transforming the shutdown times slighly complicates the meaning of the coefficients as they have to be interpreted as the percentage change of the average shutdown time. E.g. if an add-on has a coefficient of 100%, it means that it might (correlation is not causation!) slow down shutdown by 2 times. The idea of using a log-linear model comes from our contributors Jeremy Atia and Martin Gubri [1], which discovered the relationship during a preliminary analysis.

Rplot15Unlike in the startup case, there are fewer stronger relationships here. Some patterns start to emerge though, the Yandex add-on for instance seems to be associated with both slower startup and shutdown timings.

We started to keep track of those results on a weekly basis through a couple iacomus dashboard: one for startup and the other for shutdown times correlations. The dashboards are surely not going to win any design award but they get the job done and didn’t require any effort to setup. I am confident that by spotting consistently ill-behaved add-ons through the time-series we should be able to spot real tangible offenders.

[1] If you love probability, statistics and machine learning and are looking for an open-source project to contribute to, Firefox is a cool place to start! Get in touch with me if that sounds interesting.

Correlating Firefox add-ons to performance bottlenecks

Update: I run the analysis on more data as some add-ons had very few entries with extreme outliers that were skewing the results; I also considered more add-ons.

I started looking into exploiting our Telemetry data to determine which add-ons are causing performance issues with Firefox. So far there are three metrics that I plan to correlate with add-ons:

  • startup time,
  • shutdown time,
  • background hangs.

In this post I am going over my findings for the first scenario, i.e. the relation between startup time and installed add-ons.

In an ideal world, all add-ons would have an uniform way to initialize themselves which could be instrumented. Unfortunately that’s not possible, many add-ons use asynchronous facilities and or rely on observer notifications for initialization. In other words, there is no good way to easily measure the initialization time for all add-ons without possibly touching their codebases individually.

This is the sort of problem that screams for a multi-way ANOVA but, after some thought and data exploration, it turns out that the interaction terms can be dropped between add-ons, i.e. the relation between add-ons and the startup time can be modeled as a pure additive one. Since a multi-way ANOVA is equivalent to a linear regression between a set of predictors and their interactions, the problem can be modeled with a generalized linear model where for each Telemetry submission the add-on map is represented as a boolean vector of dummy variables that can assume a value of 0 or 1 corresponding to “add-on installed” and “add-on not installed”, respectively.

Startup time depends on many other factors that are not taken into account in the model, like current system load and hard drive parameters. This means that it would be very surprising, to say the least, if one could predict the startup time without those variables. That doesn’t mean that we can’t explain part of the variance! In fact, after training the model on the data collected during the past month, it yielded a R^2 score of about 0.15, which in other words means that we can explain about 15% of the variance. Again, as we are not trying to predict the startup time accurately this is not necessarily a bad result. The F ratio, which relates the variance between add-ons to the variance within add-ons, is significant which remarks that having or not certain add-ons installed does influence the startup time.

Many of the p-values of the predictor’s coefficients are highly significant (<< 0.001); it’s just a matter of sorting the significant results by their effect size to determine the add-ons that cause a notable slowdown of Firefox during startup:


The horizontal axis measures the startup time overhead with respect to the average startup time of Firefox. For instance, Yandex Elements seems to be slowing down startup by about 8 seconds on average. The error-bars represent the standard errors of the sampling distributions of the coefficients.

Note that the model is based on a very small fraction of our user-base, i.e. the subset that has Telemetry enabled, so there clearly is some implicit bias. The picture might be different for a truly random sample of our users, nevertheless it is an indication of where to start digging deeper.

The next step is to “dashboardify” the whole thing and contact the developers of the various add-ons. We are also considering notifying users, in a yet to be determined way, when the browser detects add-ons that are known to cause performance issues.

References: map-reduce job

Telemetry meets Clojure.

tldr: Data related telemetry alerts (e.g. histograms or main-thread IO) are now aggregated by medusa, which allows devs to post, view and filter alerts. The dashboard allows to subscribe to search criterias or individual metrics.

As mentioned in my previous post, we recently switched to a dashboard generator, “iacomus“, to visualize the data produced by some of our periodic map-reduce jobs. Given that the dashboards gained some metadata that describes their datasets, writing a regression detection algorithm based on the iacomus data-format followed naturally.

The algorithm generates a time-series for each possible combination of the filtering and sorting criterias of a dashboard, compares the latest data-point to the distribution of the previous N, and generates an alert if it detects an outlier. Stats 101.

Alerts are aggregated by medusa, which provides a RESTful API to submit alerts and exposes a dashboard that allows users to view and filter alerts using regular expressions and subscribe to alerts.

Writing the aggregator and regression detector in Clojure[script] has been a lot of fun. I found particularly attracting the fact that Clojure doesn’t have any big web framework a la Ruby or Python that forces you in one specific mindset. Instead you can roll your own using a wide set of libraries, like:

  • HTTP-Kit, an event-driven HTTP client/server
  • Compojure, a routing library
  • Korma, a SQL DSL
  • Liberator, RESTful resource handlers
  • om, React.js interface for Clojurescript
  • secretary, a client-side routing library

The ability to easily compose functionality from different libraries is exceptionally well explained by a quote from Alan Perlis: “It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures”. And so as it happens instead of each library having its own set of independent abstractions and data-structures, Clojure libraries tend to use mostly just lists, vectors, sets and maps which greatly simplify interoperability.

Lisp gets criticized for its syntax, or lack thereof, but I don’t feel that’s fair. Using any editor that inserts and balances parentheses for you does the trick. I also feel like I didn’t have to run a background thread in my mind to think if what I was writing would please the compiler or not, unlike in Scala for instance. Not to speak of the ability to use macros which allows one to easily extend the compiler with user-defined code. The expressiveness of Clojure means also that more thought is required per LOC but that might be just a side-effect of not being a full-time functional programmer.

What I do miss in the clojure ecosystem is a strong set of tools for statistics and machine learning. Incanter is a wonderful library but, coming from a R and python/scipy background, I had the impression that there is still a lot of catching up to do.

Dasbhoard generator for custom Telemetry jobs

tldr: Next time you are in need of a dashboard similar to the one used to monitor main-thread IO, please consider using my dashboard generator which takes care of displaying periodically generated data.

So you wrote your custom analysis for Telemetry, your map-reduce job is finally giving you the desired data and you want to set it up so that it runs periodically. You will need some sort of dashboard to monitor the weekly runs but since you don’t really care how it’s done what do you do? You copy paste the code of one of our current dashboards, a little tweak here and there and off you go.

That basically describes all of the recent dashboards, like the one for main-thread IO (mea culpa). Writing dashboards is painful when the only thing you care about is data. Once you finally have what you were looking for, the way you present is often considered an afterthought at best. But maintaining N dashboards becomes quickly unpleasant.

But what makes writing and maintaining dashboards so painful exactly? It’s simply that the more controls you have, the more different kind events you have to handle and the easier things get out of hand quickly. You start with something small and beautiful that just displays some csv and presto you end up with what should have been properly described as a state machine but instead is a mess of intertwined event handlers.

What I was looking for was something on the line of Shiny for R, but in javascript and with the option to have a client-only based interface. It turns out that React does more or less what I want. It’s not necessary meant for data analysis so there aren’t any plotting facilities but everything is there to roll your own. What makes exactly Shiny and React so useful is that they embrace reactive programming. Once you define a state and a set of dependencies, i.e. a data flow graph in practical terms, changes that affect the state end up being automatically propagated to the right components. Even though this can be seen as overkill for small dashboards, it makes it extremely easy to extend them when the set of possible states expands, which is almost always what happens.

To make things easier for developers I wrote a dashboard generator, iacumus, for use-cases similar to the ones we currently have. It can be used in simple scenarios when:

  • the data is collected in csv files on a weekly basis, usually using build-ids;
  • the dashboard should compare the current week against the previous one and mark differences in rankings;
  • it should be possible to go back back and forward in time;
  • the dashboard should provide some filtering and sorting criterias.

Iacumus is customizable through a configuration file that is specified through a GET parameter. Since it’s hosted on github, it means you just have to provide the data and don’t even have to spend time deploying the dashboard somewhere, assuming the machine serving the configuration file supports CORS. Here is how the end result looks like using the data for the add-on start-up scan time dashboard. Note that currently Chrome doesn’t handle properly our gzipped datasets and is unable to display anything, in case you wonder…

My next immediate goal is to simplify writing map-reduce jobs for the above mentioned use cases or to the very least write down some guidelines. For instance, some of our dashboards are based on Firefox’s version numbers and not on build-ids, which is really what you want when you desire to make comparisons of Nightly on a weekly basis.

Another interesting thought would be to automatically detect differences in the dashboards and send alerts. That might be not as easy with the current data, since a quick look at the dashboards makes it clear that the rankings fluctuate quite a bit. We would have to collect daily reports and account for the variance of the ranking in those as just using a few weekly datapoints is not reliable enough to account for the deviation.

Regression detection for Telemetry histograms.

tldr: An automatic regression detector system for Telemetry data has been deployed; the detected regressions can be seen in the dashboard.

Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.

Average frame interval during any tab open/close animation (excluding tabstrip scroll).
Average frame interval during any tab open/close animation (excluding tabstrip scroll).

Until lately the only way to monitor those histogram was to sit down and literally stare the screen while something interesting was spotted. Clearly there was the need for an automated system which is able to discern between noise and real regressions.

Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable mean of detecting regressions, improvements and changes in a measurement’s distribution is fundamental as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.

I have looked at various methods to detect changes in histogram, like

  • Correlation Coefficient
  • Chi-Square statistic
  • U statistic (Mann-Whitney)
  • Kolmogorov-Smirnov statistic of the estimated densities
  • One Class Support Vector Machine
  • Bhattacharyya Distance

Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.

For instance a one class SVM wouldn’t be a bad idea if some distributions wouldn’t change dramatically over the course of time due to regressions and/or improvements in our code; so in other words, how do you define how a distribution should look like? You could just take the daily distributions of the past week as training set but that wouldn’t be enough data to get anything meaningful from a SVM. A Chi-Square statistic instead is not always applicable as it doesn’t allow cells with an expected count of 0. We could go on for quite a while and there are ways to get around those issues but the reader is probably more interested in the final solution. I evaluated how well those methods are actually at pinpointing some past known regressions and the Bhattacharyya distance proved to be able to detect the kind of pattern changes we are looking for, like distributions shifts or bin swaps, while minimizing the number of false positives.

Having a relevant distance metric is only part of the deal since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or the one from a week ago? It turns out that trying to mimic what an human would do yields a good algorithm: if

  • the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough and
  • the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K yielding a significant difference and
  • a significant difference is also present in the next K build-ids, then a distribution change is reported.

Furthermore, Histograms that don’t have enough data are filtered out and the cut-off values and parameters are determined empirically from past known regressions.

I am pretty satisfied with the detected regressions so far, for instance the system was able to correctly detect a regression caused by the OMTC patch that landed the 20st of May which caused a significant change in the the average frame interval during tab open animation:

Average frame interval during tab open animation of about:newtab.
Average frame interval during tab open animation of about:newtab.

We will soon roll-out a feature to allow histogram authors to be notified through e-mail when an histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.

Most popular GPUs on Nightly and Release

We have a long-standing bug about Firefox not scrolling smoothly with Intel GPUs. It turns out that recently, thanks to a driver update, the performance has drastically improved for some of our users. While trying to confirm this hypothesis I found some interesting data about Firefox users on Nightly 33 and Release 30 that might come in handy in deciding on what hardware to run our benchmarks in the future.

First things first, let’s have a look at the popularity of the various GPU vendors:

That doesn’t really come as a surprise considering how ubiquitous Intel chips are nowadays. This also means we should concentrate our optimization efforts towards Intel GPUs since it’s there where we can have the biggest impact.

But how well does Firefox perform with GPUs from the above mentioned vendors? It turns out that one of our telemetry metrics, that measures the average time to render a frame during a tab animation, comes in pretty handy to answer this question:

Ideally we would like to reach 60 frames per second for any vendor. Considering that practically every second user has an Intel GPU, the fact that our performance on those chips is not splendid weights even more on our shoulders. Though, as one might suspect, Intel chips are usually not as performant as Nvidia’s or AMD’s ones. There is also a suspicious difference in performance between Nightly and Release; could this be related to Bug 1013262?

The next question is what specific models do our users possess. In order to answer it, let’s have a look at the most popular GPUs that account for 25% of our user base on Nightly:


and on Release:

As you can notice there are only Intel GPUs in the top 25% of the population. The HD 4000 and 3000 are dominating on Nightly while the distribution is much more spread out on Release with older models, like the GMA 4500, being quite popular. Unsurprisingly though the most popular chips are mobile ones.

Let’s take now as reference the most popular chips on Release and see how their performance compares to Nightly:

We can immediately spot that newer chips perform generally better than older ones, unsurprisingly. But more importantly, there is a marked performance difference between the same models on Nightly and Release which will require further investigation. Also noteworthy is that older desktop models outperform newer mobile ones.

And finally we can try to answer the original question by correlating the tab animation performance with the various driver versions. In order to do that, let’s take the Intel HD 4000 and plot the performance for its most popular drivers on the release channel and compare it to the nightly one:

We can notice that there is a clear difference in performance between the older 9.X drivers and the newer 10.X which answers our original question. Unfortunately though only about 25% of our users on the release channel have updated their driver to a recent 10.X version. Also, the difference between the older and newer drivers is more marked on the nightly channel than on the release one.

We are currently working on an alerting system for Telemetry that will notify us when a relevant change appears in our metrics. This should allow us to catch pre-emptively regressions, like Bug 1032185, or improvements, like the one mentioned in this blog post, without accidentally stumbling on it.

Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. During the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of an hypothesis H, given that we have observed data D and have some knowledge of the way the data was generated. In mathematical form:

P(H|D) = \frac{P(D|H)*P(H)}{P(D)}

Where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely our hypothesis H is true in the first place.

That’s all there is, check out my last post to see a basic application of the theorem to find out why many science research findings based purely on statistical inference turn out later to be false.


According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

P(A=F|B=F) = 1 - P(A=T|B=F)

= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1


P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds so we don’t collect data if you don’t grant us explicitly permission to do so.

We might be interested in answering the following question: given that a user has the add-ons A_{1}, ..., A_{k} , what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t possess B but have A_{1}, ..., A_{k} . So let’s use Bayes theorem to derive our probabilities. What we are after is the following probability:

P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}   eq. 1

which corresponds to the ratio between Telemetry submissions that contain A_{1}, ..., A_{k}, B and the submissions that contain all those add-ons but B. Note that if

B \in \{A_{1}, ..., A_{k}\}

then obviously eq. 1 is going have a value of 1 and it isn’t of interest, so we assume that

B \not\in \{A_{1}, ..., A_{k}\}

Since the denominator is constant for any B, we don’t really care about it.

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)   eq. 2

Say we want to be smart and precompute all possible probabilities in order to avoid the work for each request. It’s going to be pretty slow considering that the number of possible subsets of N add-ons is exponential.

If we make the naive assumptions that the add-ons are conditionally independent, i.e. that

P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)

we can rewrite eq. 2 like so

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)

Awesome, this means that once we calculate the N probabilities


where N is the total number of add-ons in the universe, we have everything we need to calculate eq. 2 with just k multiplications.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.

Keep in mind that I didn’t spend time cleaning up the dataset from non interesting add-ons like Norton Toolbar, etc. The reason those add-ons show up is simply because a very large fraction of the population has them.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

y(t) = \alpha sin(2\pi \omega t + \varphi)

where \alpha is the amplitude, \omega is the frequency and \varphi is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise \Sigma_i :

Y_i = A sin(2\pi \Omega t_i + \Phi) + W_i

where Y_i , A , \Omega , \Phi and W_i are random variables and

W_i = \mathcal{N}(0, \sigma^2)

i.e. we assume the variance of the noise is the same for all samples.

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuous number of possible hypothesis, one for each combination of the amplitude, frequency, phase and variance.

f_{A ,\Omega,\Phi,\Sigma|Y}(\alpha , \omega, \varphi, \sigma|y) = \frac  {f_{Y|A , \Omega, \Phi, \Sigma}(y| \alpha , \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma)}  {\iiiint f_{Y|A, \Omega, \Phi, \Sigma}(y| \alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) d\alpha d\omega d\varphi d\sigma}

where the functions f are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that A , \Omega , \Phi and \Sigma are independent then

f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)

where f_{A} , f_{\Omega} , f_{\Phi} and f_{\Sigma} are our priors and since we don’t know much about them let’s assume that each one of them can be modeled as an Uniform distribution.

Assuming the measurements are independent from each other, the likelihood function for the vector of measurements y is

f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)

Now we just have to specify our likelihood function for a single measurement, but since we have an idea of how the data was generated, we can do that easily

f_{Y_i|\alpha, \omega, \varphi, \sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha sin(2\pi \omega t_i + \varphi), \sigma^2)

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the python library pymc. As you can see from IPython notebook, it’s extremely easy to write and evaluate computationally a probabilistic model.

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.