Recommending Firefox add-ons with Spark

We are currently evaluating possible replacements for our Telemetry map-reduce infrastructure. As our current data-munging machinery isn’t distributed, analyzing several days’ worth of data can be quite a pain. Also, many algorithms can’t easily be expressed with a simple map/reduce interface.

So I decided to give Spark another try. “Another” because I have played with it in the past but didn’t feel it was mature enough to be run in production, and apparently I wasn’t the only one to think so. Things have changed with the latest 1.1 release though, and I want to share my joy with you.

What is Spark?

In a nutshell, “Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.”

Spark’s primary abstraction is the Resilient Distributed Dataset (RDD), which one can imagine as a distributed pandas or R data frame. The RDD API comes with all kinds of distributed operations, among which are our dear map and reduce. Many RDD operations accept user-defined Scala or Python functions as input, which allows the average Joe to write distributed applications like a pro.

An RDD can also be converted to a local Scala/Python data structure, assuming the dataset is small enough to fit in memory. The idea is that once you have chopped off the data you are not interested in, what you are left with fits comfortably on a single machine. Oh, and did I mention that you can issue Spark queries directly from a Scala REPL? That’s great for performing exploratory data analyses.
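To give a flavour of the API, here is a toy example of the kind of thing one can type directly into the Scala REPL (sc, the SparkContext, is provided by the shell):

val lengths = sc.parallelize(Seq("foo", "barbaz", "x")) // distributed dataset (RDD)
  .map(_.length)                                        // distributed map
  .filter(_ > 1)                                        // distributed filter
val total = lengths.reduce(_ + _)                       // distributed reduce, yields 9
val local = lengths.collect()                           // back to a local Array(3, 6)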

The greatest strength of Spark though is the ability to cache RDDs in memory. This allows you to run iterative algorithms up to 100x faster than with the typical Hadoop-based map-reduce framework! It has to be remarked though that this feature is purely optional: Spark works flawlessly without caching, albeit more slowly. In fact, in a recent benchmark Spark was able to sort 1PB of data 3X faster using 10X fewer machines than Hadoop, without using the in-memory cache.
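Caching boils down to a single method call. A minimal sketch, with a made-up HDFS path:

val data = sc.textFile("hdfs:///path/to/dataset").map(_.length).cache()  // mark the RDD for caching
(1 to 10).foreach(_ => println(data.reduce(_ + _)))  // only the first iteration reads from disk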

Setup

A Spark cluster can run in standalone mode or on top of YARN or Mesos. At the very least a cluster will need some sort of distributed filesystem, e.g. HDFS or NFS. The easiest way to play with Spark though is to run it locally, e.g. on OS X:


brew install spark
spark-shell --master "local[*]"

The above commands start a Scala shell with a local Spark context. If you are more inclined to run a real cluster, the easiest way to get you going is to launch an EMR cluster on AWS:


aws emr create-cluster --name SparkCluster --ami-version 3.3 --instance-type m3.xlarge \
  --instance-count 5 --ec2-attributes KeyName=vitillo --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark

Then, once connected to the master node, launch Spark on YARN:


MASTER=yarn-client /home/hadoop/spark/bin/spark-shell --num-executors 4 --executor-cores 8 \
  --executor-memory 8g --driver-memory 8g

The parameters of the executors (aka worker nodes) should obviously be tailored to the kind of instances you launched. It’s imperative to spend some time understanding and tuning the configuration options as Spark doesn’t automagically do it for you.

Now what?

Time for some real code. Since Spark makes it so easy to write distributed analyses, the bar for a Hello World application should consequently be much higher. So let’s write a simple, albeit functional, Recommender Engine for Firefox add-ons.

In order to do that, let’s first quickly go over the math involved. It turns out that, given a matrix of the ratings of each user for each add-on, the problem of finding a good recommendation can be reduced to a matrix factorization problem:

[Figure: the user × add-on ratings matrix factored into the product of two low-rank latent factor matrices]

The model maps both users and add-ons to a joint latent factor space of dimensionality F, so both users and add-ons are represented as vectors in that space. The factors express latent characteristics of add-ons, e.g. whether an add-on is related to security or to UI customization. The ratings are then modeled as inner products in that space, which grow with the alignment of the two vectors: the closer the characteristics of an add-on align with the preferences of the user in the latent factor space, the higher the rating.
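In formulas, if x_u and y_i denote the latent vectors of user u and add-on i respectively, the predicted rating is simply their inner product:

\hat{r}_{ui} = x_u^T y_i , \quad x_u, y_i \in \mathbb{R}^F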

But wait, Firefox users don’t really rate add-ons. In fact, the only information we have in Telemetry is binary: either a user has a certain add-on installed or they don’t. Let’s assume that if someone has a certain add-on installed, they probably like it. That’s not true in all cases, and a more meaningful metric like “usage time” should ideally be used.

I am not going to delve into the details, but having binary ratings changes the underlying model slightly from the conceptual one we have just seen; the interested reader should have a look at this paper. MLlib, Spark’s machine learning library, comes out of the box with a distributed implementation of ALS (Alternating Least Squares) which performs the factorization.
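Roughly speaking, the implicit-feedback formulation described in the paper treats the binary observations as preferences p_{ui} weighted by a confidence c_{ui}, and minimizes a regularized cost along these lines:

\min_{x,y} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

where p_{ui} is 1 if user u has add-on i installed and 0 otherwise, and c_{ui} grows with the strength of the observation (e.g. c_{ui} = 1 + \alpha r_{ui}).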

Implementation

Now that we have an idea of the theory, let’s have a look at what the implementation looks like in practice. Let’s start by initializing Spark:


import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("AddonRecommender")
val sc = new SparkContext(conf)

As the ALS algorithm requires tuples of (user, addon, rating), let’s munge the data into place:


// JSON handling below assumes json4s (lift-json exposes an equivalent API)
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats = DefaultFormats  // needed by extract[String] further down

val ratings = sc.textFile("s3://mreid-test-src/split/").map(raw => {
  val parsedPing = parse(raw.substring(37))
  (parsedPing \ "clientID", parsedPing \ "addonDetails" \ "XPI")
}).filter{
  // Remove sessions with missing id or add-on list
  case (JNothing, _) => false
  case (_, JNothing) => false
  case (_, JObject(List())) => false
  case _ => true
}.map{ case (id, xpi) => {
  val addonList = xpi.children.
    map(addon => addon \ "name").
    filter(addon => addon != JNothing && addon != JString("Default"))
  (id, addonList)
}}.filter{ case (id, addonList) => {
  // Remove sessions with empty add-on lists
  !addonList.isEmpty
}}.flatMap{ case (id, addonList) => {
  // Create add-on ratings for each user
  addonList.map(addon => (id.extract[String], addon.extract[String], 1.0))
}}

Here we extract the add-on related data from our JSON Telemetry pings and filter out missing or invalid data. The ratings variable is an RDD and, as you can see, we used the distributed map, filter and flatMap operations on it. In fact it’s hard to tell the vanilla Scala code apart from the distributed one.

As the current ALS implementation doesn’t accept strings for the user and add-on representations, we will have to convert them to numeric ones. A quick and dirty way of doing that is to hash the strings:


// Positive hash function
def hash(x: String) = x.hashCode & 0x7FFFFF

val hashedRatings = ratings.map{ case(u, a, r) => (hash(u), hash(a), r) }.cache
val addonIDs = ratings.map(_._2).distinct.map(addon => (hash(addon), addon)).cache

We are nearly there. To avoid overfitting, ALS uses regularization, the strength of which is determined by a parameter \lambda . As we don’t know the optimal value of the parameter beforehand, we can try to find it by minimizing the root mean squared error over a pre-defined grid of \lambda values using k-fold cross-validation.


import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.mllib.util.MLUtils
import scala.math.sqrt

// Use cross-validation to find the optimal regularization parameter
val folds = MLUtils.kFold(hashedRatings, 10, 42)
val lambdas = List(0.1, 0.2, 0.3, 0.4, 0.5)
val iterations = 10
val factors = 100 // use as many factors as computationally possible

val factorErrors = lambdas.flatMap(lambda => {
  folds.map{ case(train, test) =>
    val model = ALS.trainImplicit(train.map{ case(u, a, r) => Rating(u, a, r) }, factors, iterations, lambda, 1.0)
    val usersAddons = test.map{ case (u, a, r) => (u, a) }
    val predictions = model.predict(usersAddons).map{ case Rating(u, a, r) => ((u, a), r) }
    val ratesAndPreds = test.map{ case (u, a, r) => ((u, a), r) }.join(predictions)
    val rmse = sqrt(ratesAndPreds.map { case ((u, a), (r1, r2)) =>
      val err = (r1 - r2)
      err * err
    }.mean)

    (model, lambda, rmse)
  }
}).groupBy(_._2)
  .map{ case(k, v) => (k, v.map(_._3).reduce(_ + _) / v.length) }

Finally, it’s just a matter of training ALS on the whole dataset with the optimal \lambda value and we are ready to use the recommender:


// Train the model on all available data, using the regularization parameter that
// yielded the lowest RMSE during cross-validation (optimalLambda._1)
val model = ALS.trainImplicit(hashedRatings.map{ case(u, a, r) => Rating(u, a, r) }, factors, iterations, optimalLambda._1, 1.0)

def recommend(userID: Int) = {
  val predictions = model.predict(addonIDs.map(addonID => (userID, addonID._1)))
  val top = predictions.top(10)(Ordering.by[Rating,Double](_.rating))
  top.map(r => (addonIDs.lookup(r.product)(0), r.rating))
}

recommend(hash("UUID..."))

I omitted some details but you can find the complete source on my github repository.

To submit the packaged job to YARN run:


spark-submit --class AddonRecommender --master yarn-client --num-executors 4 \
  --executor-cores 8 --executor-memory 8g addon-recommender_2.10-1.0.jar

So what?

The question is, how well does it perform? The mean squared error isn’t really telling us much, so let’s take some fictional user sessions and see what the recommender spits out.

For user A that has only the add-on Ghostery installed, the top recommendations are, in order:

  • NoScript
  • Web of Trust
  • Symantec Vulnerability Protection
  • Better Privacy
  • LastPass
  • DuckDuckGo Plus
  • HTTPS-Everywhere
  • Lightbeam
  • Google Translator for Firefox

One could argue that only one of those recommendations isn’t appropriate for a security aficionado. Now it’s user B’s turn, who has only the Firebug add-on installed:

  • Web Developer
  • FiddlerHook
  • Greasemonkey
  • ColorZilla
  • User Agent Switcher
  • McAfee
  • RealPlayer Browser Record Plugin
  • FirePHP
  • Session Manager

There are just a couple of add-ons that don’t look that great, but the rest could fit the profile of a developer. Considering that the recommender was trained on only a couple of days of Nightly data, I feel the results could easily be improved with more data and some tuning, like filtering out known antivirus, malware and bloatware.

Popular hw/sw configurations of Firefox users

Knowing the distribution of our users with respect to CPU, GPU, OS, etc. is one of those questions that comes up time and time again. After running a custom map-reduce job on our Telemetry data a couple of times, I decided to write a periodic job so that we can keep track of it and quickly get the most up-to-date data.

Here is a distribution tree of all the data collected on the 20th of October on the release channel:

[Figure: distribution tree of hardware/software configurations]

There are many ways of displaying the data, but this is one I find particularly useful as I am more interested in the frequency of combinations of factors than in the distribution of the individual factors. The size of each factor tries to be proportional to the frequency of its prefix.

For instance, the most common machine configuration has a spinning disk, 4 GB of memory, 4 cores, an Intel GPU and runs Firefox 32.0.3 on Windows 6.1 (aka Windows 7). Note that the GPU refers to the main one detected when Firefox launches. This means that if a machine has more than one accelerator, only the one active during startup is reported. This is clearly suboptimal and we have a bug on file to address this issue.

The online dashboard also allows you to dig into the individual nodes and shows the cumulative percentage of users for the hovered node.

Correlating Firefox add-ons to slow shutdown times

This is a follow-up to my earlier findings about add-ons and startup times. In this post I am going to dig deeper into the relation between add-ons and shutdown times. A slow shutdown might not seem like a big deal, but if one considers that a new instance of Firefox can’t be launched while the old one is still shutting down, the issue becomes more serious.

It turns out that for shutdown times a simple linear model is not good enough, while a log-linear one performs reasonably well. Log-transforming the shutdown times slightly complicates the meaning of the coefficients, as they have to be interpreted as percentage changes of the average shutdown time; e.g. if an add-on has a coefficient of 100%, it means that it might (correlation is not causation!) slow down shutdown by a factor of 2. The idea of using a log-linear model comes from our contributors Jeremy Atia and Martin Gubri [1], who discovered the relationship during a preliminary analysis.
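To make that concrete, here is a sketch of the kind of model being fitted (my notation, not necessarily the exact specification used in the analysis):

\log(t_{shutdown}) = \beta_0 + \sum_i \beta_i x_i + \epsilon

where x_i is a dummy variable indicating whether add-on i is installed; an installed add-on then multiplies the expected shutdown time by e^{\beta_i}.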

[Figure: add-on coefficients for shutdown times]

Unlike in the startup case, there are fewer strong relationships here. Some patterns start to emerge though: the Yandex add-on, for instance, seems to be associated with both slower startup and shutdown timings.

We started to keep track of those results on a weekly basis through a couple of iacomus dashboards: one for startup and one for shutdown time correlations. The dashboards are surely not going to win any design award, but they get the job done and didn’t require any effort to set up. I am confident that by following consistently ill-behaved add-ons through the time series we will be able to identify real, tangible offenders.

[1] If you love probability, statistics and machine learning and are looking for an open-source project to contribute to, Firefox is a cool place to start! Get in touch with me if that sounds interesting.

Correlating Firefox add-ons to performance bottlenecks

Update: I ran the analysis on more data, as some add-ons had very few entries with extreme outliers that were skewing the results; I also considered more add-ons.

I started looking into exploiting our Telemetry data to determine which add-ons are causing performance issues with Firefox. So far there are three metrics that I plan to correlate with add-ons:

  • startup time,
  • shutdown time,
  • background hangs.

In this post I am going over my findings for the first scenario, i.e. the relation between startup time and installed add-ons.

In an ideal world, all add-ons would have a uniform way to initialize themselves which could be instrumented. Unfortunately that’s not possible: many add-ons use asynchronous facilities and/or rely on observer notifications for initialization. In other words, there is no good way to easily measure the initialization time for all add-ons without possibly touching their codebases individually.

This is the sort of problem that screams for a multi-way ANOVA but, after some thought and data exploration, it turns out that the interaction terms between add-ons can be dropped, i.e. the relation between add-ons and startup time can be modeled as a purely additive one. Since a multi-way ANOVA is equivalent to a linear regression between a set of predictors and their interactions, the problem can be modeled with a generalized linear model where, for each Telemetry submission, the add-on map is represented as a boolean vector of dummy variables that take a value of 1 if the add-on is installed and 0 otherwise.
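Just to illustrate the dummy-variable encoding (the actual analysis was not performed with this code, and the add-on names and timings below are invented), here is how one could set up such a regression with Spark’s MLlib from the spark-shell:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Hypothetical submissions: (startup time in seconds, installed add-ons)
val submissions = sc.parallelize(Seq(
  (3.2, Set("AddonA", "AddonB")),
  (1.1, Set("AddonB")),
  (4.8, Set("AddonA", "AddonC")),
  (0.9, Set.empty[String])
))

// Fix an ordering of the add-ons we want to model
val addons = submissions.flatMap(_._2).distinct.collect.sorted

// One dummy variable per add-on: 1.0 if installed, 0.0 otherwise
val points = submissions.map { case (startup, installed) =>
  LabeledPoint(startup, Vectors.dense(addons.map(a => if (installed(a)) 1.0 else 0.0)))
}

// The fitted weights estimate each add-on's additive contribution to startup time
val model = LinearRegressionWithSGD.train(points.cache, 100)
addons.zip(model.weights.toArray).sortBy(-_._2).foreach(println)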

Startup time depends on many other factors that are not taken into account by the model, like the current system load and hard drive parameters. This means that it would be very surprising, to say the least, if one could predict the startup time without those variables. That doesn’t mean we can’t explain part of the variance! In fact, after training the model on the data collected during the past month, it yielded an R^2 score of about 0.15, which in other words means that we can explain about 15% of the variance. Again, as we are not trying to predict the startup time accurately, this is not necessarily a bad result. The F ratio, which relates the variance between add-ons to the variance within add-ons, is significant, which indicates that having certain add-ons installed does influence the startup time.

Many of the p-values of the predictors’ coefficients are highly significant (<< 0.001); it’s just a matter of sorting the significant results by their effect size to determine the add-ons that cause a notable slowdown of Firefox during startup:

[Figure: estimated startup time overhead per add-on]

The horizontal axis measures the startup time overhead with respect to the average startup time of Firefox. For instance, Yandex Elements seems to be slowing down startup by about 8 seconds on average. The error-bars represent the standard errors of the sampling distributions of the coefficients.

Note that the model is based on a very small fraction of our user-base, i.e. the subset that has Telemetry enabled, so there clearly is some implicit bias. The picture might be different for a truly random sample of our users, nevertheless it is an indication of where to start digging deeper.

The next step is to “dashboardify” the whole thing and contact the developers of the various add-ons. We are also considering notifying users, in a yet to be determined way, when the browser detects add-ons that are known to cause performance issues.

References: map-reduce job

Telemetry meets Clojure.

tldr: Data-related Telemetry alerts (e.g. for histograms or main-thread IO) are now aggregated by medusa, which allows devs to post, view and filter alerts. The dashboard also allows subscribing to search criteria or individual metrics.

As mentioned in my previous post, we recently switched to a dashboard generator, “iacomus”, to visualize the data produced by some of our periodic map-reduce jobs. Given that the dashboards gained some metadata that describes their datasets, writing a regression detection algorithm based on the iacomus data format followed naturally.

The algorithm generates a time series for each possible combination of the filtering and sorting criteria of a dashboard, compares the latest data point to the distribution of the previous N, and generates an alert if it detects an outlier. Stats 101.
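The core check really is as simple as it sounds. A minimal sketch, in Scala for consistency with the other code in these posts (the actual detector is written in Clojure, as discussed below, and the threshold here is made up):

// Flag the latest value if it deviates from the previous N by more than k standard deviations
def isOutlier(history: Seq[Double], latest: Double, k: Double = 3.0): Boolean = {
  val mean = history.sum / history.size
  val stddev = math.sqrt(history.map(x => math.pow(x - mean, 2)).sum / history.size)
  stddev > 0 && math.abs(latest - mean) > k * stddev
}

isOutlier(Seq(10.0, 11.0, 9.5, 10.2, 10.8), 25.0)  // true: a sudden jump gets flagged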

Alerts are aggregated by medusa, which provides a RESTful API to submit alerts and exposes a dashboard that allows users to view and filter alerts using regular expressions and subscribe to alerts.

Writing the aggregator and regression detector in Clojure[script] has been a lot of fun. I found particularly attractive the fact that Clojure doesn’t have any big web framework à la Ruby or Python that forces you into one specific mindset. Instead you can roll your own using a wide set of libraries, like:

  • HTTP-Kit, an event-driven HTTP client/server
  • Compojure, a routing library
  • Korma, a SQL DSL
  • Liberator, RESTful resource handlers
  • om, React.js interface for Clojurescript
  • secretary, a client-side routing library

The ability to easily compose functionality from different libraries is exceptionally well captured by a quote from Alan Perlis: “It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures”. And so, as it happens, instead of each library having its own set of independent abstractions and data structures, Clojure libraries tend to use mostly just lists, vectors, sets and maps, which greatly simplifies interoperability.

Lisp gets criticized for its syntax, or lack thereof, but I don’t feel that’s fair. Using any editor that inserts and balances parentheses for you does the trick. I also feel like I didn’t have to run a background thread in my mind to figure out whether what I was writing would please the compiler, unlike in Scala for instance. Not to speak of macros, which allow one to easily extend the compiler with user-defined code. The expressiveness of Clojure also means that more thought is required per LOC, but that might just be a side effect of not being a full-time functional programmer.

What I do miss in the Clojure ecosystem is a strong set of tools for statistics and machine learning. Incanter is a wonderful library but, coming from an R and Python/SciPy background, I had the impression that there is still a lot of catching up to do.

Dashboard generator for custom Telemetry jobs

tldr: Next time you are in need of a dashboard similar to the one used to monitor main-thread IO, please consider using my dashboard generator which takes care of displaying periodically generated data.

So you wrote your custom analysis for Telemetry, your map-reduce job is finally giving you the desired data and you want to set it up so that it runs periodically. You will need some sort of dashboard to monitor the weekly runs, but since you don’t really care how it’s done, what do you do? You copy-paste the code of one of our current dashboards, tweak it a little here and there, and off you go.

That basically describes all of the recent dashboards, like the one for main-thread IO (mea culpa). Writing dashboards is painful when the only thing you care about is the data. Once you finally have what you were looking for, the way you present it is often considered an afterthought at best. But maintaining N dashboards quickly becomes unpleasant.

But what makes writing and maintaining dashboards so painful exactly? It’s simply that the more controls you have, the more different kinds of events you have to handle, and the more easily things get out of hand. You start with something small and beautiful that just displays some CSV and, presto, you end up with what should have been properly described as a state machine but instead is a mess of intertwined event handlers.

What I was looking for was something along the lines of Shiny for R, but in JavaScript and with the option of a client-only interface. It turns out that React does more or less what I want. It’s not necessarily meant for data analysis, so there aren’t any plotting facilities, but everything is there to roll your own. What makes Shiny and React so useful is that they embrace reactive programming. Once you define a state and a set of dependencies, i.e. a data flow graph in practical terms, changes that affect the state are automatically propagated to the right components. Even though this can be seen as overkill for small dashboards, it makes it extremely easy to extend them when the set of possible states expands, which is almost always what happens.

To make things easier for developers I wrote a dashboard generator, iacomus, for use cases similar to the ones we currently have. It can be used in simple scenarios when:

  • the data is collected in CSV files on a weekly basis, usually using build-ids;
  • the dashboard should compare the current week against the previous one and mark differences in rankings;
  • it should be possible to go back and forward in time;
  • the dashboard should provide some filtering and sorting criteria.

Iacomus is customizable through a configuration file that is specified via a GET parameter. Since it’s hosted on github, you just have to provide the data and don’t even have to spend time deploying the dashboard anywhere, assuming the machine serving the configuration file supports CORS. Here is what the end result looks like using the data for the add-on start-up correlations dashboard. Note that currently Chrome doesn’t handle our gzipped datasets properly and is unable to display anything, in case you wonder…

My next immediate goal is to simplify writing map-reduce jobs for the above-mentioned use cases, or at the very least to write down some guidelines. For instance, some of our dashboards are based on Firefox’s version numbers and not on build-ids, even though build-ids are really what you want when comparing Nightly on a weekly basis.

Another interesting thought would be to automatically detect differences in the dashboards and send alerts. That might not be as easy with the current data, since a quick look at the dashboards makes it clear that the rankings fluctuate quite a bit. We would have to collect daily reports and account for the variance of the rankings in those, as using just a few weekly data points is not reliable enough to account for the deviation.

Regression detection for Telemetry histograms.

tldr: An automatic regression detector system for Telemetry data has been deployed; the detected regressions can be seen in the dashboard.

Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.

[Figure: Average frame interval during any tab open/close animation (excluding tabstrip scroll).]

Until lately the only way to monitor those histograms was to sit down and literally stare at the screen until something interesting was spotted. Clearly there was a need for an automated system able to discern between noise and real regressions.

Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable means of detecting regressions, improvements and changes in a measurement’s distribution is fundamental, as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.

I have looked at various methods to detect changes in histograms, like:

  • Correlation Coefficient
  • Chi-Square statistic
  • U statistic (Mann-Whitney)
  • Kolmogorov-Smirnov statistic of the estimated densities
  • One Class Support Vector Machine
  • Bhattacharyya Distance

Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.

For instance, a one-class SVM wouldn’t be a bad idea if some distributions didn’t change dramatically over the course of time due to regressions and/or improvements in our code; in other words, how do you define what a distribution should look like? You could just take the daily distributions of the past week as a training set, but that wouldn’t be enough data to get anything meaningful out of an SVM. A Chi-Square statistic instead is not always applicable, as it doesn’t allow cells with an expected count of 0. We could go on for quite a while and there are ways to get around those issues, but the reader is probably more interested in the final solution. I evaluated how good those methods actually are at pinpointing some past known regressions, and the Bhattacharyya distance proved able to detect the kind of pattern changes we are looking for, like distribution shifts or bin swaps, while minimizing the number of false positives.
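For reference, the Bhattacharyya distance between two histograms over the same bins is straightforward to compute. A minimal sketch (not the actual detector code):

// Normalize the bin counts to probabilities, then D = -ln(sum_i sqrt(p_i * q_i));
// a distance of 0 means the two distributions are identical
def bhattacharyyaDistance(h1: Array[Double], h2: Array[Double]): Double = {
  require(h1.length == h2.length, "histograms must share the same bins")
  val p = h1.map(_ / h1.sum)
  val q = h2.map(_ / h2.sum)
  -math.log(p.zip(q).map { case (pi, qi) => math.sqrt(pi * qi) }.sum)
}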

Having a relevant distance metric is only part of the deal, since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or the one from a week ago? It turns out that trying to mimic what a human would do yields a good algorithm: if

  • the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough, and
  • the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K, yielding a significant difference, and
  • a significant difference is also present in the next K build-ids,

then a distribution change is reported.

Furthermore, histograms that don’t have enough data are filtered out, and the cut-off values and parameters are determined empirically from past known regressions.
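Putting the pieces together, the decision rule might be sketched roughly as follows, reusing the bhattacharyyaDistance helper from above (the thresholds here are placeholders; as mentioned, the real ones are tuned empirically on past regressions):

// histories: histograms of the past N build-ids (oldest first);
// current: histogram of the current build-id; following: histograms of the next build-ids
def distributionChanged(histories: Seq[Array[Double]], current: Array[Double],
                        following: Seq[Array[Double]],
                        cutoff: Double = 0.12, maxVariance: Double = 0.01): Boolean = {
  val pastDistances = histories.map(h => bhattacharyyaDistance(h, current))
  val mean = pastDistances.sum / pastDistances.size
  val variance = pastDistances.map(d => math.pow(d - mean, 2)).sum / pastDistances.size

  val stablePast = variance < maxVariance                                      // condition 1
  val significantNow = bhattacharyyaDistance(histories.last, current) > cutoff // condition 2
  val significantNext =                                                        // condition 3
    following.forall(h => bhattacharyyaDistance(histories.last, h) > cutoff)

  stablePast && significantNow && significantNext
}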

I am pretty satisfied with the detected regressions so far; for instance, the system was able to correctly detect a regression caused by the OMTC patch that landed on the 20th of May and caused a significant change in the average frame interval during tab open animation:

[Figure: Average frame interval during tab open animation of about:newtab.]

We will soon roll out a feature to allow histogram authors to be notified via e-mail when a histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.