ClientID in Telemetry submissions

A new piece of functionality landed recently that allows Telemetry sessions to be grouped by profile ID. Being able to group sessions by profile turns out to be extremely useful for several reasons. For instance, as some users tend to generate an enormous number of sessions daily, analyses tend to be skewed towards those users.

Take uptime duration; if we just consider the uptime distribution of all sessions collected in a certain timeframe on Nightly we would get a distribution with a median duration of about 15 minutes. But that number isn’t really representative of the median uptime for our users. If we group the submissions by Client ID and compute the median uptime duration for each group, we can build a new distribution that is more representative of the general population:

[Figure: distribution of per-client median uptime]
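The per-client aggregation itself is simple to express; here is a rough Spark-style sketch of the idea (not the actual analysis code), assuming we already have an RDD of (clientId, uptimeMinutes) pairs parsed from the submissions:

// Hypothetical sketch: compute the median uptime per client.
// `sessions` is assumed to be an RDD[(String, Double)] of (clientId, uptimeMinutes).
def median(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2
}

val perClientMedians = sessions
  .groupByKey()                                  // all sessions of a client together
  .mapValues(uptimes => median(uptimes.toSeq))   // one median uptime per client
  .values
  .collect()                                     // the new, per-client distribution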

And we can repeat the exercise for the startup duration, which is expressed in ms:

[Figure: distribution of per-client median startup duration]

Our dashboards are still based on the session distributions, but it's likely that we will provide both session-based and user-based distributions in our next-gen Telemetry dashboard.

Edit:

Please keep in mind that:

  • Telemetry does not collect privacy-sensitive data.
  • You do not have to trust us, you can verify what data Telemetry is collecting in the about:telemetry page in your browser and in aggregate form in our Telemetry dashboards.
  • Telemetry is an opt-in feature on release channels and a feature you can easily disable on other channels.
  • The new Telemetry Client ID does not track users, it tracks Telemetry performance & feature-usage metrics across sessions.

A Telemetry API for Spark

Check out my previous post about Spark and Telemetry data if you want to find out what all the fuss about Spark is. This post is a step-by-step guide on how to run a Spark job on AWS and use our simple Scala API to load Telemetry data.

The first step is to start a machine on AWS using Mark Reid’s nifty dashboard, in his own words:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona (with an @mozilla.com email address as mentioned above).
  2. Click “Launch an ad-hoc analysis worker”.
  3. Enter some details. The “Server Name” field should be a short descriptive name, something like “mreid chromehangs analysis” is good. Upload your SSH public key (this allows you to log in to the server once it’s started up).
  4. Click “Submit”.
  5. A Ubuntu machine will be started up on Amazon’s EC2 infrastructure. Once it’s ready, you can SSH in and run your analysis job. Reload the webpage after a couple of minutes to get the full SSH command you’ll use to log in.

Now connect to the machine and clone my starter project template:


git clone https://github.com/vitillo/mozilla-telemetry-spark.git
cd mozilla-telemetry-spark && source aws/setup.sh

The setup script will install Oracle’s JDK among some other bits. Now we are finally ready to give Spark a spin by launching:

sbt run

The command will run a simple job that computes the Operating System distribution for a small number of pings. It will take a while to complete as sbt, an interactive build tool for Scala, has to download all the required dependencies for Spark on the first run.


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import org.json4s._
import org.json4s.jackson.JsonMethods._

import Mozilla.Telemetry._

object Analysis {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("mozilla-telemetry").setMaster("local[*]")
    implicit val sc = new SparkContext(conf)
    implicit lazy val formats = DefaultFormats

    // Define an RDD over a 10% sample of the pings matching the filter
    val pings = Pings("Firefox", "nightly", "36.0a1", "*", ("20141109", "20141110")).RDD(0.1)

    // Skip the leading submission ID of each ping, parse the JSON payload
    // and count the submissions per operating system
    val osdistribution = pings.map(line => {
      ((parse(line.substring(37)) \ "info" \ "OS").extract[String], 1)
    }).reduceByKey(_ + _).collect

    println("OS distribution:")
    osdistribution.foreach(println)

    sc.stop()
  }
}

If the job completed successfully, you should see something like this in the output:

OS distribution:
(WINNT,4421)
(Darwin,38)
(Linux,271)

To start writing your own job, simply customize src/main/scala/main.scala to your needs. The Telemetry Scala API allows you to define an RDD for a Telemetry filter:

val pings = Pings("Firefox", "nightly", "36.0a1", "*", ("20141109", "20141110")).RDD(0.1)

The above statement retrieves a sample of 10% of all Telemetry submissions of Firefox received on the 9th and 10th of November for any build-id of nightly 36. The last 2 parameters of Pings, i.e. build-id and date, accept either a single value or a tuple specifying a range.
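For example, based on that description, both of the following forms should be accepted (the dates and versions here are just placeholders):

// Single values for build-id and date
val singleDay = Pings("Firefox", "nightly", "36.0a1", "20141110", "20141110").RDD(0.1)

// Ranges expressed as tuples for both build-id and date
val range = Pings("Firefox", "nightly", "36.0a1", ("20141101", "20141110"),
                  ("20141109", "20141110")).RDD(0.1)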

If you are interested in learning more, there is going to be a Spark & Telemetry tutorial at MozLandia next week! I will briefly go over the data layout of Telemetry and how Spark works under the hood, and finally jump into a hands-on interactive analysis session with real data. No prerequisites are required in terms of Telemetry, Spark or distributed computing.

Time: Friday, December 5, 2014, 1 PM
Location: Belmont Room, Marriott Waterfront 2nd, Mozlandia

Update
I have uploaded the slides and the tutorial.

Clustering Firefox hangs

Jim Chen recently implemented a system to collect stacktraces of threads running for more than 500ms. A summary of the aggregated data is displayed in a nice dashboard in which the top N aggregated stacks are shown according to different filters.

I have looked at a different way to group the frames that would help us identify the culprits of main-thread hangs, aka jank. The problem with aggregating stackframes and looking at the top N is that there is a very long tail of stacks that are not considered. It might very well be that by ignoring the tail we are missing out on some important patterns.

So I tried different clustering techniques until I settled on the very simple solution of aggregating the traces by their last frame. Why the last frame? When I used k-means to cluster the traces I noticed that, for many of the more interesting clusters the algorithm found, most stacks had the last frame in common, e.g.:

  • Startup::XRE_Main, (chrome script), Timer::Fire, nsRefreshDriver::Tick, PresShell::Flush, PresShell::DoReflow
  • Startup::XRE_Main, Timer::Fire, nsRefreshDriver::Tick, PresShell::Flush, PresShell::DoReflow
  • Startup::XRE_Main, EventDispatcher::Dispatch, (content script), PresShell::Flush, PresShell::DoReflow

Aggregating by the last frame yields clusters that are big enough to be considered interesting in terms of number of stacktraces and are likely to explain the most common issues our users experience.
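The aggregation itself boils down to a word-count-like job; here is a minimal sketch, assuming `stacks` is an RDD where each element is the list of frame names of one hang stack (the variable name and input shape are assumptions, not the actual analysis code):

// Cluster hang stacks by their last (innermost) frame and rank clusters by size.
val clusters = stacks
  .map(frames => (frames.last, 1))
  .reduceByKey(_ + _)
  .sortBy({ case (frame, count) => count }, ascending = false)
  .collect()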

Currently on Aurora, the top 10 meaningful offending main-thread frames are, in order of importance:

  1. PresShell::DoReflow accounts for 5% of all stacks
  2. nsCycleCollector::collectSlice accounts for 4.5% of all stacks
  3. nsJSContext::GarbageCollectNow accounts for 3% of all stacks
  4. IPDL::PPluginInstance::SendPBrowserStreamConstructor accounts for 3% of all stacks
  5. (chrome script) accounts for 3% of all stacks
  6. filterStorage.js (Adblock Plus?) accounts for 2.7% of all stacks
  7. nsStyleSet::FileRules accounts for 2.7% of all stacks
  8. IPDL::PPluginInstance::SendNPP_Destroy accounts for 2% of all stacks
  9. IPDL::PPluginScriptableObject::SendHasProperty accounts for 2% of all stacks
  10. IPDL::PPluginScriptableObject::SendInvoke accounts for 1.7% of all stacks

Even without showing sample stacks for each cluster, there is some useful information here. The elephants in the room are clearly plugins; or should I say Flash? But just how much do “plugins” hurt our responsiveness? In total, plugin-related traces account for about 15% of all hangs. It also seems that the median duration of a plugin hang is not different from a non-plugin one, i.e. between 1 and 2 seconds.

But just how often does a hang occur during a session? Let’s have a look:

[Figure: distribution of the number of hangs per session]

The median number of hangs for a session amounts to 5; the mean is not that interesting as there are big outliers that skew the data. Also noteworthy is that the median duration of a session is about 16 minutes.

As one would expect, the number of hangs tends to increase with the duration of a session:

[Figure: number of hangs vs. session uptime]

The analysis was run on a week’s worth of data for Aurora (over 50M stackframes) and similar results were obtained on previous weeks.

There is some work in progress to improve the status quo. Aaron Klotz’s formidable async plugin initialization is going to eliminate frame 4 and he might tackle frame 8 in the future. Furthermore, a recent improvement in cycle collection is hopefully going to reduce the impact of frame 2.

Recommending Firefox add-ons with Spark

We are currently evaluating possible replacements for our Telemetry map-reduce infrastructure. As our current data munging machinery isn’t distributed, analyzing days’ worth of data can be quite a pain. Also, many algorithms can’t easily be expressed with a simple map/reduce interface.

So I decided to give Spark another try. “Another” because I had played with it in the past but didn’t feel it was mature enough to be run in production, and apparently I wasn’t the only one to think that. I feel like things have changed though with the latest 1.1 release, and I want to share my joy with you.

What is Spark?

In a nutshell, “Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.”

Spark’s primary abstraction is the Resilient Distributed Dataset (RDD), which one can imagine as a distributed pandas or R data frame. The RDD API comes with all kinds of distributed operations, among which are also our dear map and reduce. Many RDD operations accept user-defined Scala or Python functions as input, which allows the average Joe to write distributed applications like a pro.

An RDD can also be converted to a local Scala/Python data structure, assuming the dataset is small enough to fit in memory. The idea is that once you have chopped away the data you are not interested in, what you are left with fits comfortably on a single machine. Oh, and did I mention that you can issue Spark queries directly from a Scala REPL? That’s great for performing exploratory data analyses.

The greatest strength of Spark though is the ability to cache RDDs in memory. This allows you to run iterative algorithms up to 100x faster than with the typical Hadoop-based map-reduce framework! It has to be remarked that this feature is purely optional: Spark works flawlessly without caching, albeit slower. In fact, in a recent benchmark Spark was able to sort 1PB of data 3x faster using 10x fewer machines than Hadoop, without using the in-memory cache.
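To get a feel for it, here is what a toy interaction in a local Spark shell (started as shown in the next section) might look like; nothing Telemetry-specific, just an illustration of the RDD operations and of caching:

// A toy dataset distributed across the local cores.
val numbers = sc.parallelize(1 to 1000000)

// Keep the filtered RDD in memory so later actions reuse it instead of
// recomputing the whole lineage from scratch.
val evens = numbers.filter(_ % 2 == 0).cache()

evens.count()                       // first action: computes and caches the RDD
evens.map(_.toLong).reduce(_ + _)   // second action: served from the in-memory cache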

Setup

A Spark cluster can run in standalone mode or on top of YARN or Mesos. At the very least, a cluster will need some sort of distributed filesystem, e.g. HDFS or NFS. The easiest way to play with Spark though is to run it locally, e.g. on OS X:


brew install spark
spark-shell --master "local[*]"

The above commands start a Scala shell with a local Spark context. If you are more inclined to run a real cluster, the easiest way to get you going is to launch an EMR cluster on AWS:


aws emr create-cluster --name SparkCluster --ami-version 3.3 --instance-type m3.xlarge \
  --instance-count 5 --ec2-attributes KeyName=vitillo --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark

Then, once connected to the master node, launch Spark on YARN:


MASTER=yarn-client /home/hadoop/spark/bin/spark-shell --num-executors 4 --executor-cores 8 \
  --executor-memory 8g --driver-memory 8g

The parameters of the executors (aka worker nodes) should obviously be tailored to the kind of instances you launched. It’s imperative to spend some time understanding and tuning the configuration options as Spark doesn’t automagically do it for you.

Now what?

Time for some real code. Since Spark makes it so easy to write distributed analyses, the bar for a Hello World application should consequently be much higher. So let’s write a simple, albeit functional, recommender engine for Firefox add-ons.

In order to do that, let’s first quickly go over the math involved. It turns out that, given a matrix of the rankings of each user for each add-on, the problem of finding good recommendations can be reduced to a matrix factorization problem:

[Figure: factorization of the user/add-on rating matrix into latent factor matrices]

The model maps both users and add-ons to a joint latent factor space of dimensionality F. Both users and add-ons are thus seen as vectors in that space. The factors express latent characteristics of add-ons, e.g. whether an add-on is related to security or to UI customization. The ratings are then modeled as inner products in that space, which depend on the angle between the two vectors: the closer the characteristics of an add-on align to the preferences of the user in the latent factor space, the higher the rating.
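In formulas, and glossing over details, each user u gets a vector x_u and each add-on i a vector y_i in that F-dimensional space, and training boils down to the textbook regularized least squares objective (this is the generic latent factor formulation, not code from our job):

\hat{r}_{ui} = x_u^\top y_i, \qquad
\min_{x_*,\, y_*} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)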

But wait, Firefox users don’t really rate add-ons. In fact the only information we have in Telemetry is binary: either a user has a certain add-on installed or he hasn’t. Let’s assume that if someone has a certain add-on installed, he probably likes that add-on. That’s not true in all cases and a more significant metric like “usage time” or similar should be used.

I am not going to delve into the details, but having binary ratings changes the underlying model slightly from the conceptual one we have just seen; the interested reader should have a look at this paper. MLlib, a machine learning library for Spark, comes out of the box with a distributed implementation of ALS (Alternating Least Squares) which implements the factorization.
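For completeness, the implicit-feedback variant from that paper roughly replaces the ratings with binary preferences p_ui and confidence weights c_ui; the 1.0 passed as the last argument of ALS.trainImplicit further down should, if I remember the API correctly, correspond to the \alpha in those weights:

p_{ui} = \begin{cases} 1 & \text{if user } u \text{ has add-on } i \text{ installed} \\ 0 & \text{otherwise} \end{cases}, \qquad
c_{ui} = 1 + \alpha\, r_{ui}

\min_{x_*,\, y_*} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)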

Implementation

Now that we have an idea of the theory, let’s have a look at what the implementation looks like in practice. Let’s start by initializing Spark:


import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("AddonRecommender")
val sc = new SparkContext(conf)

As the ALS algorithm requires tuples of (user, addon, rating), let’s munge the data into place:


// As before, this assumes the json4s imports and an implicit Formats value are in scope.
val ratings = sc.textFile("s3://mreid-test-src/split/").map(raw => {
  // Skip the leading submission ID and parse the JSON payload of the ping
  val parsedPing = parse(raw.substring(37))
  (parsedPing \ "clientID", parsedPing \ "addonDetails" \ "XPI")
}).filter{
  // Remove sessions with missing id or add-on list
  case (JNothing, _) => false
  case (_, JNothing) => false
  case (_, JObject(List())) => false
  case _ => true
}.map{ case (id, xpi) => {
  val addonList = xpi.children.
    map(addon => addon \ "name").
    filter(addon => addon != JNothing && addon != JString("Default"))
  (id, addonList)
}}.filter{ case (id, addonList) => {
  // Remove sessions with empty add-on lists
  !addonList.isEmpty
}}.flatMap{ case (id, addonList) => {
  // Create add-on ratings for each user
  addonList.map(addon => (id.extract[String], addon.extract[String], 1.0))
}}

Here we extract the add-on related data from our JSON Telemetry pings and filter out missing or invalid data. The ratings variable is an RDD and, as you can see, we used the distributed map, filter and flatMap operations on it. In fact, it’s hard to tell the vanilla Scala code apart from the distributed one.

As the current ALS implementation doesn’t accept strings for the user and add-on representations, we will have to convert them to numeric ones. A quick and dirty way of doing that is to hash the strings:


// Positive hash function
def hash(x: String) = x.hashCode & 0x7FFFFF

val hashedRatings = ratings.map{ case(u, a, r) => (hash(u), hash(a), r) }.cache
val addonIDs = ratings.map(_._2).distinct.map(addon => (hash(addon), addon)).cache

We are nearly there. To avoid overfitting, ALS uses regularization, the strength of which is determined by a parameter \lambda. As we don’t know the optimal value of the parameter beforehand, we can try to find it by minimizing the mean squared error over a pre-defined grid of \lambda values using k-fold cross-validation.


// Requires the MLlib imports:
//   import org.apache.spark.mllib.recommendation.{ALS, Rating}
//   import org.apache.spark.mllib.util.MLUtils
//   import scala.math.sqrt

// Use cross-validation to find the optimal regularization parameter lambda
val folds = MLUtils.kFold(hashedRatings, 10, 42)
val lambdas = List(0.1, 0.2, 0.3, 0.4, 0.5)
val iterations = 10
val factors = 100 // use as many factors as computationally possible

val factorErrors = lambdas.flatMap(lambda => {
  folds.map{ case(train, test) =>
    val model = ALS.trainImplicit(train.map{ case(u, a, r) => Rating(u, a, r) }, factors, iterations, lambda, 1.0)
    val usersAddons = test.map{ case (u, a, r) => (u, a) }
    val predictions = model.predict(usersAddons).map{ case Rating(u, a, r) => ((u, a), r) }
    val ratesAndPreds = test.map{ case (u, a, r) => ((u, a), r) }.join(predictions)
    val rmse = sqrt(ratesAndPreds.map { case ((u, a), (r1, r2)) =>
      val err = (r1 - r2)
      err * err
    }.mean)

    (model, lambda, rmse)
  }
}).groupBy(_._2)
  .map{ case(k, v) => (k, v.map(_._3).reduce(_ + _) / v.length) }

Finally, it’s just a matter of training ALS on the whole dataset with the optimal \lambda value and we are ready to use the recommender:


// Train the model with the optimal regularization parameter on all available data
val model = ALS.trainImplicit(hashedRatings.map{case(u, a, r) => Rating(u, a, r)}, factors, iterations, optimalLambda._1, 1.0)

def recommend(userID: Int) = {
  val predictions = model.predict(addonIDs.map(addonID => (userID, addonID._1)))
  val top = predictions.top(10)(Ordering.by[Rating,Double](_.rating))
  top.map(r => (addonIDs.lookup(r.product)(0), r.rating))
}

recommend(hash("UUID..."))

I omitted some details but you can find the complete source on my github repository.

To submit the packaged job to YARN run:


spark-submit --class AddonRecommender --master yarn-client --num-executors 4 \
  --executor-cores 8 --executor-memory 8g addon-recommender_2.10-1.0.jar

So what?

The question is: how well does it perform? The mean squared error isn’t really telling us much, so let’s take some fictional user sessions and see what the recommender spits out.

For user A, who has only the Ghostery add-on installed, the top recommendations are, in order:

  • NoScript
  • Web of Trust
  • Symantec Vulnerability Protection
  • Better Privacy
  • LastPass
  • DuckDuckGo Plus
  • HTTPS-Everywhere
  • Lightbeam
  • Google Translator for Firefox

One could argue that one out of ten recommendations isn’t appropriate for a security aficionado. Now it’s the turn of user B, who has only the Firebug add-on installed:

  • Web Developer
  • FiddlerHook
  • Greasemonkey
  • ColorZilla
  • User Agent Switcher
  • McAfee
  • RealPlayer Browser Record Plugin
  • FirePHP
  • Session Manager

There are just a couple of add-ons that don’t look that great, but the rest could fit the profile of a developer. Now, considering that the recommender was trained on only a couple of days of data for Nightly, I feel like the results could easily be improved with more data and some tuning, like filtering out known antivirus, malware and bloatware.

Popular hw/sw configurations of Firefox users

Knowing the distribution of our users with respect to CPU, GPU, OS etc. is one of those questions that comes up time and time again. After running a custom map-reduce job on our Telemetry data a couple of times, I decided to write a periodic job so that we can keep track of it and quickly get the most up-to-date data.

Here is a distribution tree of all the data collected on the 20th of October on the release channel:

[Figure: distribution tree of hardware/software configurations]

There are many ways of displaying the data, but this is one I find particularly useful as I am more interested in the frequency of the combination of factors than in the distribution of the single factors. The size of each factor tries to be proportional to the frequency of its prefix.
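Computing the raw counts behind such a tree is a one-liner once the pings are parsed; here is a hypothetical sketch (the field names are invented for illustration, the real job is slightly more involved):

// Count how often each hardware/software combination occurs in a day of submissions.
// `configs` is assumed to be an RDD of case class instances with the listed fields.
val configCounts = configs
  .map(c => ((c.os, c.cpuCores, c.memoryGB, c.hasSSD, c.gpuVendor, c.firefoxVersion), 1))
  .reduceByKey(_ + _)
  .sortBy({ case (combo, count) => count }, ascending = false)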

For instance, the most common machine configuration has a spinning disk, 4 GB of memory, 4 cores, an Intel GPU and runs Firefox 32.0.3 on Windows 6.1 (aka Windows 7). Note that the GPU refers to the main one detected when Firefox launches. This means that if a machine has more than one accelerator, only the one active during startup is reported. This is clearly suboptimal and we have a bug on file to address this issue.

The online dashboard also allows you to dig into the individual nodes and shows the cumulative percentage of users for the hovered node.

Correlating Firefox add-ons to slow shutdown times

This is a follow-up to my earlier findings about add-ons and startup times. In this post I am going to dig deeper into the relation between add-ons and shutdown times. A slow shutdown doesn’t seem like a big deal, but if one considers that a new instance of Firefox can’t be launched while the old one is still shutting down, the issue becomes more serious.

It turns out that for shutdown times a simple linear model is not good enough, while a log-linear one has reasonably good performance. Log-transforming the shutdown times slightly complicates the meaning of the coefficients, as they have to be interpreted as the percentage change of the average shutdown time. E.g. if an add-on has a coefficient of 100%, it means that it might (correlation is not causation!) slow down shutdown by a factor of 2. The idea of using a log-linear model comes from our contributors Jeremy Atia and Martin Gubri [1], who discovered the relationship during a preliminary analysis.
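Spelled out, the fitted model looks roughly like this, with x_j being the dummy variable of add-on j; exponentiating a coefficient gives the multiplicative effect of that add-on on the expected shutdown time (this is the textbook log-linear formulation, not the exact code used):

\log\left( t_{\text{shutdown}} \right) = \beta_0 + \sum_j \beta_j x_j + \varepsilon
\quad\Rightarrow\quad
\mathbb{E}\left[ t_{\text{shutdown}} \right] \propto \prod_j e^{\beta_j x_j}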

[Figure: add-on coefficients for shutdown time]

Unlike in the startup case, there are fewer strong relationships here. Some patterns start to emerge though; the Yandex add-on, for instance, seems to be associated with both slower startup and slower shutdown times.

We started to keep track of those results on a weekly basis through a couple of iacomus dashboards: one for startup and the other for shutdown time correlations. The dashboards are surely not going to win any design award, but they get the job done and didn’t require any effort to set up. I am confident that by tracking consistently ill-behaved add-ons through the time series we should be able to spot real, tangible offenders.

[1] If you love probability, statistics and machine learning and are looking for an open-source project to contribute to, Firefox is a cool place to start! Get in touch with me if that sounds interesting.

Correlating Firefox add-ons to performance bottlenecks

Update: I ran the analysis on more data as some add-ons had very few entries with extreme outliers that were skewing the results; I also considered more add-ons.

I started looking into exploiting our Telemetry data to determine which add-ons are causing performance issues with Firefox. So far there are three metrics that I plan to correlate with add-ons:

  • startup time,
  • shutdown time,
  • background hangs.

In this post I am going over my findings for the first scenario, i.e. the relation between startup time and installed add-ons.

In an ideal world, all add-ons would have a uniform way to initialize themselves which could be instrumented. Unfortunately that’s not possible: many add-ons use asynchronous facilities and/or rely on observer notifications for initialization. In other words, there is no good way to easily measure the initialization time for all add-ons without possibly touching their codebases individually.

This is the sort of problem that screams for a multi-way ANOVA but, after some thought and data exploration, it turns out that the interaction terms between add-ons can be dropped, i.e. the relation between add-ons and the startup time can be modeled as a purely additive one. Since a multi-way ANOVA is equivalent to a linear regression between a set of predictors and their interactions, the problem can be modeled with a generalized linear model where, for each Telemetry submission, the add-on map is represented as a boolean vector of dummy variables that assume a value of 1 or 0 corresponding to “add-on installed” and “add-on not installed”, respectively.
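Concretely, the additive model has roughly this form, where x_j is the dummy variable of add-on j and the coefficient \beta_j is the startup overhead associated with having add-on j installed (a sketch of the formulation, not the exact code):

t_{\text{startup}} = \beta_0 + \sum_j \beta_j x_j + \varepsilon, \qquad
x_j = \begin{cases} 1 & \text{if add-on } j \text{ is installed} \\ 0 & \text{otherwise} \end{cases}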

Startup time depends on many other factors that are not taken into account in the model, like the current system load and hard drive parameters. This means that it would be very surprising, to say the least, if one could predict the startup time without those variables. That doesn’t mean that we can’t explain part of the variance! In fact, after training the model on the data collected during the past month, it yielded an R^2 score of about 0.15, which means that we can explain about 15% of the variance. Again, as we are not trying to predict the startup time accurately, this is not necessarily a bad result. The F ratio, which relates the variance between add-ons to the variance within add-ons, is significant, which indicates that whether certain add-ons are installed or not does influence the startup time.

Many of the p-values of the predictors’ coefficients are highly significant (<< 0.001); it’s just a matter of sorting the significant results by their effect size to determine the add-ons that are associated with a notable slowdown of Firefox during startup:

[Figure: startup time overhead per add-on]

The horizontal axis measures the startup time overhead with respect to the average startup time of Firefox. For instance, Yandex Elements seems to slow down startup by about 8 seconds on average. The error bars represent the standard errors of the sampling distributions of the coefficients.

Note that the model is based on a very small fraction of our user-base, i.e. the subset that has Telemetry enabled, so there clearly is some implicit bias. The picture might be different for a truly random sample of our users, nevertheless it is an indication of where to start digging deeper.

The next step is to “dashboardify” the whole thing and contact the developers of the various add-ons. We are also considering notifying users, in a yet to be determined way, when the browser detects add-ons that are known to cause performance issues.

References: map-reduce job