Next-gen Data Analysis Framework for Telemetry

The easier it is to get answers, the more questions will be asked

In that spirit, Mark Reid and I have been working for a while now on a new analysis infrastructure to make it as easy as possible for engineers to get answers to data-related questions.

Our shiny new analysis infrastructure is based primarily on IPython and Spark. I blogged about Spark before and even gave a short tutorial on it at our last workweek in Portland (slides and tutorial); IPython might be something you are not familiar with unless you have a background in science. In a nutshell, it’s a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.

An IPython notebook in all its glory

The combination of IPython and Spark allows you to write data analyses interactively from a browser and seamlessly parallelize them over multiple machines, thanks to a rich API with over 80 distributed operators! It’s a huge leap forward in terms of productivity compared to traditional batch-oriented map-reduce frameworks. An IPython notebook contains both the code and the product of the execution of that code, like plots. Once executed, a notebook can simply be serialized and uploaded to Github. Then, thanks to nbviewer, it can be visualized and shared among colleagues.

The issue with sharing just the end product of an analysis is that it’s all too easy for bugs to creep in or for wrong assumptions to go unnoticed. If your end result is a plot, how do you test it? How do you know that what you are looking at actually reflects the truth? Having the code side by side with its evaluation allows more people to inspect it and streamlines the review process.

This is what you need to do to start your IPython-backed Spark cluster with access to Telemetry data:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
  2. Click “Launch an ad-hoc Spark cluster”.
  3. Enter some details:
    • The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
    • Set the number of workers for the cluster. Please use resources sparingly; a single worker is enough to write and debug your job.
    • Upload your SSH public key.
  4. Click “Submit”.
  5. A cluster will be launched on AWS preconfigured with Spark, IPython and some handy data analysis libraries like pandas and matplotlib.

Once the cluster is ready, you can tunnel IPython through SSH by following the instructions on the dashboard, e.g.:

ssh -i my-private-key -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Finally, you can launch IPython in Firefox by visiting http://localhost:8888.

Now what? Glad you asked. In your notebook listing you will see a Hello World notebook. It’s a very simple analysis that produces the distribution of startup times faceted by operating system for a small fraction of Telemetry submissions; let’s quickly review it here.

We start by importing a telemetry utility to fetch pings and some commonly needed libraries for analysis: a json parser, numpy, pandas and matplotlib.

import ujson as json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from moztelemetry.spark import get_pings

To execute a block of code in IPython, a.k.a. a cell, press Shift-Enter. While a cell is being executed, a gray circle will appear in the upper right border of the notebook. When the circle is full, your code is being executed by the IPython kernel; when only the border of the circle is visible, the kernel is idle and waiting for commands.

Spark exploits parallelism across all the cores of your cluster. To see the degree of parallelism you have at your disposal, simply evaluate:


sc.defaultParallelism

Now, let’s fetch a set of Telemetry submissions and load them into an RDD using the get_pings utility function from the moztelemetry library:

pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid="*",
                  submission_date="20141208",
                  fraction=0.1)

That’s pretty much self-documenting. The fraction parameter, which defaults to 1, selects a random subset of the matching submissions. This comes in handy when you first write your analysis and don’t need to load lots of data to test and debug it.

Note that both the buildid and submission_date parameters also accept a tuple specifying, inclusively, a range of dates, e.g.:


pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid=("20141201", "20141202"),
                  submission_date=("20141202", ""20141208"))

Let’s do something with those pings. Since we are interested in the distribution of the startup time of Firefox faceted by operating system, let’s extract the needed fields from our submissions:

def extract(ping):
    # Parse the JSON payload and pull out the OS and the firstPaint
    # startup measurement, defaulting to -1 when it is missing.
    ping = json.loads(ping)
    os = ping["info"]["OS"]
    startup = ping["simpleMeasurements"].get("firstPaint", -1)
    return (os, startup)

# Keep only pings with a valid startup time and cache the RDD,
# as it is going to be reused in the following steps.
cached = pings.map(extract).filter(lambda p: p[1] > 0).cache()

As the Python API closely matches the Scala one, I suggest having a look at my older Spark tutorial if you are not familiar with Spark. Another good resource is the hands-on exercises from AMP Camp 4.

Now, let’s collect the results and stuff them into a pandas DataFrame. This is a very common pattern: once you have reduced your dataset to a manageable size with Spark, you collect it back on your driver (aka the master machine) and finalize your analysis with statistical tests, plots and whatnot.

grouped = cached.groupByKey().collectAsMap()

frame = pd.DataFrame({x: np.log(pd.Series(list(y))) for x, y in grouped.items()})
frame.boxplot()
plt.ylabel("log(firstPaint)")
plt.show()

Startup distribution by OS

Finally, you can save the notebook, upload it to Github or Bugzilla and visualize it on nbviewer; it’s that simple. Here is the nbviewer-powered Hello World notebook. I warmly suggest that you open a bug report on Bugzilla for your custom Telemetry analysis and ask me or Vladan Djeric to review it. Mozilla has been doing code reviews for years and for good reasons; why should data analyses be different?

Congrats, you just completed your first Spark analysis with IPython! If you need any help with your custom job feel free to drop me a line in #telemetry.

Deep Neural Nets: a brief recap

Neural nets have experienced a surge of interest in the past few years. This post is a summary of the subject from my non-academic point of view.

In the 80s, feed-forward neural networks with a single hidden layer became pretty popular when a simple and efficient algorithm to train them was engineered: backpropagation.

Feed-forward neural network (source: Wikipedia)

As is usual with machine learning algorithms, a naive implementation can be written very concisely without much effort. In fact, all one needs to understand to train a network is the chain rule.

A neural network is nothing more than a hierarchy of composed function applications. Take a net with a single neuron f and loss function L

L(f(\mathbf{x}, \mathbf{w}), y).

If you know how to compute the partial derivative of the loss function L you desire to minimize with respect to its input f(\mathbf{x}, \mathbf{w}), i.e.

\frac{\partial L}{\partial f} ,

and the derivative of f with respect to its weights \mathbf{w}, i.e.

\frac{\partial f}{\partial \mathbf{w}},

you can determine the derivative of L with respect to the weights \mathbf{w} using the chain rule:

\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}

That’s all you need to apply gradient descent and iteratively update the parameters \mathbf{w}:

\mathbf{w} = \mathbf{w} - \alpha \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}, where \alpha is the learning rate,

in order to reach a solution that minimizes the loss function. It’s easy to imagine applying this procedure recursively on a hierarchy of composed functions.
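
To make the recipe concrete, here is a minimal sketch in numpy of the update above for a single sigmoid neuron with a squared-error loss; the choice of activation and loss, as well as the function names, are mine and purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, alpha=0.1, epochs=100):
    # X has one example per row, y holds the targets; w starts at zero.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        f = sigmoid(X.dot(w))                          # f(x, w)
        dL_df = f - y                                  # dL/df for L = 0.5 * (f - y)^2
        df_dw = (f * (1 - f))[:, None] * X             # df/dw for a sigmoid neuron
        grad = (dL_df[:, None] * df_dw).mean(axis=0)   # chain rule: dL/dw
        w -= alpha * grad                              # gradient descent step
    return w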

Unfortunately, training deep networks, i.e. networks with more than one hidden layer, is not a simple task. There are several reasons why the backpropagation algorithm can fail to find a good solution, e.g. vanishing gradients and overfitting come to mind, and the issue has been explored in depth elsewhere. Interest in neural nets faded away when researchers realized this.

Then, in 2006/2007, researchers showed that by using unlabeled data it’s possible to train a deep neural net using a greedy approach. The main idea is based on a network of neurons able to reconstruct the original input with the smallest amount of error, similarly to what Principal Component Analysis does. Such a network is called an Autoencoder and is composed of three layers:

  • the input layer of dimension d
  • a hidden layer of dimension d', with d' < d (simplifying assumption)
  • the output layer of dimension d
Autoencoder (source: UFLDL tutorial)

The hidden layer learns a lower dimensional representation of the input that allows the output layer to reconstruct the original signal with the minimum possible error. Once an Autoencoder is trained, the representation learned by the hidden layer can be used as input for another Autoencoder. The process is repeated, forming a so-called stacked Autoencoder. Effectively, each Autoencoder learns a set of new features from the ones learned by the previous Autoencoder. To give you a concrete example, the first Autoencoder might learn to detect the edges of a picture, the second one contours, and so on.
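
As a rough illustration, a single Autoencoder with tied weights could be sketched along these lines; the dimensions, the sigmoid activation and the tied-weights choice are assumptions made for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_hidden = 100, 20                    # d' < d
W = np.random.randn(d_hidden, d) * 0.01  # encoder weights, reused transposed by the decoder
b_enc, b_dec = np.zeros(d_hidden), np.zeros(d)

def encode(x):
    # lower dimensional representation learned by the hidden layer
    return sigmoid(W.dot(x) + b_enc)

def decode(h):
    # reconstruction of the original input from the hidden representation
    return sigmoid(W.T.dot(h) + b_dec)

def reconstruction_error(x):
    return np.sum((decode(encode(x)) - x) ** 2)

Training minimizes the reconstruction error over the data with the same chain-rule machinery sketched earlier; once trained, encode(x) becomes the input of the next Autoencoder in the stack.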

Once you have trained a stacked Autoencoder, you can initialize the weights of a deep network with N hidden layers using the weights of the N hidden layers of the stacked Autoencoder (pre-training). From there on you can use the neural net just like any ordinary one and train it on labeled data (fine-tuning). Using unlabeled data to pre-train a deep neural net was a big thing back in 2006/2007. This older talk by Geoff Hinton was truly inspiring, but things have changed since then.

In 2012, Hinton et al. showed that it is possible to train a deep convolutional neural net to classify images without any sort of pre-training, and at the same time beat traditional Computer Vision approaches. Since then unsupervised pre-training has mostly fallen out of favor as a research topic but, nevertheless, it was the catalyst that led to more funding and ultimately to where we are now.

A state of the art deep convolutional neural network for image classification is based on a handful of powerful ingredients:

  • many hidden layers (or it wouldn’t be a deep net);
  • convolutional layers followed by pooling layers early on;
  • rectified linear units instead of the classic sigmoid activation function, to learn faster;
  • dropout to approximate the average result of many nets, to reduce overfitting.

The devil is in the details, but the basic concepts and ideas are easy to grasp, and a simple implementation can be written in a couple of afternoons.
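
For instance, two of those ingredients, rectified linear units and (inverted) dropout, boil down to a few lines of numpy; the code below is only a sketch of the idea:

import numpy as np

def relu(z):
    # rectified linear unit: unlike the sigmoid it does not saturate for
    # positive inputs, which helps gradients flow through deep nets
    return np.maximum(0, z)

def dropout(activations, p=0.5):
    # inverted dropout: randomly silence units at training time and rescale
    # the survivors, so nothing needs to change at test time
    mask = (np.random.rand(*activations.shape) > p) / (1.0 - p)
    return activations * mask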

The real difficulty is more of an engineering one: how do you write the most efficient code? If I whetted your appetite and you are curious to have a look at a serious implementation, you should check out cuda-convnet, an extremely well written and efficient multi-GPU convolutional neural net implementation.

A/B test for Telemetry histograms

A/B tests are a simple way to determine the effect of a change in a software product against a baseline, i.e. version A against version B. An A/B test is essentially an experiment that randomly assigns either the control or the experiment condition to each user. It’s an extremely effective method to ascertain causality, which is hard, at best, to infer with statistical methods alone. Telemetry comes with its own A/B test implementation, Telemetry Experiments.

Depending on the type of data collected and the question asked, different statistical techniques are used to verify if there is a difference between the experiment and control version:

  1. Does the rate of success of X differ between the two versions?
  2. Does the average value of Y differ between the two versions?
  3. Does the average time to event Z differ between the two versions?

Those are just the most commonly used methods.

The frequentist statistical hypothesis testing framework is based on a conceptually simple idea: assuming that we live in a world where a certain baseline hypothesis (null hypothesis) is valid, what’s the probability of obtaining the results we observed? If the probability is very low, i.e. under a certain threshold, we gain confidence that the effect we are seeing is genuine.

To give you a concrete example, say I have reason to believe that the average battery duration of my new phone is 5 hours, while the manufacturer claims it’s 5.5 hours. If we assume the battery duration is indeed 5.5 hours on average (null hypothesis), what’s the probability of measuring an average duration that is 30 minutes lower, or even less? If that probability is small enough, say under 5%, we “reject” the null hypothesis. Note that there are many things that can go wrong with this framework and one has to be careful in interpreting the results.
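
For example, the battery scenario could be phrased as a one-sample t-test; the measurements below are made up purely to show the mechanics:

import numpy as np
from scipy import stats

# Hypothetical battery durations in hours, invented for illustration.
measurements = np.array([5.1, 4.8, 5.3, 4.9, 5.0, 5.2, 4.7, 5.0])

# Null hypothesis: the true average duration is 5.5 hours.
# ttest_1samp reports a two-sided p-value; since we only care about
# durations lower than the claim, halving it approximates the one-sided test.
t_stat, p_value = stats.ttest_1samp(measurements, 5.5)
if measurements.mean() < 5.5 and p_value / 2 < 0.05:
    print("Reject the null hypothesis (one-sided p = %.4f)" % (p_value / 2))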

Telemetry histograms are a different beast though. Each user submits his or her own histogram for a certain metric, and the histograms are then aggregated across all users for version A and version B. How do you determine if there is a real difference or if what you are looking at is just noise? A chi-squared test would seem the most natural choice, but on second thought its assumptions are not met, as the entries in the aggregated histograms are not independent from each other. Luckily we can avoid having to sit down and come up with a new mathematically sound statistical test. Meet the permutation test.

Say you have a sample of metric M for users of version A and a sample of metric M for users of version B. You measure a difference d between the means of the samples. Now you assume there is no difference between A and B, randomly shuffle entries between the two samples and compute the difference of the means again. You do this again, and again, and again… What you end up with is a distribution D of the differences of the means over all the reshuffled samples. Now you compute the probability of getting the original difference d, or a more extreme value, by chance, and welcome our newborn hypothesis test!
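
Here is a minimal sketch of that procedure for the simple case of two samples of a metric, before moving on to histograms; the function name and the choice of the absolute difference of means are mine:

import numpy as np

def permutation_test_means(a, b, num=10000):
    observed = abs(a.mean() - b.mean())      # the original difference d
    pooled = np.concatenate([a, b])
    k = 0
    for _ in range(num):
        np.random.shuffle(pooled)            # pretend A and B are interchangeable
        shuffled = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        k += shuffled >= observed
    return float(k) / num                    # probability of seeing d, or worse, by chance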

Going back to our original problem of comparing aggregated histograms for the experiment and control group: instead of means we have aggregated histograms, and instead of computing a difference we consider a distance between them; everything else remains the same as in the previous example:


import numpy as np
import pandas as pd

def mc_permutation_test(xs, ys, num):
    # xs and ys are pandas Series holding one histogram per user;
    # histogram_distance is assumed to be defined elsewhere in the notebook.
    n, k = len(xs), 0
    h1 = xs.sum()
    h2 = ys.sum()

    # distance between the two aggregated histograms as actually observed
    diff = histogram_distance(h1, h2)
    zs = pd.concat([xs, ys])
    zs.index = np.arange(0, len(zs))

    for j in range(num):
        # reshuffle the individual histograms between the two groups
        zs = zs.reindex(np.random.permutation(zs.index))
        h1 = zs[:n].sum()
        h2 = zs[n:].sum()
        # count how often chance alone produces a distance at least as large
        k += diff < histogram_distance(h1, h2)

    # fraction of reshuffles that beat the observed distance, i.e. the p-value
    return float(k) / num
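
To give an idea of how this could be used: assuming xs and ys are pandas Series holding one histogram (an array of bucket counts) per user, and picking, say, the Euclidean distance between the normalized aggregated histograms as histogram_distance, a run might look like the following (the data and the distance function here are made up):

import numpy as np
import pandas as pd

def histogram_distance(h1, h2):
    # one possible choice: Euclidean distance between normalized histograms
    h1, h2 = h1.astype(float), h2.astype(float)
    return np.linalg.norm(h1 / h1.sum() - h2 / h2.sum())

# Fake data: 500 users per group, each submitting a 3-bucket histogram.
control = pd.Series([np.random.multinomial(100, [0.2, 0.3, 0.5]) for _ in range(500)])
experiment = pd.Series([np.random.multinomial(100, [0.25, 0.3, 0.45]) for _ in range(500)])

p_value = mc_permutation_test(control, experiment, 1000)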

Most statistical tests were created at a time when there were no [fast] computers around, but nowadays churning through a Monte Carlo permutation test is not a big deal and one can easily run such a test in a reasonable amount of time.