Next-gen Data Analysis Framework for Telemetry

The easier it is to get answers, the more questions will be asked

In that spirit, Mark Reid and I have been working for a while now on a new analysis infrastructure to make it as easy as possible for engineers to get answers to data-related questions.

Our shiny new analysis infrastructure is based primarily on IPython and Spark. I have blogged about Spark before and even gave a short tutorial on it at our last workweek in Portland (slides and tutorial); IPython might be something you are not familiar with unless you have a background in science. In a nutshell, it’s a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.

[Figure: An IPython notebook in all its glory]

The combination of IPython and Spark allows you to write data analyses interactively from a browser and seamlessly parallelize them over multiple machines, thanks to a rich API with over 80 distributed operators! It’s a huge leap forward in terms of productivity compared to traditional batch-oriented map-reduce frameworks. An IPython notebook contains both the code and the products of its execution, like plots. Once executed, a notebook can simply be serialized and uploaded to Github. Then, thanks to nbviewer, it can be visualized and shared among colleagues.
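
To give a taste of that API, here is a minimal, hypothetical sketch, not taken from the Hello World notebook, of a couple of distributed operators applied to a toy dataset; sc is the SparkContext the cluster provides:

# Hypothetical toy example: distribute a local collection and transform it in parallel
rdd = sc.parallelize(range(1000))
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
even_squares.count()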

The issue with sharing just the end product of an analysis is that it’s all too easy for bugs to creep in or for wrong assumptions to go unnoticed. If your end result is a plot, how do you test it? How do you know that what you are looking at actually reflects the truth? Having the code side by side with its evaluation allows more people to inspect it and streamlines the review process.

This is what you need to do to start your IPython backed Spark cluster with access to Telemetry data:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
  2. Click “Launch an ad-hoc Spark cluster”.
  3. Enter some details:
    • The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
    • Set the number of workers for the cluster. Please use resources sparingly; a single worker is enough to write and debug your job.
    • Upload your SSH public key.
  4. Click “Submit”.
  5. A cluster will be launched on AWS preconfigured with Spark, IPython and some handy data analysis libraries like pandas and matplotlib.

Once the cluster is ready, you can tunnel IPython through SSH by following the instructions on the dashboard, e.g.:

ssh -i my-private-key -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Finally, you can launch IPython in Firefox by visiting http://localhost:8888.

Now what? Glad you asked. In your notebook listing you will see a Hello World notebook. It’s a very simple analysis that produces the distribution of startup times faceted by operating system for a small fraction of Telemetry submissions; let’s quickly review it here.

We start by importing a telemetry utility to fetch pings and some commonly needed libraries for analysis: a JSON parser, numpy, pandas and matplotlib.

import ujson as json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from moztelemetry.spark import get_pings

To execute a block of code in IPython, aka a cell, press Shift-Enter. While a cell is being executed, a gray circle appears in the upper right corner of the notebook. When the circle is filled, your code is being executed by the IPython kernel; when only the border of the circle is visible, the kernel is idle and waiting for commands.

Spark exploits parallelism across all the cores of your cluster. To see the degree of parallelism you have at your disposal, simply evaluate:

sc.defaultParallelism

Now, let’s fetch a set of Telemetry submissions and load them into an RDD using the get_pings utility function from the moztelemetry library:

pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid="*",
                  submission_date="20141208",
                  fraction=0.1)

That’s pretty much self-documenting. The fraction parameter, which defaults to 1.0, selects a random subset of the matching submissions. This comes in handy when you first write your analysis and don’t need to load lots of data to test and debug it.
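
As a quick sanity check, you can count how many submissions your chosen fraction actually yielded before running the full analysis; this step is not part of the Hello World notebook:

# Count the sampled submissions (triggers a Spark job)
pings.count()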

Note that both the buildid and submission_date parameters also accept a tuple specifying, inclusively, a range of dates, e.g.:

pings = get_pings(sc,
                  appName="Firefox",
                  channel="nightly",
                  version="*",
                  buildid=("20141201", "20141202"),
                  submission_date=("20141202", ""20141208"))

Let’s do something with those pings. Since we are interested in the distribution of the startup time of Firefox faceted by operating system, let’s extract the needed fields from our submissions:

def extract(ping):
    ping = json.loads(ping)
    os = ping["info"]["OS"]
    # Use -1 as a sentinel for submissions without a firstPaint measurement
    startup = ping["simpleMeasurements"].get("firstPaint", -1)
    return (os, startup)

# Keep only submissions with a valid startup time and cache the result for reuse
cached = pings.map(lambda ping: extract(ping)).filter(lambda p: p[1] > 0).cache()

As the Python API closely matches the Scala one, I suggest having a look at my older Spark tutorial if you are not familiar with Spark. The hands-on exercises from AMP Camp 4 are another good resource.
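
For example, counting the cached submissions per operating system takes a single operator; this is just an illustrative extra step, not part of the Hello World notebook:

# Count how many valid submissions we have per OS
cached.countByKey()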

Now, let’s collect the results back and stuff them into a pandas DataFrame. This is a very common pattern: once you have reduced your dataset to a manageable size with Spark, you collect it back on your driver (aka the master machine) and finalize your analysis with statistical tests, plots and whatnot.

grouped = cached.groupByKey().collectAsMap()

frame = pd.DataFrame({x: np.log(pd.Series(list(y))) for x, y in grouped.items()})
frame.boxplot()
plt.ylabel("log(firstPaint)")
plt.show()
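
If you prefer numbers to plots, the same DataFrame can be summarized with pandas; again, an optional extra step not in the Hello World notebook:

# Per-OS summary statistics of log(firstPaint)
frame.describe()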

[Figure: Startup distribution by OS]

Finally, you can save the notebook, upload it to Github or Bugzilla and visualize it on nbviewer; it’s that simple. Here is the nbviewer-powered Hello World notebook. I warmly suggest that you open a bug report on Bugzilla for your custom Telemetry analysis and ask me or Vladan Djeric to review it. Mozilla has been doing code reviews for years and with good reason, so why should data analyses be any different?

Congrats, you just completed your first Spark analysis with IPython! If you need any help with your custom job feel free to drop me a line in #telemetry.