A Telemetry API for Spark

Check out my previous post about Spark and Telemetry data if you want to find out what all the fuss is about Spark. This post is a step-by-step guide on how to run a Spark job on AWS and use our simple Scala API to load Telemetry data.

The first step is to start a machine on AWS using Mark Reid’s nifty dashboard. In his own words:

  1. Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona (with an @mozilla.com email address).
  2. Click “Launch an ad-hoc analysis worker”.
  3. Enter some details. The “Server Name” field should be a short descriptive name, something like “mreid chromehangs analysis” is good. Upload your SSH public key (this allows you to log in to the server once it’s started up).
  4. Click “Submit”.
  5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure. Once it’s ready, you can SSH in and run your analysis job. Reload the webpage after a couple of minutes to get the full SSH command you’ll use to log in.

Now connect to the machine and clone my starter project template:


git clone https://github.com/vitillo/mozilla-telemetry-spark.git
cd mozilla-telemetry-spark && source aws/setup.sh

The setup script will install Oracle’s JDK, among other things. Now we are finally ready to give Spark a spin by launching:

sbt run

The command will run a simple job that computes the Operating System distribution for a small number of pings. It will take a while to complete the first time, as sbt, an interactive build tool for Scala, has to download all the dependencies required by Spark.


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import org.json4s._
import org.json4s.jackson.JsonMethods._

import Mozilla.Telemetry._

object Analysis {
  def main(args: Array[String]) {
    // Run Spark locally, using all available cores.
    val conf = new SparkConf().setAppName("mozilla-telemetry").setMaster("local[*]")
    implicit val sc = new SparkContext(conf)
    implicit lazy val formats = DefaultFormats

    // Fetch a 10% sample of nightly 36.0a1 submissions for the given date range.
    val pings = Pings("Firefox", "nightly", "36.0a1", "*", ("20141109", "20141110")).RDD(0.1)

    // Each submission is prefixed by its document id; skip it, parse the JSON
    // payload and count submissions per Operating System.
    val osdistribution = pings.map(line => {
      ((parse(line.substring(37)) \ "info" \ "OS").extract[String], 1)
    }).reduceByKey(_+_).collect

    println("OS distribution:")
    osdistribution.foreach(println)

    sc.stop()
  }
}

If the job completed successfully, you should see something like this in the output:

OS distribution:
(WINNT,4421)
(Darwin,38)
(Linux,271)
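
Once reduceByKey and collect have run, the per-OS counts live in a plain Scala array on the driver, so you can post-process them however you like. As a small sketch, assuming the osdistribution array from the listing above, here is how you could print the counts as percentages:


// Sketch: turn the collected (OS, count) pairs into percentages.
// Assumes `osdistribution` is the Array[(String, Int)] computed in the job above.
val total = osdistribution.map(_._2).sum.toDouble
osdistribution.foreach { case (os, count) =>
  println(f"$os: ${100 * count / total}%.1f%%")
}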

To start writing your job, simply customize src/main/scala/main.scala to your needs. The Telemetry Scala API allows you to define an RDD for a Telemetry filter:

val pings = Pings("Firefox", "nightly", "36.0a1", "*", ("20141109", "20141110")).RDD(0.1)

The above statement retrieves a 10% sample of all Firefox Telemetry submissions received on the 9th and 10th of November for any build-id of nightly 36. The last two parameters of Pings, i.e. build-id and date, accept either a single value or a tuple specifying a range.
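
For instance, here is a minimal sketch that passes single values instead of ranges for those two parameters (the values below are just placeholders, not real build-ids):


// Sketch: build-id and date given as single values rather than (from, to) tuples.
// "20141110" is a placeholder; substitute the build-id and date you actually need.
val dayPings = Pings("Firefox", "nightly", "36.0a1", "20141110", "20141110").RDD(0.1)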

If you are interested in learning more, there is going to be a Spark & Telemetry tutorial at MozLandia next week! I will briefly go over the data layout of Telemetry and how Spark works under the hood, and then jump into a hands-on interactive analysis session with real data. No prior knowledge of Telemetry, Spark or distributed computing is required.

Time: Friday, December 5, 2014, 1 PM
Location: Belmont Room, Marriott Waterfront 2nd, Mozlandia

Update
I have uploaded the slides and the tutorial.
