In my previous post about our new Spark infrastructure, I went into the details of how to launch a Spark cluster on AWS to perform custom analyses on Telemetry data. Sometimes, though, one needs to rerun an analysis periodically over a certain timeframe, usually to feed data into dashboards of various kinds. We are going to roll out a new feature that allows users to upload an IPython notebook to the self-serve data analysis dashboard and run it on a scheduled basis. The notebook will be executed periodically with the chosen frequency, and the result will be made available as an updated IPython notebook.
To schedule a Spark job:
- Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
- Click “Schedule a Spark Job”.
- Enter some details:
  - The “Job Name” field should be a short descriptive name, like “chromehangs analysis”.
  - Upload the IPython notebook containing your analysis.
  - Set the number of workers of the cluster in the “Cluster Size” field.
  - Set a schedule frequency using the remaining fields.
Once a new scheduled job is created, it will appear in the job listing at the top of the scheduling dashboard. When the job runs, its result will be made available as an IPython notebook, visible by clicking the “View Data” entry for your job.
As I briefly mentioned at the beginning, periodic jobs are typically used to feed data into dashboards. Writing a dashboard for a custom job isn’t very pleasant, and in the past I wrote some simple tools to help with that. It turns out, though, that thanks to IPython one doesn’t necessarily need to write a dashboard from scratch but can simply reuse the notebook as the dashboard itself! I mean, why not? That might not be good enough for management-facing dashboards, but it’s acceptable for ones aimed at engineers.
In fact, with IPython we are not limited at all to matplotlib’s static charts. Thanks to Plotly, it’s easy enough to generate interactive plots which allow you to:
- Check the x and y coordinates of every point on the plot by hovering with the cursor.
- Zoom in on the plot and resize lines, points and axes by clicking and dragging the cursor over a region.
- Pan by holding the shift key while clicking and dragging.
- Zoom back out to the original view by double-clicking on the plot.
Plotly comes with its own API, but if you already have a matplotlib-based chart then it’s trivial to convert it to an interactive plot. As a concrete example, I updated my Spark Hello World example with a Plotly chart.
import matplotlib.pyplot as plt
import plotly.plotly as py

fig = plt.figure(figsize=(18, 7))
frame["WINNT"].plot(kind="hist", bins=50)
plt.title("startup distribution for Windows")
plt.ylabel("count")
plt.xlabel("log(firstPaint)")
py.iplot_mpl(fig, strip_style=True)
As you can see, just a single extra line of code is needed for the conversion.
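If you want to try the snippet above outside of the Telemetry environment, here is a minimal, self-contained sketch of the kind of DataFrame it expects. The column name comes from the example, but the simulated log(firstPaint) values and their distribution parameters are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the Telemetry-derived DataFrame used above:
# one column of log(firstPaint) startup times for Windows sessions.
# The lognormal parameters are invented for illustration only.
rng = np.random.default_rng(42)
first_paint_ms = rng.lognormal(mean=7.0, sigma=0.8, size=1000)
frame = pd.DataFrame({"WINNT": np.log(first_paint_ms)})

print(frame["WINNT"].describe())
```

With a `frame` like this in scope, the histogram and the `py.iplot_mpl` conversion above run as-is.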
As WordPress doesn’t support iframes, you are going to have to click on the image and follow the link to see the interactive plot in action.