I wrote a small parallel library for python that implements the permutation test and the bias corrected version of the bootstrap, which gives non-statisticians the ability to exploit confidence intervals and hypothesis tests for arbitrary statistics at the expense of some CPU cycles.
Modern hardware allows us to understand and compute statistics in ways that were not possible when the field was born. Resampling methods allow to quantify uncertainty with fewer assumptions and greater accuracy, at a higher computational cost, a paradigm shift in the mindset of modern statistics.
Confidence intervals are based on the idea of the sampling distribution of a statistics, that is the distribution of values of the statistic over all possible samples of the same size. Given such sampling distribution, it’s easy to build a confidence interval. As we usually have a single sample, statisticians devised formulas to compute a confidence interval assuming that the sampling distribution has a certain well-known shape.
As a concrete example, the sampling distribution of the sample mean has a normal distribution which follows from the central limit theorem. If you are looking for the sampling distribution for say, the trimmed mean or the median, things are considerably harder. Exotic formula do exist in some cases but the bootstrap provides a computational way of approximating the sampling distribution without any assumption on its shape and spread.
The bootstrap of a statistic draws thousands of resamples with replacement from the original sample and computes the distribution of the statistic of those samples. This distribution approximates the shape, spread and bias of the real sampling distribution but is centered at the statistic of the of sample in the best case and can be affected by a considerable bias in the worst case. There are techniques though to remove the bias from the bootstrap distribution.
If the sample is a good approximation of the population, the bootstrap method will provide a good approximation of the sampling distribution. As a rule of thumb you should have at least 50 independent data points before applying the method with at least 1000 bootstrap samples. Also, trying to apply the bootstrap on some very weird statistics that depend on few values of the sample, like the maximum, is a recipe for disaster.
I wrote in the past about the permutation test and how I used it to implement a hypothesis test for Telemetry histograms, so I am not going to reiterate its core ideas here. What’s important to understand though is that it assumes that the observations are exchangeable under the null hypothesis. This implies that the observations viewed individually must be identically distributed.
In my previous about our new Spark infrastructure, I went into the details on how to launch a Spark cluster on AWS to perform custom analyses on Telemetry data. Sometimes though one has the need to rerun an analysis recurrently over a certain timeframe, usually to feed data into dashboards of various kinds. We are going to roll out a new feature that allows users to upload an IPython notebook to the self-serve data analysis dashboard and run it on a scheduled basis. The notebook will be executed periodically with the chosen frequency and the result will be made available as an updated IPython notebook.
To schedule a Spark job:
- Visit the analysis provisioning dashboard at telemetry-dash.mozilla.org and sign in using Persona with an @mozilla.com email address.
- Click “Schedule a Spark Job”.
- Enter some details:
- The “Job Name” field should be a short descriptive name, like “chromehangs analysis”.
- Upload your IPython notebook containt the analysis.
- Set the number of workers of the cluster in the “Cluster Size” field.
- Set a schedule frequency using the remaining fields.
Once a new scheduled job is created it will appear in the top listing of the scheduling dashboard. When the job is run its result will be made available as an IPython notebook visible by clicking on the “View Data” entry of your job.
As I briefly mentioned at the beginning, periodic jobs are typically used to feed data to dashboards. Writing dashboards for a custom job isn’t very pleasant and I wrote in the past some simple tool to help with that. It turns out though that thanks to IPython one doesn’t need necessarily to write a dashboard from scratch but can simple re-use the notebook as the dashboard itself! I mean, why not? That might not be good enough for management facing dashboards but acceptable for ones aimed at engineers.
In fact with IPython we are not limited at all to matplotlib’s static charts. Thanks to Plotly, it’s easy enough to generate interactive plots which allow to:
- Check the x and y coordinates of every point on the plot by hovering with the cursor.
- Zoom in on the plot and resize lines, points and axes by clicking and dragging the cursor over a region.
- Pan by holding the shift key while clicking and dragging.
- Zooms back out to the original version by double clicking on the plot.
Plotly comes with its own API but if you have already a matplotlib based chart then it’s trivial to convert it to an interactive plot. As a concrete example, I updated my Spark Hello World example with a plotly chart.
fig = plt.figure(figsize=(18, 7))
plt.title("startup distribution for Windows")
As you can see, just a single extra line of code is needed for the conversion.
As WordPress doesn’t support iframes, you are going to have to click on the image and follow the link to see the interactive plot in action.