Confidence intervals and hypothesis tests for engineers

I wrote a small parallel library for Python that implements the permutation test and the bias-corrected version of the bootstrap, giving non-statisticians the ability to use confidence intervals and hypothesis tests for arbitrary statistics at the expense of some CPU cycles.

Modern hardware allows us to understand and compute statistics in ways that were not possible when the field was born. Resampling methods let us quantify uncertainty with fewer assumptions and greater accuracy, at a higher computational cost, a paradigm shift in the mindset of modern statistics.

Confidence intervals are based on the idea of the sampling distribution of a statistic, that is, the distribution of values the statistic takes over all possible samples of the same size. Given such a sampling distribution, it’s easy to build a confidence interval. Since we usually have only a single sample, statisticians devised formulas to compute a confidence interval assuming that the sampling distribution has a certain well-known shape.
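
For the familiar case of the mean, the formula-based interval takes just a couple of lines. Here is a minimal sketch of the textbook normal-approximation interval, using NumPy and SciPy on a made-up sample:

import numpy as np
from scipy import stats

# Hypothetical sample of 100 page-load times, in seconds.
np.random.seed(42)
sample = np.random.exponential(scale=2.0, size=100)

# Classical 95% confidence interval for the mean: by the central limit
# theorem the sampling distribution of the mean is approximately normal,
# so we take mean +/- z * standard error.
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
z = stats.norm.ppf(0.975)
print((mean - z * se, mean + z * se))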

As a concrete example, the sampling distribution of the sample mean is approximately normal, which follows from the central limit theorem. If you are looking for the sampling distribution of, say, the trimmed mean or the median, things are considerably harder. Exotic formulas do exist in some cases, but the bootstrap provides a computational way of approximating the sampling distribution without any assumptions about its shape and spread.

The bootstrap of a statistic draws thousands of resamples with replacement from the original sample and computes the distribution of the statistic over those resamples. This distribution approximates the shape, spread and bias of the real sampling distribution, but it is centered at the statistic of the original sample in the best case and can be affected by a considerable bias in the worst case. There are techniques, though, to remove the bias from the bootstrap distribution.

[Figure: the bootstrap]
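
To make the procedure concrete, here is a minimal sketch of a plain percentile bootstrap, without the bias correction mentioned above, applied to the median of a made-up sample. This is just an illustration of the idea, not the implementation in my library; the bias-corrected variants adjust which percentiles are used as the interval endpoints.

import numpy as np

def bootstrap_ci(sample, statistic, num_resamples=10000, alpha=0.05):
    # Plain percentile bootstrap confidence interval, no bias correction.
    sample = np.asarray(sample)
    values = np.empty(num_resamples)
    for i in range(num_resamples):
        # Resample with replacement from the original sample and
        # recompute the statistic on the resample.
        resample = np.random.choice(sample, size=len(sample), replace=True)
        values[i] = statistic(resample)
    # The empirical distribution of these values approximates the real
    # sampling distribution; its percentiles give the interval endpoints.
    return np.percentile(values, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example: 95% bootstrap interval for the median of a skewed sample.
np.random.seed(0)
data = np.random.exponential(scale=2.0, size=200)
print(bootstrap_ci(data, np.median))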

If the sample is a good approximation of the population, the bootstrap method will provide a good approximation of the sampling distribution. As a rule of thumb, you should have at least 50 independent data points before applying the method, with at least 1000 bootstrap resamples. Also, applying the bootstrap to statistics that depend on only a few values of the sample, like the maximum, is a recipe for disaster.
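
A few lines of simulation, with made-up data, show why the maximum misbehaves: roughly two thirds of the resamples contain the original maximum, so the bootstrap distribution collapses onto a handful of values instead of approximating a smooth sampling distribution:

import numpy as np

np.random.seed(0)
sample = np.random.uniform(0, 1, size=100)

# Bootstrap distribution of the sample maximum.
maxima = np.array([np.random.choice(sample, size=len(sample), replace=True).max()
                   for _ in range(10000)])

print(len(np.unique(maxima)))           # only a handful of distinct values
print(np.mean(maxima == sample.max()))  # roughly 63% of resamples hit the original maximum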

I wrote in the past about the permutation test and how I used it to implement a hypothesis test for Telemetry histograms, so I am not going to reiterate its core ideas here. What’s important to understand, though, is that it assumes the observations are exchangeable under the null hypothesis. This implies that the observations, viewed individually, must be identically distributed.

A/B test for Telemetry histograms

A/B tests are a simple way to determine the effect of a change in a software product against a baseline, i.e. version A against version B. An A/B test is essentially an experiment that randomly assigns a control or experiment condition to each user. It’s an extremely effective method to ascertain causality, which is hard, at best, to infer with statistical methods alone. Telemetry comes with its own A/B test implementation, Telemetry Experiments.

Depending on the type of data collected and the question asked, different statistical techniques are used to verify whether there is a difference between the experiment and control versions:

  1. Does the rate of success of X differ between the two versions?
  2. Does the average value of Y differ between the two versions?
  3. Does the average time to event Z differ between the two versions?

Those are just the most commonly used methods.
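
To make this concrete, here is a sketch of the kind of off-the-shelf tests those questions map onto, with made-up numbers. This is only an illustration, not necessarily what Telemetry Experiments uses under the hood:

from scipy import stats

# 1. Rate of success: compare success/failure counts for the two
#    versions with a chi-squared test on the 2x2 contingency table.
table = [[420, 580],   # version A: successes, failures (made-up numbers)
         [465, 535]]   # version B
chi2, p_rate, dof, expected = stats.chi2_contingency(table)

# 2. Average value of Y: compare the two samples with a two-sample
#    t-test (Welch's variant drops the equal-variance assumption).
y_a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
y_b = [5.5, 5.7, 5.2, 5.9, 5.6, 5.4]
t, p_mean = stats.ttest_ind(y_a, y_b, equal_var=False)

# 3. Average time to event is usually handled with survival-analysis
#    tools (e.g. a log-rank test), since some users never trigger the event.

print(p_rate, p_mean)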

The frequentist statistical hypothesis testing framework is based on a conceptually simple idea: assuming that we live in a world where a certain baseline hypothesis (null hypothesis) is valid, what’s the probability of obtaining the results we observed? If the probability is very low, i.e. under a certain threshold, we gain confidence that the effect we are seeing is genuine.

To give you a concrete example, say I have reason to believe that the average battery life of my new phone is 5 hours, but the manufacturer claims it’s 5.5 hours. If we assume the average battery life is indeed 5.5 hours (the null hypothesis), what’s the probability of measuring an average duration that is 30 minutes lower? If that probability is small enough, say under 5%, we “reject” the null hypothesis. Note that there are many things that can go wrong with this framework and one has to be careful in interpreting the results.
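
Here is what that calculation might look like with a handful of made-up battery measurements, using a one-sided one-sample t-test as the off-the-shelf tool:

import numpy as np
from scipy import stats

# Hypothetical measurements of battery life in hours (sample mean ~5.0).
measurements = np.array([4.8, 5.1, 4.9, 5.2, 4.7, 5.0, 5.3, 4.6, 5.1, 4.9])

# Null hypothesis: the true mean is 5.5 hours. One-sided question: how
# likely is a sample mean this low, or lower, if the null is true?
mu0 = 5.5
se = measurements.std(ddof=1) / np.sqrt(len(measurements))
t_stat = (measurements.mean() - mu0) / se
p_value = stats.t.cdf(t_stat, df=len(measurements) - 1)

print(p_value)  # well under 5%, so we would reject the null hypothesis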

Telemetry histograms are a different beast though. Each user submits their own histogram for a certain metric, and the histograms are then aggregated across all users for version A and version B. How do you determine whether there is a real difference or whether what you are looking at is just noise? A chi-squared test would seem the most natural choice, but on second thought its assumptions are not met, as entries in the aggregated histograms are not independent from each other. Luckily, we can avoid sitting down to come up with a new mathematically sound statistical test. Meet the permutation test.

Say you have a sample of metric M for users of version A and a sample of metric M for users of version B. You measure a difference d between the means of the samples. Now you assume there is no difference between A and B, randomly shuffle entries between the two samples, and compute the difference of the means again. You do this again, and again, and again… What you end up with is a distribution D of the differences of the means over all the reshuffled samples. Now you compute the probability of getting the original difference d, or a more extreme value, by chance, and welcome our newborn hypothesis test!
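
In code the idea fits in a dozen lines. The sketch below runs a Monte-Carlo permutation test on the difference of two means with made-up samples; it is a plain version of the idea, not the histogram variant discussed next:

import numpy as np

def permutation_test_means(a, b, num_permutations=10000):
    # Monte-Carlo permutation test for the difference of two means.
    a, b = np.asarray(a), np.asarray(b)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(num_permutations):
        # Under the null hypothesis the labels are exchangeable, so we
        # shuffle the pooled values and split them into two new groups.
        np.random.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    # Fraction of shuffles at least as extreme as the observed difference.
    return count / num_permutations

# Example with made-up samples for versions A and B.
np.random.seed(0)
a = np.random.normal(5.0, 1.0, size=50)
b = np.random.normal(5.4, 1.0, size=50)
print(permutation_test_means(a, b))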

Going back to our original problem of comparing aggregated histograms for the experiment and control groups: instead of means we have aggregated histograms, and instead of computing the difference of the means we compute a distance between the histograms; everything else remains the same as in the previous example:


import numpy as np
import pandas as pd

def mc_permutation_test(xs, ys, num):
    # xs and ys are pandas Series whose entries are per-user histograms
    # (arrays of bucket counts); histogram_distance is defined elsewhere.
    n, k = len(xs), 0

    # Aggregate each group's histograms and compute the observed distance.
    h1 = xs.sum()
    h2 = ys.sum()
    diff = histogram_distance(h1, h2)

    # Pool the two samples: under the null hypothesis the labels are exchangeable.
    zs = pd.concat([xs, ys])
    zs.index = np.arange(0, len(zs))

    for j in range(num):
        # Shuffle the pooled histograms and split them into two groups
        # of the original sizes.
        zs = zs.reindex(np.random.permutation(zs.index))
        h1 = zs.iloc[:n].sum()
        h2 = zs.iloc[n:].sum()
        # Count the shuffles whose distance exceeds the observed one.
        k += diff < histogram_distance(h1, h2)

    # Fraction of shuffles at least as extreme as the observed distance.
    return k / num
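
For completeness, here is a hypothetical way of calling it, where each element of the input Series is a numpy array of bucket counts and histogram_distance is stubbed out with a simple L1 distance on normalised frequencies; the real distance function lives elsewhere in the actual code:

import numpy as np
import pandas as pd

def histogram_distance(h1, h2):
    # Stub: L1 distance between the normalised bucket frequencies.
    return np.abs(h1 / h1.sum() - h2 / h2.sum()).sum()

# Each element is one user's histogram, i.e. an array of bucket counts.
np.random.seed(0)
xs = pd.Series([np.random.multinomial(100, [0.2, 0.3, 0.3, 0.2]) for _ in range(200)])
ys = pd.Series([np.random.multinomial(100, [0.2, 0.3, 0.3, 0.2]) for _ in range(200)])

# Both groups come from the same distribution here, so the returned
# fraction should not be particularly small.
print(mc_permutation_test(xs, ys, 1000))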

Most statistical tests were created at a time when there were no [fast] computers around, but nowadays churning through a Monte-Carlo permutation test is not a big deal and runs in a perfectly reasonable amount of time.