A/B tests are a simple way to determine the effect of a change in a software product against a baseline, i.e. version A against version B. An A/B test is essentially an experiment that randomly assigns either the control or the experiment condition to each user. It’s an extremely effective method to ascertain causality, which is hard, at best, to infer with statistical methods alone. Telemetry comes with its own A/B test implementation, Telemetry Experiments.
Depending on the type of data collected and the question asked, different statistical techniques are used to determine whether there is a difference between the experiment and control versions:
- Does the rate of success of X differ between the two versions?
- Does the average value of Y differ between the two versions?
- Does the average time to event Z differ between the two versions?
Those are just some of the most common questions, each addressed by a different statistical method.
The frequentist statistical hypothesis testing framework is based on a conceptually simple idea: assuming that we live in a world where a certain baseline hypothesis (the null hypothesis) is valid, what’s the probability of obtaining results at least as extreme as the ones we observed? If the probability is very low, i.e. under a certain threshold, we gain confidence that the effect we are seeing is genuine.
To give you a concrete example, say I have reason to believe that the average battery duration of my new phone is 5 hours but the manufacturer claims it’s 5.5 hours. If we assume the battery indeed has an average duration of 5.5 hours (the null hypothesis), what’s the probability of measuring an average duration that is 30 minutes lower, or worse? If that probability is small enough, say under 5%, we “reject” the null hypothesis. Note that there are many things that can go wrong with this framework and one has to be careful in interpreting the results.
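The battery example can be sketched with a one-sample t-test; the measurements below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical battery-life measurements (hours) for my phone.
measurements = np.array([5.1, 4.8, 5.3, 4.9, 5.0, 5.2, 4.7, 5.1, 4.9, 5.0])

# Null hypothesis: the true mean duration is 5.5 hours, as the
# manufacturer claims.
t_stat, p_value = stats.ttest_1samp(measurements, popmean=5.5)

# If the p-value falls under our 5% threshold, we reject the null
# hypothesis that the battery really lasts 5.5 hours on average.
reject_null = p_value < 0.05
```

With these invented numbers the sample mean is around 5 hours, so the test rejects the manufacturer's claim; with noisier or fewer measurements it might not.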
Telemetry histograms are a different beast though. Each user submits their own histogram for a certain metric; the histograms are then aggregated across all users for version A and version B. How do you determine if there is a real difference or if what you are looking at is just due to noise? A chi-squared test would seem the most natural choice, but on second thought its assumptions are not met, as entries in the aggregated histograms are not independent of each other. Luckily, we don’t have to sit down and come up with a new mathematically sound statistical test. Meet the permutation test.
Say you have a sample of a metric for users of version A and a sample of the same metric for users of version B. You measure a certain difference between the means of the two samples. Now you assume there is no difference between A and B, randomly shuffle entries between the two samples and compute the difference of the means again. You do this again, and again, and again… What you end up with is a distribution of the differences of the means over all the reshuffled samples. Now you compute the probability of getting the original difference, or a more extreme one, by chance, and welcome our newborn hypothesis test!
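The procedure just described can be sketched in a few lines of Python; the two samples below are synthetic, drawn only to have something to shuffle:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical samples of a metric for users of version A and version B.
a = rng.normal(loc=10.0, scale=2.0, size=200)
b = rng.normal(loc=10.5, scale=2.0, size=200)

# The observed difference between the means of the two samples.
observed = abs(a.mean() - b.mean())

# Under the null hypothesis the group labels don't matter, so we pool
# the samples and repeatedly reassign entries to the two groups at random.
pooled = np.concatenate([a, b])
n, num, count = len(a), 10000, 0
for _ in range(num):
    rng.shuffle(pooled)
    shuffled_diff = abs(pooled[:n].mean() - pooled[n:].mean())
    count += observed <= shuffled_diff

# Fraction of reshuffles that produced a difference at least as extreme
# as the observed one: the estimated p-value.
p_value = count / num
```

The more reshuffles, the more precise the p-value estimate; the cost is just CPU time.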
Going back to our original problem of comparing aggregated histograms for the experiment and control group: instead of means we have aggregated histograms, and instead of computing the difference of the means we compute the distance between the histograms; everything else remains the same as in the previous example:
```python
import numpy as np
import pandas as pd

def mc_permutation_test(xs, ys, num):
    # xs, ys: per-user histograms for version A and version B.
    n, k = len(xs), 0
    h1 = xs.sum()  # aggregated histogram for A
    h2 = ys.sum()  # aggregated histogram for B
    diff = histogram_distance(h1, h2)  # observed distance
    zs = pd.concat([xs, ys])
    zs.index = np.arange(0, len(zs))
    for j in range(num):
        # Shuffle users between the two groups and re-aggregate.
        zs = zs.reindex(np.random.permutation(zs.index))
        h1 = zs[:n].sum()
        h2 = zs[n:].sum()
        k += diff < histogram_distance(h1, h2)
    return float(k) / num  # estimated p-value
```
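The snippet above leaves `histogram_distance` unspecified. One reasonable choice, assuming the aggregated histograms are arrays of counts, is the total variation distance between the normalized histograms; this is an illustrative assumption, not necessarily the distance used in the original code:

```python
import numpy as np

def histogram_distance(h1, h2):
    # Normalize the count histograms into probability distributions,
    # then take the total variation distance between them.
    p = h1 / h1.sum()
    q = h2 / h2.sum()
    return 0.5 * np.abs(p - q).sum()
```

Any sensible distance between histograms works here, as long as the same one is used for the observed and reshuffled aggregates.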
Most statistical tests were created at a time when there were no [fast] computers around, but nowadays churning through a Monte Carlo permutation test is not a big deal and one can easily run such a test in a reasonable time.