# Bayesian Statistics 101: BANOVA

In this part we are going to explore the Bayesian equivalent of ANOVA, i.e. BANOVA. ANOVA is a statistical test used to ascertain if the means of several groups are equal, and therefore generalizes the t-test to more than two groups. In fact, when there are only two groups to compare, the t-statistics is equivalent to the square of the F-statistic, where the latter is ratio of the variance between group means over the variance within the groups.

For simplicity, we will explore the Bayesian counterpart of ANOVA using just two groups but it’s important to realize that the method can be generalized to an arbitrary number of them.

Let’s formulate the problem. What we are trying to do here is to predict a dependent numerical variable given an independent nominal one with 2 levels (groups). If the levels of the independent variable truly influence the dependent one, then we expect one or more means of the groups to differ significantly from each other. In other words, it’s just another form of linear regression.

Let’s rephrase the model mathematically by dummy coding the nominal predictor and using the mean value of the predicted variable as baseline:

$\mu_i = \beta_0 + \sum\limits_{j} \beta_j x_{ji}$ (1)

with $\sum\limits_{j=1} \beta_j = 0$ (2)

$\beta_0$ is the mean value of the predicted variable while $\beta_j$ is the quantity by which the mean of the predicted variable changes for the level $j$. Equation (2) is needed to constrain the baseline so that the sum of all deviations from the mean is zero. A careful reader might have noticed that the way we defined the dummy variables, i.e. using the sum-to-zero constraint (STZ), is different from the one commonly seen in the wild known as corner constraints (CR). In (CR) $\beta_o$ represents the mean of the reference group while in (STZ) it represents the overall mean of the effect in all groups.

Let’s start by simulating some data about 2 hypothetical height distributions. Simulating a real phenomena comes in handy to verify that the model does actually what it’s supposed to do.

sd = 5.3
n = 100
numpy.random.seed(42)
grp1 = numpy.random.normal(177.5, sd, n)
grp2 = numpy.random.normal(181.8, sd, n)
data = numpy.concatenate((grp1, grp2))

origin_mean = data.mean()
origin_std = data.std(ddof=1)
data = (data - origin_mean)/origin_std # standardize
grp = n*[0] + n*[1]


As you can see we are making the assumption that both groups have the same variance. That’s not always the case and the beauty of probabilistic programming is that models
can easily be adapted to the process that best describes the data. Standardizing the data helps with specifying general, unit independent, priors:

alpha_sd = pymc.distributions.Gamma("alpha_sd", alpha=1.01005, beta=0.1005) # mode=0.1, sd=10.0

alpha0 = pymc.distributions.Normal("alpha0", mu=0, tau=0.001) # y values are assumed to be standardized
alpha = pymc.distributions.Normal("alpha", mu=0, tau=1/alpha_sd**2, size=len(numpy.bincount(grp)))

sigma = pymc.distributions.Uniform("sigma", 0, 10) # y values are assumed to be standardized
likelihood = pymc.distributions.Normal("likelihood", mu=alpha0 + alpha[grp], tau=1/sigma**2, value=data, observed=True)

model = pymc.Model([alpha0, alpha, sigma, alpha_sd])

map_ = pymc.MAP(model)
map_.fit()

thin = 750
burn = 10000
steps = 5000*thin + burn

mcmc = pymc.MCMC(model)
mcmc.sample(iter=steps, burn=burn, thin=thin)

pymc.Matplot.plot(mcmc)


Note that are coefficients are prefixed with alpha instead of beta and that Equation (2) is nowhere to be found.  The reason for the former follows from the latter. To get the constrained beta values we have to mutate the alpha values according to the constraints of Equation (2) and finally unstandardize the coefficients as shown in the snippet below.

Also, notice that the variance, or precision, of the alpha coefficients depend on the same hyperprior (i.e. prior of a prior) alpha_sd (line 4). The hyperprior acts as regularizer to keep estimates from overfitting the data. The regularizer comes into play when many coefficients have to be estimated; if many groups have a small deflection from the overall mean, then the variance estimated by the hyperprior will be small which in turn will shrink the estimates of the coefficients.

Finally, notice that we had to use extensive thinning to battle the effects of autocorrelation.

alpha_samples = mcmc.trace("alpha")[:]
alpha0 = mcmc.trace("alpha0")[:]
alpha1 = alpha_samples[:, 0]
alpha2 = alpha_samples[:, 1]

# Convert to zero-centered b-values
m1 = alpha0 + alpha1
m2 = alpha0 + alpha2
merged = numpy.vstack((m1, m2)).T

b0 = [x.mean() for x in merged]
b1 = m1 - b0
b2 = m2 - b0

# Unstandardize
b0 = b0*origin_std + origin_mean
b1 = b1*origin_std
b2 = b2*origin_std


Once we have the estimates of our coefficients, we can easily perform contrasts between them without worrying about multiple comparison corrections as in ANOVA.

# Constrast between b1 and b2
plt.hist(b2 - b1, 50, normed=1, facecolor='g', alpha=0.75)
hpd = pymc.utils.hpd(b2 - b1, 1. - 0.95)
plt.axvline(x=hpd[0], linewidth=2, color='r')
plt.axvline(x=hpd[1], linewidth=2, color='r')



As the plot shows, we are confident that the two groups differ and the result does indeed reflect the simulated data.

We are just barely scratching the surface here. Some great books to learn more about BANOVA and Bayesian statistics are Bayesian Data Analysis and Doing Bayesian Data Analysis: A Tutorial with R and BUGS .

# Correlating Firefox add-ons to slow shutdown times

This is a follow-up to my earlier findings about add-ons and startup times. In this post I am going to dig deeper between the relations of add-ons and shutdown times. A slow shutdown doesn’t seem to be a big deal. If one considers though that a new instance of Firefox can’t be launched if the old one is still shutting down, the issue becomes more serious.

It turns out that for shutdown times a simple linear model is not good enough while a log-linear instead has a reasonably good performance. Log transforming the shutdown times slighly complicates the meaning of the coefficients as they have to be interpreted as the percentage change of the average shutdown time. E.g. if an add-on has a coefficient of 100%, it means that it might (correlation is not causation!) slow down shutdown by 2 times. The idea of using a log-linear model comes from our contributors Jeremy Atia and Martin Gubri [1], which discovered the relationship during a preliminary analysis.

Unlike in the startup case, there are fewer stronger relationships here. Some patterns start to emerge though, the Yandex add-on for instance seems to be associated with both slower startup and shutdown timings.

We started to keep track of those results on a weekly basis through a couple iacomus dashboard: one for startup and the other for shutdown times correlations. The dashboards are surely not going to win any design award but they get the job done and didn’t require any effort to setup. I am confident that by spotting consistently ill-behaved add-ons through the time-series we should be able to spot real tangible offenders.

[1] If you love probability, statistics and machine learning and are looking for an open-source project to contribute to, Firefox is a cool place to start! Get in touch with me if that sounds interesting.

# Bayesian Statistics 101: Linear Regression

In this second part of the series we are going to have a look at linear regression. Say we have some variable $x$ that we would like to use to predict another variable $y$. We could model the relationship as a linear one and since we can’t perfectly predict a variable from another one without considering the error, we can model the noise with a normal distribution:  $y = \beta_0 + \beta_1*x + \epsilon$, where $\epsilon = \mathcal{N}(0, \sigma)$.

So far so good but let’s have a look now at a more concrete example where we try to predict the weight of a person from its height. Instead of collecting some real data, we can simulate some. That usually comes in handy to verify the prediction of the model before the actual real data is used.

# Let's simulate some data of the height and weight of a population
height_mu = 175.7
height_sd = 7.3
weight_mu = 72.6
weight_sd = 15.9
rho = 0.42
mean = [height_mu, weight_mu]
cov = [[height_sd**2, rho*height_sd*weight_sd],
[rho*height_sd*weight_sd, weight_sd**2]]
men = numpy.random.multivariate_normal(mean, cov, 100)


In the code above we generate 100 tuples of weight and height expressed respectively in kg and cm using a bivariate normal. This dataset will represent the input data for our model.

Now we can express the model as a linear regression as shown in the following snippet:

# Let's model the problem</pre>
beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

@deterministic
def mu(beta0=beta0, beta1=beta1):
return beta0 + men[:,0]*beta1

likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau, value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)

pymc.Matplot.plot(mcmc)


As the trace plot of the slope $\beta_1$ shows, it fails to converge due to high autocorrelation. What’s going on? It turns out that because $\beta_0$ and $\beta_1$ are highly correlated, the Metropolis sampler gets “stuck” in the tightly correlated space. There are several options to reduce the issue:

• using an adaptive metropolis sampler,
• increasing the standard deviation of the proposal distribution,
• using an approximation method on the model before running the MCMC sampler to speed up convergence, e.g., the maximum a posteriori (MAP) estimate can be obtained using numerical optimization, then used as the initial values for the stochastic variables,
• thinning the traces, i.e. throwing away every nth sample to reduce the autocorrelation,
• increasing the burn-in, i.e. throwing away more samples at the beginning of the trace which are likely not going to be representative of the posterior distribution,
• standardizing the input, i.e. rescaling the data relative to the mean and standard deviation: $z_i = (x_i - \mu_x)/\sigma_x$.

In our case setting the initial values of the stochastic variables to the MAP estimates and standardizing the data do a good job.

Let’s standardize the data first:

# Standardize the input data to reduce autocorrelation
mean_height = men[:,0].mean()
mean_weight = men[:,1].mean()
sd_height = men[:,0].std()
sd_weight = men[:,1].std()
men[:,0] = (men[:,0] - mean_height)/sd_height
men[:,1] = (men[:,1] - mean_weight)/sd_weight


And now we can train our model using the standardized data and the MAP estimates as initial values:

beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

@deterministic
def mu(beta0=beta0, beta1=beta1):
return beta0 + men[:,0]*beta1

likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau, value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau])
map_ = pymc.MAP(model)
map_.fit()
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)

pymc.Matplot.plot(mcmc)


The convergence of the posterior traces look much better now, e.g. for $\beta_1$:

The only thing left to do is to “un-standardize” the posterior distributions to express the values in weight and weight/height for the intercept and slope respectively:

beta0_samples = mcmc.trace('beta0')[:]
beta1_samples = mcmc.trace('beta1')[:]
beta0_unstd = beta0_samples*sd_weight + mean_weight - beta1_samples*sd_weight*mean_height/sd_height
beta1_unstd = beta1_samples*sd_weight/sd_height

print "Unstandardized beta0 HDI: " + ",".join(map(str, pymc.utils.hpd(beta0_unstd , 1.- 0.95)))
print "Unstandardized beta1 HDI: " + ",".join(map(str, pymc.utils.hpd(beta1_unstd , 1.- 0.95)))


which yields:

Unstandardized beta0 HDI: -0.187313759236,0.194719066091
Unstandardized beta1 HDI: 0.0833251532747,0.471270731396


Since the Highest Density Interval doesn’t include 0 for $\beta_1$, we can state that there is indeed a linear relation between weight and height and we can use one to predict the other.

# Correlating Firefox add-ons to performance bottlenecks

Update: I run the analysis on more data as some add-ons had very few entries with extreme outliers that were skewing the results; I also considered more add-ons.

I started looking into exploiting our Telemetry data to determine which add-ons are causing performance issues with Firefox. So far there are three metrics that I plan to correlate with add-ons:

• startup time,
• shutdown time,
• background hangs.

In this post I am going over my findings for the first scenario, i.e. the relation between startup time and installed add-ons.

In an ideal world, all add-ons would have an uniform way to initialize themselves which could be instrumented. Unfortunately that’s not possible, many add-ons use asynchronous facilities and or rely on observer notifications for initialization. In other words, there is no good way to easily measure the initialization time for all add-ons without possibly touching their codebases individually.

This is the sort of problem that screams for a multi-way ANOVA but, after some thought and data exploration, it turns out that the interaction terms can be dropped between add-ons, i.e. the relation between add-ons and the startup time can be modeled as a pure additive one. Since a multi-way ANOVA is equivalent to a linear regression between a set of predictors and their interactions, the problem can be modeled with a generalized linear model where for each Telemetry submission the add-on map is represented as a boolean vector of dummy variables that can assume a value of 0 or 1 corresponding to “add-on installed” and “add-on not installed”, respectively.

Startup time depends on many other factors that are not taken into account in the model, like current system load and hard drive parameters. This means that it would be very surprising, to say the least, if one could predict the startup time without those variables. That doesn’t mean that we can’t explain part of the variance! In fact, after training the model on the data collected during the past month, it yielded a $R^2$ score of about 0.15, which in other words means that we can explain about 15% of the variance. Again, as we are not trying to predict the startup time accurately this is not necessarily a bad result. The F ratio, which relates the variance between add-ons to the variance within add-ons, is significant which remarks that having or not certain add-ons installed does influence the startup time.

Many of the p-values of the predictor’s coefficients are highly significant (<< 0.001); it’s just a matter of sorting the significant results by their effect size to determine the add-ons that cause a notable slowdown of Firefox during startup:

The horizontal axis measures the startup time overhead with respect to the average startup time of Firefox. For instance, Yandex Elements seems to be slowing down startup by about 8 seconds on average. The error-bars represent the standard errors of the sampling distributions of the coefficients.

Note that the model is based on a very small fraction of our user-base, i.e. the subset that has Telemetry enabled, so there clearly is some implicit bias. The picture might be different for a truly random sample of our users, nevertheless it is an indication of where to start digging deeper.

The next step is to “dashboardify” the whole thing and contact the developers of the various add-ons. We are also considering notifying users, in a yet to be determined way, when the browser detects add-ons that are known to cause performance issues.

# Bayesian Statistics 101: Inference of population proportions

This is the first part of a collection of snippets written in python and pymc that show the Bayesian counterparts of classical Frequentist methods typically taught in a first statistics class.

Inference on a population proportion
In the following snippet we estimate the distribution of the probability of success $p$ for a Binomial Random Variable, given a sequence of number of successes for $N$ trials. An uninformative Beta distribution is used as prior for the unknown parameter $p$ whereas the Binomial distribution represents the likelihood, in Bayesian parlance.

n = 8
data = [4, 5, 6, 5, 5, 4, 3]

p = Beta("p", alpha=1, beta=1)
y = Binomial("y", n=n, p=p, value=data, observed=True)

model = Model([y, p])
mcmc = MCMC(model)
mcmc.sample(40000, 10000, 1)

pm.Matplot.plot(mcmc)


Note that the dotted lines in the graph on the right side delimit the 95% credible interval. Stackexchange has a wonderful explanation on the difference between a confidence interval and a credible interval.

Inference on the difference of two population proportions

The difference of two population proportions comes in handy when we want to ascertain if a proportion is, statistically speaking, different from another. If the credible interval doesn’t contain 0, then we can make the claim that the proportions are different.

n1 = 8
data1 = [4, 5, 6, 5, 5, 4, 3]

n2 = 20
data2 = [16, 15, 17, 11, 18, 16, 18]

p1 = Beta("p1", alpha=1, beta=1)
y1 = Binomial("y1", n=n1, p=p1, value=data1, observed=True)

p2 = Beta("p2", alpha=1, beta=1)
y2 = Binomial("y2", n=n2, p=p2, value=data2, observed=True)

@deterministic
def diff(p1=p1, p2=p2):
return p2 - p1

model = pm.Model([p1, y1, p2, y2, diff])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)

pm.Matplot.plot(mcmc)


The Bayesian priors regularize the model. In the previous examples we used a set of uninformative priors. Say we had, from the past literature, a strong belief that the proportions are equal and very close to 0.5. We can express that belief with e.g. $Beta(500, 500)$.

...
p1 = Beta("p1", alpha=500, beta=500)
y1 = Binomial("y1", n=n1, p=p1, value=data1, observed=True)

p2 = Beta("p2", alpha=500, beta=500)
y2 = Binomial("y2", n=n2, p=p2, value=data2, observed=True)
...


In this case we can’t make an argument that the proportions are different because of our strong priors. If we would collect more data and feed it to our model, eventually the evidence would win over the prior. Also note that the variance of the deterministic variable diff has drastically been reduced, this is another side effect due to the use of strong priors.

It’s also interesting to have a look at the posterior of $p2$. We started with a prior with a peak at 0.5. As one can see in the graph below, the evidence changed only slightly our belief from 0.5 to ~0.54: