Bayesian Statistics 101: Linear Regression

In this second part of the series we are going to have a look at linear regression. Say we have some variable x that we would like to use to predict another variable y. We can model the relationship as a linear one, and since we can’t perfectly predict one variable from the other, we also need an error term, which we model with a normal distribution: y = \beta_0 + \beta_1 x + \epsilon, where \epsilon \sim \mathcal{N}(0, \sigma).
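
To make the generative story concrete, here is a minimal sketch of how one could draw data directly from this equation; the parameter values are arbitrary, chosen only for illustration:

# A minimal sketch of the generative model; parameter values are illustrative only
import numpy

beta0_true, beta1_true, sigma_true = -90.0, 0.9, 5.0  # assumed values, not estimated from data
x = numpy.random.normal(175.7, 7.3, 100)              # e.g. heights in cm
y = beta0_true + beta1_true*x + numpy.random.normal(0, sigma_true, 100)  # corresponding weights in kg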

So far so good, but let’s now look at a more concrete example where we try to predict the weight of a person from their height. Instead of collecting real data, we can simulate some. This usually comes in handy to verify the model’s behaviour before the actual data is used.

# Let's simulate some data of the height and weight of a population
import numpy

height_mu = 175.7
height_sd = 7.3
weight_mu = 72.6
weight_sd = 15.9
rho = 0.42
mean = [height_mu, weight_mu]
cov = [[height_sd**2, rho*height_sd*weight_sd],
       [rho*height_sd*weight_sd, weight_sd**2]]
men = numpy.random.multivariate_normal(mean, cov, 100)

In the code above we generate 100 pairs of height and weight, expressed respectively in cm and kg, using a bivariate normal distribution. This dataset will be the input data for our model.
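
Before modelling, a quick sanity check that the sample looks like what we asked for doesn’t hurt; for instance, the empirical moments and correlation should be close to the parameters chosen above:

# Sanity check: the empirical moments should be close to the chosen parameters
print men[:,0].mean(), men[:,0].std()          # roughly 175.7 and 7.3
print men[:,1].mean(), men[:,1].std()          # roughly 72.6 and 15.9
print numpy.corrcoef(men[:,0], men[:,1])[0,1]  # roughly 0.42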

Now we can express the model as a linear regression as shown in the following snippet:

# Let's model the problem
import pymc
from pymc import deterministic

# Vague priors for the intercept, the slope and the precision of the noise
beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

# The expected weight is a linear function of the height
@deterministic
def mu(beta0=beta0, beta1=beta1):
    return beta0 + men[:,0]*beta1

# The observed weights are modelled as normally distributed around mu
likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau, value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau, mu, likelihood])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)

pymc.Matplot.plot(mcmc)

[Figure: trace and autocorrelation plots of the regression parameters]

As the trace plot of the slope \beta_1 shows, the sampler fails to converge due to high autocorrelation. What’s going on? It turns out that because \beta_0 and \beta_1 are highly correlated, the Metropolis sampler gets “stuck” in the narrow, highly correlated region of the parameter space. There are several options to mitigate the issue:

  • using an adaptive Metropolis sampler (see the sketch after this list),
  • increasing the standard deviation of the proposal distribution,
  • using an approximation method on the model before running the MCMC sampler to speed up convergence, e.g., the maximum a posteriori (MAP) estimate can be obtained using numerical optimization and then used as the initial values of the stochastic variables,
  • thinning the traces, i.e. keeping only every nth sample to reduce the autocorrelation (also shown in the sketch below),
  • increasing the burn-in, i.e. throwing away more samples at the beginning of the trace, which are likely not representative of the posterior distribution,
  • standardizing the input, i.e. rescaling the data relative to its mean and standard deviation: z_i = (x_i - \mu_x)/\sigma_x.
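
As a rough sketch of the first and fourth options, PyMC lets us assign an adaptive Metropolis step method to the correlated parameters and thin the chain while sampling; the thinning factor below is just an illustrative choice:

# Sketch: adaptive Metropolis on the correlated parameters plus thinning
mcmc = pymc.MCMC(model)
mcmc.use_step_method(pymc.AdaptiveMetropolis, [beta0, beta1])
mcmc.sample(iter=40000, burn=10000, thin=10)  # keep only every 10th sample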

In our case setting the initial values of the stochastic variables to the MAP estimates and standardizing the data do a good job.

Let’s standardize the data first:

# Standardize the input data to reduce autocorrelation
mean_height = men[:,0].mean()
mean_weight = men[:,1].mean()
sd_height = men[:,0].std()
sd_weight = men[:,1].std()
men[:,0] = (men[:,0] - mean_height)/sd_height
men[:,1] = (men[:,1] - mean_weight)/sd_weight

And now we can train our model using the standardized data and the MAP estimates as initial values:

beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

@deterministic
def mu(beta0=beta0, beta1=beta1):
    return beta0 + men[:,0]*beta1

likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau, value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau, mu, likelihood])
map_ = pymc.MAP(model)
map_.fit()
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)

pymc.Matplot.plot(mcmc)

The convergence of the posterior traces looks much better now, e.g. for \beta_1:
[Figure: trace plot of \beta_1 after standardization and MAP initialization]
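
If we prefer a number to a visual impression, a quick way to quantify the improvement is to compute the lag-1 autocorrelation of the \beta_1 trace directly with numpy; values close to 0 indicate a well-mixed chain:

# Rough mixing check: lag-1 autocorrelation of the beta1 trace
trace = mcmc.trace('beta1')[:]
lag1 = numpy.corrcoef(trace[:-1], trace[1:])[0,1]
print "beta1 lag-1 autocorrelation: " + str(lag1)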
The only thing left to do is to “un-standardize” the posterior distributions so that the intercept is expressed in kg and the slope in kg/cm.
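The conversion follows from substituting the standardized variables back into the regression equation: if z_y = \beta_0 + \beta_1 z_x with z_y = (y - \mu_y)/\sigma_y and z_x = (x - \mu_x)/\sigma_x, solving for y gives y = (\mu_y + \sigma_y\beta_0 - \sigma_y\beta_1\mu_x/\sigma_x) + (\sigma_y\beta_1/\sigma_x) x. In other words, the unstandardized intercept is \mu_y + \sigma_y\beta_0 - \sigma_y\beta_1\mu_x/\sigma_x and the unstandardized slope is \sigma_y\beta_1/\sigma_x, which is exactly what the following snippet computes: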

beta0_samples = mcmc.trace('beta0')[:]
beta1_samples = mcmc.trace('beta1')[:]
beta0_unstd = beta0_samples*sd_weight + mean_weight - beta1_samples*sd_weight*mean_height/sd_height
beta1_unstd = beta1_samples*sd_weight/sd_height 

print "Unstandardized beta0 HDI: " + ",".join(map(str, pymc.utils.hpd(beta0_unstd , 1.- 0.95)))
print "Unstandardized beta1 HDI: " + ",".join(map(str, pymc.utils.hpd(beta1_unstd , 1.- 0.95)))

which yields:

Unstandardized beta0 HDI: -0.187313759236,0.194719066091
Unstandardized beta1 HDI: 0.0833251532747,0.471270731396

Since the highest density interval for \beta_1 doesn’t include 0, we can state that there is indeed a linear relation between height and weight, and we can use one to predict the other.
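
As a closing sketch, the unstandardized posterior samples can be used directly to obtain a posterior distribution for the expected weight of a new person; the height of 180 cm below is just an arbitrary example value:

# Posterior distribution of the expected weight of a hypothetical 180 cm person
new_height = 180.0
predicted_weight = beta0_unstd + beta1_unstd*new_height
print "Predicted weight HDI for 180 cm: " + ",".join(map(str, pymc.utils.hpd(predicted_weight, 1. - 0.95)))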
