In this second part of the series we are going to have a look at linear regression. Say we have some variable x that we would like to use to predict another variable y. We could model the relationship as a linear one and, since we can't perfectly predict one variable from another without considering the error, we can model the noise with a normal distribution: y = β₀ + β₁x + ε, where ε ∼ N(0, σ²).
So far so good, but let's now look at a more concrete example where we try to predict the weight of a person from their height. Instead of collecting real data, we can simulate some. That usually comes in handy to verify the predictions of the model before the actual real data is used.
# Let's simulate some data of the height and weight of a population
import numpy

height_mu = 175.7
height_sd = 7.3
weight_mu = 72.6
weight_sd = 15.9
rho = 0.42

mean = [height_mu, weight_mu]
cov = [[height_sd**2, rho*height_sd*weight_sd],
       [rho*height_sd*weight_sd, weight_sd**2]]
men = numpy.random.multivariate_normal(mean, cov, 100)
In the code above we generate 100 (height, weight) pairs, expressed respectively in cm and kg, using a bivariate normal distribution. This dataset will represent the input data for our model.
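Before feeding the sample to the model, it's worth checking that the simulated data actually has the statistics we asked for. A minimal sanity-check sketch (the larger sample size and the fixed seed here are just for the check, not part of the model):

```python
import numpy

# Regenerate the sample with a fixed seed and many more draws, then
# confirm the empirical statistics sit close to the chosen parameters.
numpy.random.seed(42)
height_mu, height_sd = 175.7, 7.3
weight_mu, weight_sd = 72.6, 15.9
rho = 0.42

mean = [height_mu, weight_mu]
cov = [[height_sd**2, rho*height_sd*weight_sd],
       [rho*height_sd*weight_sd, weight_sd**2]]
men = numpy.random.multivariate_normal(mean, cov, 10000)

# Sample mean of the heights and sample correlation between the columns.
sample_mean = men[:, 0].mean()                          # close to 175.7
sample_rho = numpy.corrcoef(men[:, 0], men[:, 1])[0, 1]  # close to 0.42
```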
Now we can express the model as a linear regression as shown in the following snippet:
# Let's model the problem
import pymc
from pymc import deterministic

beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

@deterministic
def mu(beta0=beta0, beta1=beta1):
    return beta0 + men[:,0]*beta1

likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau,
                                       value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)
pymc.Matplot.plot(mcmc)
As the trace plot of the slope shows, it fails to converge due to high autocorrelation. What's going on? It turns out that because β₀ and β₁ are highly correlated, the Metropolis sampler gets "stuck" in the tightly correlated region of the parameter space. There are several options to reduce the issue:
- using an adaptive metropolis sampler,
- increasing the standard deviation of the proposal distribution,
- using an approximation method on the model before running the MCMC sampler to speed up convergence, e.g., the maximum a posteriori (MAP) estimate can be obtained using numerical optimization, then used as the initial values for the stochastic variables,
- thinning the traces, i.e. throwing away every nth sample to reduce the autocorrelation,
- increasing the burn-in, i.e. throwing away more samples at the beginning of the trace which are likely not going to be representative of the posterior distribution,
- standardizing the input, i.e. rescaling the data relative to the mean and standard deviation: x' = (x − μ)/σ.
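To see why standardizing (in particular, centering) helps, here is an illustrative numpy-only experiment, separate from the PyMC model above and with made-up coefficients: fitting ordinary least squares on many simulated datasets shows that on raw x the intercept and slope estimates are almost perfectly anti-correlated, while centering x removes that correlation.

```python
import numpy

numpy.random.seed(0)
raw, centered = [], []
for _ in range(2000):
    # Heights on the original scale are far from 0, so the intercept
    # (the value at x = 0) is a wild extrapolation.
    x = numpy.random.normal(175.7, 7.3, 100)
    y = -50 + 0.7 * x + numpy.random.normal(0, 10, 100)
    raw.append(numpy.polyfit(x, y, 1))              # [slope, intercept]
    centered.append(numpy.polyfit(x - x.mean(), y, 1))
raw, centered = numpy.array(raw), numpy.array(centered)

# Correlation between the slope and intercept estimates across fits.
corr_raw = numpy.corrcoef(raw[:, 0], raw[:, 1])[0, 1]            # close to -1
corr_centered = numpy.corrcoef(centered[:, 0], centered[:, 1])[0, 1]  # close to 0
```

A Metropolis sampler exploring the narrow diagonal ridge produced by the raw-scale correlation mixes very slowly, which is exactly the autocorrelation seen in the trace plot.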
In our case setting the initial values of the stochastic variables to the MAP estimates and standardizing the data do a good job.
Let’s standardize the data first:
# Standardize the input data to reduce autocorrelation
mean_height = men[:,0].mean()
mean_weight = men[:,1].mean()
sd_height = men[:,0].std()
sd_weight = men[:,1].std()

men[:,0] = (men[:,0] - mean_height)/sd_height
men[:,1] = (men[:,1] - mean_weight)/sd_weight
And now we can train our model using the standardized data and the MAP estimates as initial values:
beta0 = pymc.distributions.Normal("beta0", mu=0, tau=1e-12)
beta1 = pymc.distributions.Normal("beta1", mu=0, tau=1e-12)
tau = pymc.distributions.Gamma("tau", alpha=0.001, beta=0.001)

@deterministic
def mu(beta0=beta0, beta1=beta1):
    return beta0 + men[:,0]*beta1

likelihood = pymc.distributions.Normal("likelihood", mu=mu, tau=tau,
                                       value=men[:,1], observed=True)

model = pymc.Model([beta0, beta1, tau])
map_ = pymc.MAP(model)
map_.fit()
mcmc = pymc.MCMC(model)
mcmc.sample(iter=40000, burn=10000)
pymc.Matplot.plot(mcmc)
The convergence of the posterior traces looks much better now, e.g. for β₁:
The only thing left to do is to "un-standardize" the posterior distributions, so that the intercept is expressed in kg and the slope in kg/cm:
beta0_samples = mcmc.trace('beta0')[:]
beta1_samples = mcmc.trace('beta1')[:]

beta0_unstd = beta0_samples*sd_weight + mean_weight - \
              beta1_samples*sd_weight*mean_height/sd_height
beta1_unstd = beta1_samples*sd_weight/sd_height

print "Unstandardized beta0 HDI: " + ",".join(map(str, pymc.utils.hpd(beta0_unstd, 1. - 0.95)))
print "Unstandardized beta1 HDI: " + ",".join(map(str, pymc.utils.hpd(beta1_unstd, 1. - 0.95)))
Unstandardized beta0 HDI: -0.187313759236,0.194719066091
Unstandardized beta1 HDI: 0.0833251532747,0.471270731396
Since the Highest Density Interval of β₁ doesn't include 0, we can state that there is indeed a linear relation between weight and height and we can use one to predict the other.
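The un-standardization formulas above follow from substituting the standardized variables back into the linear model: if y' = b₀ + b₁x' with x' = (x − μₓ)/σₓ and y' = (y − μ_y)/σ_y, expanding and collecting terms gives the expressions used for beta0_unstd and beta1_unstd. A quick numeric check of the algebra, with arbitrary made-up coefficients rather than posterior samples:

```python
import numpy

mean_height, sd_height = 175.7, 7.3
mean_weight, sd_weight = 72.6, 15.9
b0, b1 = 0.1, 0.4  # arbitrary coefficients on the standardized scale

x = numpy.linspace(150, 200, 11)           # heights in cm
z = (x - mean_height) / sd_height          # standardized heights
y = (b0 + b1 * z) * sd_weight + mean_weight  # predict, then map back to kg

# The same prediction straight from the un-standardized coefficients:
beta0 = b0 * sd_weight + mean_weight - b1 * sd_weight * mean_height / sd_height
beta1 = b1 * sd_weight / sd_height
match = numpy.allclose(y, beta0 + beta1 * x)  # the two routes agree
```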