This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. During the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.
tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.
Often the problem we are trying to solve can be reduced to determining the probability of an hypothesis H, given that we have observed data D and have some knowledge of the way the data was generated. In mathematical form:
Where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely our hypothesis H is true in the first place.
That’s all there is, check out my last post to see a basic application of the theorem to find out why many science research findings based purely on statistical inference turn out later to be false.
According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:
So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.
Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds so we don’t collect data if you don’t grant us explicitly permission to do so.
We might be interested in answering the following question: given that a user has the add-ons , what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t possess B but have . So let’s use Bayes theorem to derive our probabilities. What we are after is the following probability:
which corresponds to the ratio between Telemetry submissions that contain and the submissions that contain all those add-ons but B. Note that if
then obviously eq. 1 is going have a value of 1 and it isn’t of interest, so we assume that
Since the denominator is constant for any B, we don’t really care about it.
Say we want to be smart and precompute all possible probabilities in order to avoid the work for each request. It’s going to be pretty slow considering that the number of possible subsets of N add-ons is exponential.
If we make the naive assumptions that the add-ons are conditionally independent, i.e. that
we can rewrite eq. 2 like so
Awesome, this means that once we calculate the probabilities
where is the total number of add-ons in the universe, we have everything we need to calculate eq. 2 with just multiplications.
If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.
Keep in mind that I didn’t spend time cleaning up the dataset from non interesting add-ons like Norton Toolbar, etc. The reason those add-ons show up is simply because a very large fraction of the population has them.
Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form
where is the amplitude, is the frequency and is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise :
where , , , and are random variables and
i.e. we assume the variance of the noise is the same for all samples.
Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuous number of possible hypothesis, one for each combination of the amplitude, frequency, phase and variance.
where the functions are probability density functions.
We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that , , and are independent then
where , , and are our priors and since we don’t know much about them let’s assume that each one of them can be modeled as an Uniform distribution.
Assuming the measurements are independent from each other, the likelihood function for the vector of measurements is
Now we just have to specify our likelihood function for a single measurement, but since we have an idea of how the data was generated, we can do that easily
To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the python library pymc. As you can see from IPython notebook, it’s extremely easy to write and evaluate computationally a probabilistic model.
We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.