Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. Along the way we will develop a simple recommender engine for Firefox add-ons based on Telemetry data and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of a hypothesis H, given that we have observed data D and have some knowledge of the way the data was generated. In mathematical form:

P(H|D) = \frac{P(D|H)*P(H)}{P(D)}

where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see the data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely it is that our hypothesis H is true in the first place.

That’s all there is to it. Check out my last post for a basic application of the theorem, which explains why many research findings based purely on statistical inference later turn out to be false.

Logic

According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

P(A=F|B=F) = 1 - P(A=T|B=F)

= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1

where

P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.
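To see this numerically, here is a quick Python sketch; the prior P(A=T) and the probability P(B=F|A=F) below are arbitrary illustrative values, since the deduction holds for any choice of them.

```python
# Numerically check the deduction above: "if A then B" forces P(B=F|A=T) = 0,
# and Bayes' theorem then gives P(A=F|B=F) = 1 regardless of the priors.
p_a_true = 0.3       # arbitrary prior P(A=T)
p_bf_given_at = 0.0  # P(B=F|A=T), forced to 0 by "if A is true then B is true"
p_bf_given_af = 0.6  # P(B=F|A=F), arbitrary (any value > 0 works)

numerator = p_bf_given_af * (1 - p_a_true)
denominator = p_bf_given_at * p_a_true + numerator
print(numerator / denominator)  # 1.0 -> observing B=F lets us conclude A=F
```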

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds, so we don’t collect this data unless you explicitly grant us permission to do so.

We might be interested in answering the following question: given that a user has the add-ons A_{1}, ..., A_{k} , what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that have A_{1}, ..., A_{k} but not B. So let’s use Bayes’ theorem to derive our probabilities. What we are after is the following probability:

P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}   eq. 1

which corresponds to the ratio between the number of Telemetry submissions that contain A_{1}, ..., A_{k} and B and the number of submissions that contain A_{1}, ..., A_{k} (see the short sketch below). Note that if

B \in \{A_{1}, ..., A_{k}\}

then obviously eq. 1 is going to have a value of 1 and it isn’t of interest, so we assume that

B \not\in \{A_{1}, ..., A_{k}\}
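To make the counting interpretation of eq. 1 concrete, here is a minimal sketch; the submission sets and add-on names below are made up purely for illustration, the real data comes from Telemetry.

```python
# Toy Telemetry data: each submission is the set of add-on IDs it reports.
submissions = [
    {"adblock", "firebug"},
    {"adblock", "firebug", "noscript"},
    {"adblock", "greasemonkey"},
    {"firebug", "noscript"},
]

def p_b_given_addons(b, addons, submissions):
    """Estimate eq. 1: the fraction of submissions containing all of
    `addons` that also contain `b`."""
    matching = [s for s in submissions if addons <= s]
    if not matching:
        return 0.0
    return sum(1 for s in matching if b in s) / len(matching)

print(p_b_given_addons("noscript", {"adblock", "firebug"}, submissions))  # 0.5
```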

Since the denominator is constant for any B, we don’t really care about it.

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)   eq. 2

Say we want to be smart and precompute all possible probabilities in order to avoid doing the work on each request. That’s going to be prohibitively expensive though, considering that the number of possible subsets of N add-ons grows exponentially with N.

If we make the naive assumption that the add-ons are conditionally independent given B, i.e. that

P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)

we can rewrite eq. 2 like so

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)

Awesome, this means that once we calculate, for each add-on B, the probabilities

P(A_{i}|B)

and the prior P(B), i.e. on the order of N^2 values in total, where N is the total number of add-ons in the universe, we have everything we need to calculate eq. 2 for any candidate B with just k multiplications.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.
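Here is a minimal sketch of that procedure, reusing the toy representation of submissions as sets of add-on IDs from above (the add-on names are again made up); the real engine works off aggregated Telemetry counts, but the scoring is the same naive Bayes product of eq. 2.

```python
from collections import Counter

# Toy data: each Telemetry submission is the set of add-on IDs it reports.
submissions = [
    {"adblock", "firebug"},
    {"adblock", "firebug", "noscript"},
    {"adblock", "greasemonkey"},
    {"firebug", "noscript"},
]

def recommend(user_addons, submissions):
    """Return the add-on B maximizing P(B) * prod_i P(A_i|B), i.e. eq. 2."""
    n = len(submissions)
    addon_counts = Counter(a for s in submissions for a in s)
    best_addon, best_score = None, 0.0
    for b, count_b in addon_counts.items():
        if b in user_addons:
            continue  # only suggest add-ons the user doesn't already have
        score = count_b / n  # the prior P(B)
        for a in user_addons:
            # P(A_i|B): fraction of submissions with B that also contain A_i
            joint = sum(1 for s in submissions if b in s and a in s)
            score *= joint / count_b
        if score > best_score:
            best_addon, best_score = b, score
    return best_addon

print(recommend({"adblock", "firebug"}, submissions))  # "noscript"
```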

Keep in mind that I didn’t spend time cleaning the dataset of uninteresting add-ons like the Norton Toolbar, etc. The reason those add-ons show up as suggestions is simply that a very large fraction of the population has them installed.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

y(t) = \alpha sin(2\pi \omega t + \varphi)

where \alpha is the amplitude, \omega is the frequency and \varphi is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise W_i :

Y_i = A sin(2\pi \Omega t_i + \Phi) + W_i

where Y_i , A , \Omega , \Phi and W_i are random variables and

W_i \sim \mathcal{N}(0, \sigma^2)

i.e. we assume the variance of the noise is the same for all samples.

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuum of possible hypotheses, one for each combination of amplitude, frequency, phase and noise variance.

f_{A,\Omega,\Phi,\Sigma|Y}(\alpha, \omega, \varphi, \sigma|y) = \frac{f_{Y|A,\Omega,\Phi,\Sigma}(y|\alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi,\Sigma}(\alpha, \omega, \varphi, \sigma)}{\iiiint f_{Y|A,\Omega,\Phi,\Sigma}(y|\alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi,\Sigma}(\alpha, \omega, \varphi, \sigma)\, d\alpha\, d\omega\, d\varphi\, d\sigma}

where the functions f are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that A , \Omega , \Phi and \Sigma are independent then

f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)

where f_{A} , f_{\Omega} , f_{\Phi} and f_{\Sigma} are our priors, and since we don’t know much about them, let’s assume that each one of them can be modeled as a Uniform distribution.

Assuming the measurements are independent from each other, the likelihood function for the vector of measurements y is

f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)

Now we just have to specify the likelihood function for a single measurement, and since we have an idea of how the data was generated, we can do that easily:

f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha sin(2\pi \omega t_i + \varphi), \sigma^2)

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the Python library pymc. As you can see from the IPython notebook, it’s extremely easy to write and computationally evaluate a probabilistic model.
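In case you don’t want to open the notebook, here is a condensed sketch of what such a model can look like with pymc (the PyMC 2 API); the true signal parameters, prior bounds and sampler settings below are arbitrary illustrative choices, not necessarily the ones used in the notebook.

```python
import numpy as np
import pymc

# Simulate noisy measurements; the amplitude, frequency, phase and noise level
# below are made-up values for illustration.
t = np.linspace(0, 1, 200)
data = 2.0 * np.sin(2 * np.pi * 3.0 * t + 0.5) + np.random.normal(0, 0.5, t.size)

# Uniform priors over amplitude, frequency, phase and noise standard deviation.
amplitude = pymc.Uniform('amplitude', 0, 10)
frequency = pymc.Uniform('frequency', 0, 10)
phase = pymc.Uniform('phase', 0, 2 * np.pi)
sigma = pymc.Uniform('sigma', 0.01, 5)

# The clean sinusoid as a deterministic function of the unknown parameters.
@pymc.deterministic
def signal(a=amplitude, w=frequency, p=phase):
    return a * np.sin(2 * np.pi * w * t + p)

# Precision of the Gaussian noise, derived from its standard deviation.
@pymc.deterministic
def tau(s=sigma):
    return 1.0 / s ** 2

# Likelihood: each observed sample is the clean signal plus Gaussian noise.
y = pymc.Normal('y', mu=signal, tau=tau, value=data, observed=True)

# Sample the posterior with MCMC and look at the parameter estimates.
model = pymc.MCMC([amplitude, frequency, phase, sigma, signal, tau, y])
model.sample(iter=20000, burn=5000)
print(amplitude.stats()['mean'], frequency.stats()['mean'], phase.stats()['mean'])
```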

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better-suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.
