Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. Along the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of a hypothesis H, given that we have observed data D and have some knowledge of how the data was generated. In mathematical form:

P(H|D) = \frac{P(D|H)*P(H)}{P(D)}

Where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely it is that our hypothesis H is true in the first place.
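As a minimal sketch in Python (the numbers are made up purely for illustration), the theorem is a one-liner:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D)."""
    return likelihood * prior / evidence

# e.g. with P(D|H) = 0.9, P(H) = 0.1 and P(D) = 0.25 we get P(H|D) = 0.36
print(posterior(0.9, 0.1, 0.25))
```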

That’s all there is to it. Check out my last post, included below, to see a basic application of the theorem: why many research findings based purely on statistical inference turn out later to be false.

Logic

According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

P(A=F|B=F) = 1 - P(A=T|B=F)

= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1

where

P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.
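As a sketch of the limiting case, the snippet below evaluates the expansion above in Python; the specific values of P(B=T|A=F) and of the prior P(A=T) are arbitrary, since the result holds for any of them:

```python
def p_not_a_given_not_b(p_b_given_a, p_b_given_not_a, p_a):
    """P(A=F|B=F), computed with Bayes' theorem as above."""
    p_notb_given_a = 1 - p_b_given_a         # P(B=F|A=T)
    p_notb_given_nota = 1 - p_b_given_not_a  # P(B=F|A=F)
    p_nota = 1 - p_a                         # P(A=F)
    p_a_given_notb = (p_notb_given_a * p_a) / (
        p_notb_given_a * p_a + p_notb_given_nota * p_nota)
    return 1 - p_a_given_notb

# "if A then B" means P(B=T|A=T) = 1; the answer is 1 regardless of the rest
print(p_not_a_given_not_b(p_b_given_a=1.0, p_b_given_not_a=0.5, p_a=0.3))  # 1.0
```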

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds, so we don’t collect data unless you explicitly grant us permission to do so.

We might be interested in answering the following question: given that a user has the add-ons A_{1}, ..., A_{k} , what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t have B but do have A_{1}, ..., A_{k} . So let’s use Bayes’ theorem to derive our probabilities. What we are after is the following probability:

P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}   eq. 1

which corresponds to the ratio between the number of Telemetry submissions that contain A_{1}, ..., A_{k}, B and the number of submissions that contain A_{1}, ..., A_{k} , with or without B. Note that if

B \in \{A_{1}, ..., A_{k}\}

then obviously eq. 1 is going to have a value of 1 and isn’t of interest, so we assume that

B \not\in \{A_{1}, ..., A_{k}\}

Since the denominator is constant for any B, we don’t really care about it.

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)   eq. 2

Say we want to be smart and precompute all those probabilities in order to avoid the work for each request. That isn’t going to fly: the number of possible subsets of N add-ons is exponential (there are 2^N of them).

If we make the naive assumption that the add-ons are conditionally independent given B, i.e. that

P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)

we can rewrite eq. 2 like so

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)

Awesome, this means that once we calculate the N^2 conditional probabilities

P(A_{i}|B)

for every pair of add-ons, along with the N priors P(B), where N is the total number of add-ons in the universe, we have everything we need to evaluate eq. 2 with just k multiplications.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.
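For reference, here is a minimal sketch of the same idea in Python; the toy submissions and variable names are mine and stand in for real Telemetry data:

```python
from collections import Counter
from itertools import combinations

# toy stand-in for real Telemetry submissions: one set of add-on IDs each
submissions = [
    {"adblock", "noscript"},
    {"adblock", "noscript", "ghostery"},
    {"adblock", "ghostery"},
    {"noscript", "ghostery"},
]

n = len(submissions)
count_b = Counter()   # how many submissions contain add-on B
count_ab = Counter()  # how many submissions contain both A and B
for s in submissions:
    for b in s:
        count_b[b] += 1
    for a, b in combinations(sorted(s), 2):
        count_ab[(a, b)] += 1
        count_ab[(b, a)] += 1

def p_a_given_b(a, b):
    """P(A|B): fraction of submissions with B that also have A."""
    return count_ab[(a, b)] / count_b[b]

def recommend(addons):
    """Return the add-on B maximizing P(A_1|B)...P(A_k|B) P(B)."""
    best, best_score = None, 0.0
    for b in count_b:
        if b in addons:
            continue
        score = count_b[b] / n  # the prior P(B)
        for a in addons:
            score *= p_a_given_b(a, b)
        if score > best_score:
            best, best_score = b, score
    return best

print(recommend({"adblock", "noscript"}))  # -> "ghostery"
```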

Keep in mind that I didn’t spend time cleaning the dataset of uninteresting add-ons like the Norton Toolbar, etc. The reason those add-ons show up is simply that a very large fraction of the population has them.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

y(t) = \alpha sin(2\pi \omega t + \varphi)

where \alpha is the amplitude, \omega is the frequency and \varphi is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise W_i :

Y_i = A sin(2\pi \Omega t_i + \Phi) + W_i

where Y_i , A , \Omega , \Phi and W_i are random variables and

W_i \sim \mathcal{N}(0, \sigma^2)

i.e. we assume the variance of the noise is the same for all samples.
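As a concrete sketch, this is what sampling the corrupted signal looks like under that model; the true parameter values and the number of samples are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(42)

# true (unknown to the receiver) signal parameters
alpha, omega, phi = 2.0, 4.0, 0.5  # amplitude, frequency, phase
sigma = 0.3                        # noise std, shared by all samples

t = np.linspace(0, 1, 100)                       # sampling instants t_i
y = alpha * np.sin(2 * np.pi * omega * t + phi)  # clean signal y(t_i)
y_obs = y + rng.normal(0, sigma, size=t.size)    # Y_i = y(t_i) + W_i
```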

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuum of possible hypotheses, one for each combination of amplitude, frequency, phase and noise variance:

f_{A ,\Omega,\Phi,\Sigma|Y}(\alpha , \omega, \varphi, \sigma|y) = \frac  {f_{Y|A , \Omega, \Phi, \Sigma}(y| \alpha , \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma)}  {\iiiint f_{Y|A, \Omega, \Phi, \Sigma}(y| \alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) d\alpha d\omega d\varphi d\sigma}

where the functions f are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that A , \Omega , \Phi and \Sigma are independent then

f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)

where f_{A} , f_{\Omega} , f_{\Phi} and f_{\Sigma} are our priors, and since we don’t know much about them, let’s assume that each one of them can be modeled as a uniform distribution.

Assuming the measurements are independent of each other, the likelihood function for the vector of measurements y is

f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)

Now we just have to specify the likelihood function for a single measurement; since we have an idea of how the data was generated, that’s easy:

f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha sin(2\pi \omega t_i + \varphi), \sigma^2)

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the Python library pymc. As you can see from the IPython notebook, it’s extremely easy to write and computationally evaluate a probabilistic model.
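The notebook was written against the pymc API of the time; a rough equivalent with today’s PyMC (the uniform prior bounds below are my assumptions, not the notebook’s) could look like this:

```python
import numpy as np
import pymc as pm

# noisy measurements from the channel (generated as in the snippet above)
rng = np.random.default_rng(42)
t = np.linspace(0, 1, 100)
y_obs = 2.0 * np.sin(2 * np.pi * 4.0 * t + 0.5) + rng.normal(0, 0.3, t.size)

with pm.Model():
    # uniform priors, one per unknown parameter (bounds are assumed)
    alpha = pm.Uniform("alpha", lower=0, upper=5)
    omega = pm.Uniform("omega", lower=0, upper=10)
    phi = pm.Uniform("phi", lower=0, upper=2 * np.pi)
    sigma = pm.Uniform("sigma", lower=0, upper=2)

    # likelihood: each Y_i is Normal(alpha sin(2 pi omega t_i + phi), sigma^2)
    mu = alpha * pm.math.sin(2 * np.pi * omega * t + phi)
    pm.Normal("Y", mu=mu, sigma=sigma, observed=y_obs)

    trace = pm.sample(2000, tune=1000)  # MCMC posterior samples

print(float(trace.posterior["alpha"].mean()))  # posterior mean of the amplitude
```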

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.

Why most research findings are false

You have just read the most recent research finding about how snails’ sleep patterns can predict earthquakes. Even though it seems extremely unlikely that this could be the case, it must be true, since the scientists claim to have found a statistically significant relation. But is there really a relation?

The real reason why frequentist methods are difficult to truly grasp lies ultimately in the fact that they give you back a conditional probability of the data you have, given a certain hypothesis. For instance, in frequentist hypothesis testing you pose a question like the following: “Assuming a certain hypothesis, what’s the probability of seeing the data I have?”. But what you would really like to know is the opposite: “Given that I have seen some data, what’s the probability of my hypothesis?”

But so what, you may ask. Is the difference really that important? It turns out it can be in certain scenarios. Suppose you have a tool that performs static analysis of your codebase and correctly identifies a bug in a buggy statement 80% of the time, while correctly reporting no bug in a non-buggy statement 95% of the time (where 100% – 95% = 5% is the so-called \alpha level). So we happily run the analyzer on a small module of our codebase and it reports 6 bugs. After further inspection, though, it turns out that 5 of them weren’t real bugs, but just “false positives”. In other words, given that the system identified a bug, the probability of it being a real bug is below 20%. How is it possible that we have so many false positives compared to true positives?
We forgot to take into consideration the so-called “prior”, which is the probability of having a bug in the first place. If that probability is very low, then most of our detected bugs will be false positives. Taken to the extreme, imagine a prior probability of 0: then all the bugs your analyzer finds are going to be false positives and you end up with an error rate of 100%!
How can we rigorously define a prior probability? We really can’t, and here lies the subtle difference between the frequentist and Bayesian points of view: while the former gives up, the latter pragmatically assigns a “degree of belief” to the probability of having a bug in the first place. For instance, you could assign a probability of 1% to a module that you know is very unlikely to contain bugs, since it has been in maintenance mode since 1980 and most issues have been solved.
We have now defined three probabilities: the probability of correctly classifying a bug as such, P(C|B) = 80%; the probability of correctly classifying a non-bug as such, P(¬C|¬B) = 95%; and the probability of having a bug in the first place, P(B) = 1%. We can use Bayes’ theorem to derive the probability of the code having a bug given that the analyzer told us so, P(B|C):

P(B|C) = \frac{P(C|B)*P(B)}{P(C|B)*P(B) + P(C|\neg B)P(\neg B)} \approx 14\%

This means that even though we have a small \alpha level of 5%, the probability that we actually have a bug, given that the analyzer is telling us so, is only about 14%!
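The arithmetic is easy to check in a few lines of Python:

```python
p_c_given_b = 0.80      # P(C|B): analyzer flags a real bug
p_c_given_not_b = 0.05  # P(C|¬B): false positive rate, the alpha level
p_b = 0.01              # P(B): prior probability of a bug

p_b_given_c = (p_c_given_b * p_b) / (
    p_c_given_b * p_b + p_c_given_not_b * (1 - p_b))
print(p_b_given_c)  # ~0.139, i.e. about 14%
```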

What does this all mean for the research industry? It means that the \alpha level and p-values don’t necessarily tell us much about the probability we really care about: the probability of our hypothesis given the data.

Consider the strip below, in which scientists investigate a link between jelly beans and acne. Say we know that the prior is practically 0, so there is no link at all. If the scientists repeat the test many times with different samples, eventually they will find one that is statistically significant and exhibits a p-value below the \alpha level of 5%. And there you have it: if you try hard enough with many different samples, eventually you will find one that exhibits a statistically significant link between any two quantities.

[Image: xkcd “Significant” (the jelly bean strip)]
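A quick simulation of this hunt for significance is easy to write; the sketch below runs twenty t-tests on pure noise (assuming scipy is available), and with an \alpha level of 5% the chance of at least one “significant” result is already 1 – 0.95^20 ≈ 64%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

significant = 0
trials = 20  # one test per jelly bean color
for _ in range(trials):
    acne_with = rng.normal(0, 1, 100)     # acne scores, jelly bean eaters
    acne_without = rng.normal(0, 1, 100)  # acne scores, control group
    _, p = stats.ttest_ind(acne_with, acne_without)
    if p < 0.05:
        significant += 1

# expect ~1 false positive out of 20 even though there is no effect at all
print(significant)
```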