You have just read the latest research finding about how snails’ sleep patterns can predict earthquakes. Even though this seems extremely unlikely, it must be true, since the scientists claim to have found a statistically significant relation. But is there really a relation at all?

The real reason why frequentist methods are difficult to truly grasp lies ultimately in the fact that they give you back a conditional probability of the data you have, given a certain hypothesis. For instance, in frequentist hypothesis testing you pose a question like the following: “Assuming a certain hypothesis, what’s the probability of seeing the data I have?”. But what you would really like to know is the opposite: “Given that I have seen some data, what’s the probability of my hypothesis?”

But so what, you may ask. Is the difference really that important? It turns out it can be, in certain scenarios. Suppose you have a tool that performs static analysis of your codebase, correctly identifying a bug in a buggy statement with probability 80%, and correctly reporting no bug in a non-buggy statement with probability 95% (where 100% − 95% = 5% is the so-called significance level). So we happily run the analyzer on a small module of our codebase and it returns 6 bugs. After further inspection, though, it turns out that 5 of them weren’t really bugs, but just “false positives”. In other words, given that the system identified a bug, the probability of it being a real bug is below 20%. How is it possible that we have so many false positives compared to the true positives?

We forgot to take into account the so-called “prior”, which is the probability of having a bug in the first place. If that probability is very low, then most of our detected bugs will be false positives. Taken to the extreme, imagine having a prior probability of 0: then all the bugs your analyzer finds are going to be false positives, and you would end up with an error rate of 100% among your detections!

How can we rigorously define a prior probability? We really can’t, and here is where the subtle difference between the frequentist and Bayesian points of view lies: while the former gives up, the latter pragmatically assigns a “degree of belief” to the probability of having a bug in the first place. For instance, you could assign a probability of 1% to a module that you know is very unlikely to contain bugs, since it has been in maintenance mode since 1980 and most issues have already been fixed.
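The scenario above can be checked with a quick simulation. This is a minimal sketch with the numbers assumed so far: the analyzer flags a real bug with probability 80%, raises a false alarm on clean code with probability 5%, and 1% of statements are actually buggy.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

N = 1_000_000
true_pos = false_pos = 0
for _ in range(N):
    buggy = random.random() < 0.01  # the 1% prior: is this statement buggy?
    # The analyzer flags it with prob. 0.80 if buggy, 0.05 (false alarm) if not.
    flagged = (random.random() < 0.80) if buggy else (random.random() < 0.05)
    if flagged:
        if buggy:
            true_pos += 1
        else:
            false_pos += 1

print(f"flagged: {true_pos + false_pos}, real bugs among them: {true_pos}")
print(f"P(bug | flagged) ≈ {true_pos / (true_pos + false_pos):.3f}")
```

Even with a million statements, only roughly one in seven flagged statements is a real bug: the low prior swamps the analyzer’s decent accuracy.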

We have now defined three probabilities: the probability of correctly classifying a bug as such, P(C|B) = 80%; the probability of correctly classifying a non-bug as such, P(¬C|¬B) = 95%; and the probability of having a bug in the first place, P(B) = 1%. We can use Bayes’ theorem to derive the probability of the code having a bug given that the analyzer told us so, P(B|C):
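Writing C for “the analyzer claims a bug” and B for “there is a bug”, and noting that the false-alarm probability is P(C|¬B) = 1 − 0.95 = 0.05, Bayes’ theorem gives:

```latex
P(B \mid C)
  = \frac{P(C \mid B)\,P(B)}{P(C \mid B)\,P(B) + P(C \mid \neg B)\,P(\neg B)}
  = \frac{0.80 \cdot 0.01}{0.80 \cdot 0.01 + 0.05 \cdot 0.99}
  = \frac{0.008}{0.0575}
  \approx 0.139
```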

This means that even though we have a small significance level of 5%, the probability that we actually have a bug, given that the analyzer is telling us so, is only about 14%!
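The 14% figure is a three-line computation, using only the numbers defined above:

```python
# Direct computation of P(B | C) via Bayes' theorem.
p_c_given_b = 0.80      # analyzer flags a real bug
p_c_given_not_b = 0.05  # false alarm on clean code (1 - 0.95)
p_b = 0.01              # prior probability of a bug

p_b_given_c = (p_c_given_b * p_b) / (
    p_c_given_b * p_b + p_c_given_not_b * (1 - p_b)
)
print(f"{p_b_given_c:.3f}")  # → 0.139
```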

What does this all mean for the research industry? It means that significance levels and p-values do not necessarily tell us much about the probability we really care about: the probability of our hypothesis given the data.

Consider the strip below, where scientists are investigating a link between jelly beans and acne. Say we know that the prior is practically 0, so there is no link at all. If the scientists repeat the test many times with different samples, eventually they will find one that is statistically significant and exhibits a p-value below the 5% level. And there you have it: if you try hard enough with many different samples, eventually you will find a statistically significant link between any two quantities.
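We can sketch the jelly-bean effect in a few lines. The setup here is an assumption for illustration: each “experiment” flips a fair coin 1000 times (so the null hypothesis is true by construction) and tests whether the coin is biased with a two-sided normal-approximation z-test. Running 20 such experiments, one per jelly-bean colour, we expect about one of them to come out “significant” at the 5% level purely by chance.

```python
import math
import random

random.seed(1)  # fixed seed for reproducibility

def p_value(heads, n):
    """Two-sided p-value for observing `heads` out of `n` fair coin flips,
    using the normal approximation to the binomial."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

trials = 20  # one experiment per jelly-bean colour
significant = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(1000))
    if p_value(heads, 1000) < 0.05:
        significant += 1

print(f"{significant} of {trials} null experiments came out 'significant'")
```

Every one of those “discoveries” is a false positive, since the coin is fair by construction; this is exactly the multiple-comparisons trap the strip lampoons.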

(xkcd, not Dilbert)