# Differential Privacy for Dummies

Technology allows companies to collect more data and with more detail about their users than ever before. Sometimes that data is sold to third parties, other times it’s used to improve products and services.

In order to protect users’ privacy, anonymization techniques can be used to strip away any piece of personally identifiable data and let analysts access only what’s strictly necessary. As the Netflix competition in 2007 has shown though, that can go awry. The richness of data allows to identify users through a sometimes surprising combination of variables like the dates on which an individual watched certain movies. A simple join between an anonymized datasets and one of many publicly available, non-anonymized ones, can re-identify anonymized data.

Aggregated data is not much safer either under some circumstances! For example, say we have two summary statistics: one is the number of users, including Frank, that watch one movie per day and the other is the number of users, without Frank, that watch one movie per day. Then, by comparing the counts, we could tell if Frank watches one movie per day.

#### Differential Privacy to the rescue

Differential privacy formalizes the idea that a query should not reveal whether any one person is present in a dataset, much less what their data are. Imagine two otherwise identical datasets, one with your information in it, and one without it. Differential Privacy ensures that the probability that a query will produce a given result is nearly the same whether it’s conducted on the first or second dataset. The idea is that if an individual’s data doesn’t significantly affect the outcome of a query, then he might be OK in giving his information up as it is unlikely that the information would be tied back to him. The result of the query can damage an individual regardless of his presence in a dataset though. For example, if an analysis on a medical dataset finds a correlation between lung cancer and smoking, then the health insurance cost for a particular smoker might increase regardless of his presence in the study.

More formally, differential privacy requires that the probability of a query producing any given output changes by at most a multiplicative factor when a record (e.g. an individual) is added or removed from the input. The largest multiplicative factor quantifies the amount of privacy difference. This sounds harder than it actually is and the next sections will iterate on the concept with various examples, but first we need to define a few terms.

Dataset

We will think of a dataset $d$ as being a collection of records from an universe $U$. One way to represent a dataset $d$ is with a histogram in which each entry $d_i$ represents the number of elements in the dataset equal to $u_i \in U$. For example, say we collected data about coin flips of three individuals, then given the universe $U = \{head, tail\}$, our dataset $d$ would have two entries: $d_{head} = i$ and $d_{tail} = j$, where $i + j = 3$. Note that in reality a dataset is likely to be an ordered lists of rows (i.e. a table) but the former representation makes the math a tad easier.

Distance

Given the previous definition of dataset, we can define the distance between two datasets $x, y$ with the $l_1$ norm as:

$\displaystyle ||x - y||_1 = \sum\limits_{i=1}^{|U|} |x_i - y_i|$

Mechanism

A mechanism is an algorithm that takes as input a dataset and returns an output, so it can really be anything, like a number, a statistical model or some aggregate. Using the previous coin-flipping example, if mechanism $C$ counts the number of individuals in the dataset, then $C(d) = 3$. In practice though we will specifically considering randomized mechanisms, where the randomization is used to add privacy protection.

Differential Privacy

A mechanism $M$ satisfies $\epsilon$ differential privacy if for every pair of datasets $x, y$ such that $||x - y||_1 \leq 1$, and for every subset $S \subseteq \text{Range}(M)$:

$\displaystyle \frac{Pr[M(x) \in S]}{Pr[M(y) \in S]} \leq e^{\epsilon}$

What’s important to understand is that the previous statement is just a definition. The  definition  is  not  an  algorithm,  but  merely  a  condition that must be satisfied by a mechanism to claim that it satisfies $\epsilon$ differential privacy. Differential privacy allows researchers to use a common framework to study algorithms and compare their privacy guarantees.

Let’s check if our mechanism $C$ satisfies $1$ differential privacy. Can we find a counter-example for which:

$\displaystyle \frac{Pr[C(x) \in S]}{Pr[C(y) \in S]} \leq e$

is false? Given $x, y$ such that $||x - y||_1 = 1$ and $||x||_1 = k$, then:

$\displaystyle \frac{Pr[C(x) = k]}{Pr[C(y) = k]} \leq e$

i.e. $\frac{1}{0} \leq e$, which is clearly false, hence this proves that mechanism $C$ doesn’t satisfy $1$ differential privacy.

Composition theorems

A powerful property of differential privacy is that mechanisms can easily be composed. These require the key assumption that the mechanisms operate independently given the data.

Let $d$ be a dataset and $g$ an arbitrary function. Then, the sequential composition theorem asserts that if $M_I(d)$ is $\epsilon_i$ differentially private, then $M(d) = g(M_1(d), M_2(d), ..., M_N(d))$ is $\sum_{i=1}^{N}\epsilon_i$ differentially private. Intuitively this means that given an overall fixed privacy budget, the more mechanisms are applied to the same dataset, the more the available privacy budget for each individual mechanism will decrease.

The parallel composition theorem asserts that given $N$ partitions of a dataset $d$, if for an arbitrary partition $d_i$, $M_i(d_i)$ is $\epsilon$ differentially private, then $M(d) = g(M_1(d_1), M_2(d_2), ..., M_N(d_N))$ is $\epsilon$ differentially private. In other words, if a set of $\epsilon$ differentially private mechanisms is applied to a set of disjoint subsets of a dataset, then the combined mechanism is still $\epsilon$ differentially private.

#### The randomized response mechanism

The first mechanism we will look into is “randomized response”, a technique developed in the sixties by social scientists to collect data about embarrassing or illegal behavior. The study participants have to answer a yes-no question in secret using the following mechanism $M_R(d, \alpha, \beta)$:

1. Flip a biased coin with probability of heads $\alpha$;
2. If heads, then answer truthfully with $d$;
3. If tails, flip a coin with probability of heads $\beta$ and answer “yes” for heads and “no” for tails.

In code:

def randomized_response_mechanism(d, alpha, beta):
if random() < alpha:
return d
elif random() < beta:
return 1
else:
return 0


Privacy is guaranteed by the noise added to the answers. For example, when the question refers to some illegal activity, answering “yes” is not incriminating as the answer occurs with a non-negligible probability whether or not it reflects reality, assuming $\alpha$ and $\beta$ are tuned properly.

Let’s try to estimate the proportion $p$ of participants that have answered “yes”. Each participant can be modeled with a Bernoulli variable $X_i$ which takes a value of 0 for “no” and a value of 1 for “yes”. We know that:

$\displaystyle P(X_i = 1) = \alpha p + (1 - \alpha) \beta$

Solving for $p$ yields:

$\displaystyle p = \frac{P(X_i = 1) - (1 - \alpha) \beta}{\alpha}$

Given a sample of size $n$, we can estimate $P(X_i = 1)$ with $\frac{\sum_{i=1}^{i=n} X_i}{n}$.  Then, the estimate $\hat{p}$ of $p$ is:

$\displaystyle \hat{p} = \frac{\frac{\sum\limits_{i=1}^{i=n} X_i}{n} - (1 - \alpha) \beta}{\alpha}$

To determine how accurate our estimate is we will need to compute its standard deviation. Assuming the individual responses $X_i$ are independent, and using basic properties of the variance,

$\displaystyle \mathrm{Var}(\hat{p}) = \mathrm{Var}\biggl({\frac{\sum_{i=1}^{i=n} X_i}{n \alpha}}\biggr) = {\frac{\mathrm{Var}(X_i)}{n \alpha^2}}$

By taking the square root of the variance we can determine the standard deviation of $\hat{p}$. It follows that the standard deviation $s$ is proportional to $\frac{1}{\sqrt{n}}$, since the other factors are not dependent on the number of participants. Multiplying both $\hat{p}$ and $s$ by $n$ yields the estimate of the number of participants that answered “yes” and its relative accuracy expressed in number of participants, which is proportional to $\sqrt{n}$.

The next step is to determine the level of privacy that the randomized response method guarantees. Let’s pick an arbitrary participant. The dataset $d$ is represented with either 0 or 1 depending on whether the participant answered truthfully with a “no” or “yes”. Let’s call the two possible configurations of the dataset respectively $d_{no}$ and $d_{yes}$. We also know that $||d_i - d_j||_1 \leq 1$ for any $i, j$. All that’s left to do is to apply the definition of differential privacy to our randomized response mechanism $M_R(d, \alpha, \beta)$:

$\displaystyle \frac{Pr[M_R(d_{i}, \alpha, \beta) = k]}{Pr[M_R(d_{j}, \alpha, \beta) = k]} \leq e^{\epsilon}$

The definition of differential privacy applies to all possible configurations of $i, j \text{ and } k$, e.g.:

$\displaystyle \begin{gathered}\frac{Pr[M_R(d_{yes}, \alpha, \beta) = 1]}{Pr[M_R(d_{no}, \alpha, \beta) = 1]} \leq e^{\epsilon} \\\ln \left( {\frac{\alpha + (1 - \alpha)\beta}{1 - (\alpha + (1 - \alpha)\beta)}} \right) \leq \epsilon \\\ln \left( \frac{\alpha + \beta - \alpha \beta}{1 - (\alpha + \beta - \alpha \beta)} \right) \leq \epsilon\end{gathered}$

The privacy parameter $\epsilon$ can be tuned by varying $\alpha \text{ and } \beta$. For example, it can be shown that the randomized response mechanism with $\alpha = \frac{1}{2}$ and $\beta = \frac{1}{2}$ satisfies $\ln{3}$ differential privacy.

The proof applies to a dataset that contains only the data of a single participant, so how does this mechanism scale with multiple participants? It follows from the parallel composition theorem that the combination of $\epsilon$ differentially private mechanisms applied to the datasets of the individual participants is $\epsilon$ differentially private as well.

#### The Laplace mechanism

The Laplace mechanism is used to privatize a numeric query. For simplicity we are going to assume that we are only interested in counting queries $f$, i.e. queries that count individuals, hence we can make the assumption that adding or removing an individual will affect the result of the query by at most 1.

The way the Laplace mechanism works is by perturbing a counting query $f$ with noise distributed according to a Laplace distribution centered at 0 with scale $b = \frac{1}{\epsilon}$,

$\displaystyle Lap(x | b) = \frac{1}{2b}e^{- \frac{|x|}{b}}$

Then, the Laplace mechanism is defined as:

$\displaystyle M_L(x, f, \epsilon) = f(x) + Z$

where $Z$ is a random variable drawn from $Lap(\frac{1}{\epsilon})$.

In code:

def laplace_mechanism(data, f, eps):
return f(data) + laplace(0, 1.0/eps)


It can be shown that the mechanism preserves $\epsilon$ differential privacy. Given two datasets $x, y$ such that $||x - y||_1 \leq 1$ and a function $f$ which returns a real number from a dataset, let $p_x$ denote the probability density function of $M_L(x, f, \epsilon)$ and $p_y$ the probability density function of $M_L(y, f, \epsilon)$. Given an arbitrary real point $k$,

$\displaystyle \frac{p_x(k)}{p_y(k)} = \frac{e^{- \epsilon |f(x) - k|}}{e^{- \epsilon |f(y) - k|}} =$

$\displaystyle e^{\epsilon (|f(x) - k| - |f(y) - k|)} \leq$

$\displaystyle e^{\epsilon |f(x) - f(y)|}$

by the triangle inequality. Then,

$\displaystyle e^{\epsilon |f(x) - f(y)|} \leq e^\epsilon$

What about the accuracy of the Laplace mechanism? From the cumulative distribution function of the Laplace distribution it follows that if $Z \sim Lap(b)$, then $Pr[|Z| \ge t \times b] = e^{-t}$. Hence, let $k = M_L(x, f, \epsilon)$ and $\forall \delta \in (0, 1]$:

$\displaystyle Pr \left[|f(x) - k| \ge \ln{\left (\frac{1}{\delta} \right)} \times \frac{1}{\epsilon} \right] =$

$\displaystyle Pr \left[|Z| \ge \ln{\left (\frac{1}{\delta} \right)} \times \frac{1}{\epsilon} \right] = \delta$

where $Z \sim Lap(\frac{1}{\epsilon})$. The previous equation sets a probalistic bound to the accuracy of the Laplace mechanism that, unlike the randomized response, does not depend on the number of participants $n$.

#### Counting queries

The same query can be answered by different mechanisms with the same level of differential privacy. Not all mechanisms are born equally though; performance and accuracy have to be taken into account when deciding which mechanism to pick.

As a concrete example, let’s say there are $n$ individuals and we want to implement a query that counts how many possess a certain property $p$. Each individual can be represented with a Bernoulli random variable:

participants = binomial(1, p, n)


We will implement the query using both the randomized response mechanism $M_R(d, \frac{1}{2}, \frac{1}{2})$, which we know by now to satisfy $\ln{3}$ differential privacy, and the Laplace mechanism $M_L(d, f, \ln{3})$ which satisfies $\ln{3}$ differential privacy as well.

def randomized_response_count(data, alpha, beta):
randomized_data = randomized_response_mechanism(data, alpha, beta)
return len(data) * (randomized_data.mean() - (1 - alpha)*beta)/alpha

def laplace_count(data, eps):
return laplace_mechanism(data, np.sum, eps)

r = randomized_response_count(participants, 0.5, 0.5)
l = laplace_count(participants, log(3))


Note that while that while $M_R$ is applied to each individual response and later combined in a single result, i.e. the estimated count, $M_L$ is applied directly to the count, which is intuitively why $M_R$ is noisier than $M_L$. How much noisier?  We can easily simulate the distribution of the accuracy for both mechanisms with:

def randomized_response_accuracy_simulation(data, alpha, beta, n_samples=1000):
return [randomized_response_count(data, alpha, beta) - data.sum()
for _ in range(n_samples)]

def laplace_accuracy_simulation(data, eps, n_samples=1000):
return [laplace_count(data, eps) - data.sum()
for _ in range(n_samples)]

r_d = randomized_response_accuracy_simulation(participants, 0.5, 0.5)
l_d = laplace_accuracy_simulation(participants, log(3))


As mentioned earlier, the accuracy of $M_R$ grows with the square root of the number of participants:

while the accuracy of $M_L$ is a constant:

You might wonder why one would use the randomized response mechanism if it’s worse in terms of accuracy compared to the Laplace one. The thing about the Laplace mechanism is that the private data about the users has to be collected and stored, as the noise is applied to the aggregated data. So even with the best of intentions there is the remote possibility that an attacker might get access to it.  The randomized response mechanism though applies the noise directly to the individual responses of the users and so only the perturbed responses are collected! With the latter mechanism any individual’s information cannot be learned with certainty, but an aggregator can still infer population statistics.

That said, the choice of mechanism is ultimately a question of which entities to trust. In the medical world, one may trust the data collectors (e.g. researchers), but not the general community who will be accessing the data. Thus one collects the private data in the clear, but then derivatives of it are released on request with protections. However, in the online world, the user is generally looking to protect their data from the data collector itself, and so there is a need to prevent the data collector from ever accumulating the full dataset in the clear.

#### Real world use-cases

The algorithms presented in this post can be used to answer simple counting queries. There are many more mechanisms out there used to implement complex statistical procedures like machine learning models. The concept behind them is the same though: there is a certain function that needs to be computed over a dataset in a privacy preserving manner and noise is used to mask an individual’s original data values.

One such mechanism is RAPPOR, an approach pioneered by Google to collect frequencies of an arbitrary set of strings. The idea behind it is to collect vectors of bits from users where each bit is perturbed with the randomized response mechanism.  The bit-vector might represent a set of binary answers to a group of questions, a value from a known dictionary or, more interestingly, a generic string encoded through a Bloom filter. The bit-vectors are aggregated and the expected count for each bit is computed in a similar way as shown previously in this post. Then, a statistical model is fit to estimate the frequency of a candidate set of known strings. The main drawback with this approach is that it requires a known dictionary.

Later on the approach has been improved to infer the collected strings without the need of a known dictionary at the cost of accuracy and performance. To give you an idea, to estimate a distribution  over  an  unknown  dictionary  of  6-letter  strings without knowing  the  dictionary,  in  the  worst  case,  a sample size in  the  order  of  300  million  is required; the sample size grows quickly as the length of the strings increases. That said, the mechanism consistently finds the most frequent strings which enable to learn the dominant trends of a population.

Even though the theoretical frontier of differential privacy is expanding quickly there are only a handful implementations out there that, by ensuring  privacy without the need for a trusted third party like RAPPOR, suit well the kind of data collection schemes commonly used in the software industry.

References

# Data Analysis Review Checklist

Writing good code is hard, writing a good analysis is harder. Peer-review is an essential tool to fight repetitive errors, omissions and more generally divulge knowledge. I found the use of a checklist to be invaluable to help me remember the most important things I should watch out for during a review. It’s far too easy to focus on few details and ignore others which might be catched (or not) in a successive round.

I don’t religiously apply every bullet point of the following checklist to every analysis, nor is this list complete; more items would have to be added depending on the language, framework, libraries, models, etc. used.

• Is the question the analysis should answer clearly stated?
• Is the best/fastest dataset that can answer the question being used?
• Do the variables used measure the right thing (e.g. submission date vs activity date)?
• Is a representative sample being used?
• Are all data inputs checked (for the correct type, length, format, and range) and encoded?
• Do outliers need to be filtered or treated differently?
• Is seasonality being accounted for?
• Is sufficient data being used to answer the question?
• Are comparisons performed with hypotheses tests?
• Are estimates bounded with confidence intervals?
• Should the results be normalized?
• If any statistical method is being used, are the assumptions of the model met?
• Is correlation confused with causation?
• Does each plot communicate an important piece of information or address a question of interest?
• Are legends and axes labelled and do the they start from 0?
• Is the analysis easily reproducible?
• Does the code work, i.e. does it perform its intended function?
• Is there a more efficient way to solve the problem, assuming performance matters?
• Does the code read like prose?
• Does the code conform to the agreed coding conventions?
• Is there any redundant or duplicate code?
• Is the code as modular as possible?
• Can any global variables be replaced?
• Is there any commented out code and can it be removed?
• Is logging missing?
• Can any of the code be replaced with library functions?
• Can any debugging code be removed?
• Where third-party utilities are used, are returning errors being caught?
• Is any public API commented?
• Is any unusual behavior or edge-case handling described?
• Is there any incomplete code? If so, should it be removed or flagged with a suitable marker like ‘TODO’?
• Is the code easily testable?
• Do tests exist and do they actually test that the code is performing the intended functionality?

# What is it like to be a Data Engineer at Mozilla?

When people ask me what I do at Mozilla and I tell them I work with data, more often than not their reaction is something like “Wait, what? What kind of data could Mozilla possibly have?” But let’s start from the beginning of my journey. Even though I have worked on data systems since my first internship, I didn’t start my career at Mozilla officially as a Data Engineer. I was hired to track, investigate, and fix performance problems in the Firefox browser; an analyst of sorts, given my background in optimizing large scientific applications for the ATLAS experiment at the LHC.

The thing with applications for which you have some control over how they are used is that you can “just” profile your known use cases and go from there. Profiling a browser is an entirely different story. Every user might stress the browser in different ways and you can’t just sample stacks and indiscriminately collect data from your users without privacy and performance consequences. If that wasn’t enough, there are as many different hardware/software configurations as there are users. Even with the same configuration, a different driver version can be make a huge difference. This is where Telemetry comes in, an opt-in data acquisition system that collects general performance metrics like garbage collection timings and hardware information, like the GPU model, across our user-base. In case you are wondering what data I am talking about, why don’t you have a look? You are just one click away from about:telemetry, a page that shows the data that Telemetry collects.

Imagine receiving Terabytes of data with millions of payloads per day, each containing thousands of complex hierarchical measurements, and being tasked to make sense of that data, to find patterns in it, how would you do it? Step 1 is getting the data, how hard can it possibly be, right? In order to do that, I had to fire a script to launch an EC2 instance to which I had to ssh into and from which I could run a custom, non-distributed, MapReduce (MR) system written in Python. This was a couple of years ago; you might be wondering why anyone would use a custom MR system instead of something like Hadoop back then. It turns out that it was simple enough to support the use cases the performance team had at the time and when something went wrong it was easy to debug. Trust me, debugging a full-fledged distributed system can be painful.

The system was very simple, the data was partitioned on S3 according to a set of dimensions, like submission date, and the MR framework allowed the user to define the which values to use for each dimension. My first assignment was to get back the ratio of users that had a solid state disk. Sounds simple enough, yet it took me a while to get the data I wanted out of the system. The debug cycle was very time consuming; I had to edit my job, try to run it and if something went wrong start all over again. An analysis for a single day could take hours to complete and, as it’s often the case with data, if something that was supposed to be a string turned out be an integer because of data corruption, and I failed to handle that case, the whole thing exploded.

There weren’t many people running Telemetry analyses back then; it was just too painful. In time, as more metrics were added and I tried to answer more questions with data, I decided to step back for a moment and try something new. I played around with different systems like Pig, mrjob, and Spark. Though Spark has a rich API with far more operators than map & reduce, what ultimately sold me was the fact that it had a read–eval–print loop. If your last command failed, you could just retry it without having to rerun the whole job from the beginning! I initially used a Scala API to run my analyses but I soon started missing some of the Python libraries I grew used to for analysis, like pandas, scipy and matplotlib; the Scala counterpart just weren’t on par with it. I wanted to bring others on board and convincing someone to write analyses in a familiar language was a lot easier. So I built a tiny wrapper around Spark’s Python API which lets user run queries on our data from a Jupyter notebook and in time the number of people using Telemetry for their analytics needs grew.

I deliberately ignored many design problems that had to be solved. What storage format should be used? How should the data be partitioned? How should the cluster(s) be provisioned; EMR, ECS or something else entirely? How should recurrent jobs be scheduled?

Telemetry is one of the most complex data sources we have but there are many others, like the server logs from a variety of services. Others were solving similar problems as the ones I faced but with different approaches, so we joined forces. Fast-forward a couple of years and Telemetry has changed a lot. A Heka pipeline pre-processes and validates the Telemetry data before dumping it on S3. Scala jobs convert the raw JSON data in more manageable derived Parquet datasets that can then be read by Spark with Python, Scala or R. The scope of Telemetry has expanded to drive not only performance needs but business ones as well. Derived datasets can be queried with SQL through Presto and re:dash allows us to build and share dashboard among peers. Jobs can be scheduled and monitored with Airflow. Telemetry based analyses are being run at all level of the organization: no longer are few people responsible to answer every imaginable data related question: there are interns running statistically rigorous analyses on A/B experiments and Senior Managers building retention plots for their products. Analyses are being shared as Jupyter notebooks and peer-reviewed on Bugzilla. It’s every engineer’s responsibility to know how to measure how well its own feature or product is performing.

There is this popular meme about how Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. The truth is that there are only so many companies out there that can claim a user base with millions of users. This is one of the thing I love the most about Mozilla, we aren’t big as the giants but we aren’t small either. As an engineer, I get to solve challenging problems with the agility, flexibility and freedom of a startup. Our engineers work on interesting projects and there is a lot of room for trying new things, disrupting the status quo and pushing our comfort zone.

What’s the life of a Data engineer at Mozilla nowadays? How do we spend our time and what are our responsibilities?

#### Building the data pipeline

We build, maintain, monitor and deploy our data infrastructure and make sure that it works reliably and efficiently. We strive to use the smallest number of technologies that can satisfy our needs and work well in concert.

What kind of data format (e.g. Avro, Parquet) and compression codec (e.g. Snappy, LZO) should be used for the data? Should the data be stored on HDFS or S3? How should it be partitioned and bucketed? Where should the metadata (e.g. partition columns) be stored? Should the data be handled in real-time or in batch? Should views be recomputed incrementally or from scratch? Does summary data need to be accurate or can probabilistic data structures be used? Those are just some of the many decisions a data engineer is expected to take.

We contribute to internal projects. Heka for example, a stream processing system, is used in a multitude of ways: from loading and parsing log files to performing real time analysis, graphing, and anomaly detection.

At Mozilla, we love open source. When we built our stream processing system, we open sourced it. When different data systems like Spark or Presto need some custom glue to make them play ball, we extend them to fit our needs. Everything we do is out in the open; many of our discussions happen either on Bugzilla and on public mailing lists. Whatever we learn we share with the outside world.

#### Supporting other teams

Engineers, managers, and executives alike come to us with questions that they would like to answer with data but don’t know yet quite how to. We try to understand which are the best metrics that can answer their questions, assuming they can be answered in the first place. If the metrics required are missing then we add them to our pipeline. This could mean adding instrumentation to Firefox or, if the desired metrics can be derived from simpler ones that are available in the raw data, extending our derived datasets.

Once the data is queryable, we teach our users how to access it with e.g. Spark, Presto or Heka, depending on where the dataset is stored and what their requirements are. We learn from common patterns and inefficiencies by supporting our users’ analysis needs. For example, if a considerable amount queries require to group a dataset in a certain way, we might opt to build a pre-grouped version of the dataset to improve the performance.

We strive to review many of our users’ analyses. It’s far too easy to make mistakes when analyzing data since the final deliverables, like plots, can look reasonable even in the presence of subtle bugs.

#### Doing data science

We build alerting mechanisms that monitor our data to warn us when something is off. From representing time-series with hierarchical statistical models to using SAX for anomaly detection, the fun never stops.

We create dashboards for general use cases, like the display of daily aggregates partitioned by a set of dimensions or to track engagement ratios of products & features.

We validate hypotheses with A/B tests. For example, one of the main features we are going to ship this year, Electrolysis, requires a clear understanding of its performance impact before shipping it. We designed and ran A/B tests to compare with statistically sound methods a vast variety of performance metrics with and without Electrolysis enabled.

There is also no shortage of analyses that are more exploratory in nature. For example, we recently noticed a small spike in crashes whenever a new release is shipped. Is it a measurement artifact due to crashes that happened during the update? Or maybe there is something off that causes Firefox to crash after a new version is installed, and if so what is it? We don’t have an answer to this conundrum yet.

The simpler it is to get answers, the more questions will be asked.

This has been my motto for a while. Being data driven doesn’t mean having a siloed team of wizards and engineers that poke at data. Being data driven means that everyone in the company knows how to access the data they need in order to answer the questions they have; that’s the ultimate goal of a data engineering team. We are getting there but there is still a lot to do. If you have read so far, are passionate about data and have relevant skills, check out our jobs page.

# Measuring product engagment at scale

How engaged are our users for a certain segment of the population? How many users are actively using a new feature? One way to answer that question is to compute the  engagement ratio (ER) for that segment, which is defined as daily active users (DAU) over monthly active users (MAU), i.e.

$ER_{segment} = \frac{DAU_{segment}}{MAU_{segment}}$

Intuitively the closer the ratio is to 1, the higher the number of returning users. A segment can be an arbitrary combination of values across a set of dimensions, like operating system and activity date.

Ideally, given a set of $D$ dimensions, pre-computing the ER for all possible combinations of values of said dimensions would ensure that user queries run as fast as possible. Clearly that’s not very efficient though; if we assume for simplicity that every dimension has $k$ possible values, then there are $\sum_{d=1}^{D}\binom{D}{d} k^d$ ratios.

Is there a way around computing all of those ratios while still having an acceptable query latency? One could build a set of users for each value of each dimension and then at query time simply use set union and set intersection to compute any desired segment. Unfortunately that doesn’t work when you have millions of users as the storage complexity is proportional to the number of items stored in the sets. This is where probabilistic counting and the HyperLogLog sketch comes into play.

HyperLogLog

The HyperLogLog sketch can estimate cardinalities well beyond $10^9$ with a standard error of 2% while only using a couple of KB of memory.

The intuition behind is simple. Imagine a hash function that maps user identifiers to a string of N bits. A good hash function ensures that each bit has the same probability of being flipped. If that’s the case then the following is true:

• 50 % of hashes have the prefix 1
• 25 % of hashes have the prefix 01
• 12.5% of hashes the prefix 001

Intuitively, if 4 distinct values are hashed we would expect to see on average one single hash with a prefix of 01 while for 8 distinct values we would expect to see one hash with a prefix of 001 and so on. In other words, the cardinality of the original set can be estimated from the longest prefix of zeros of the hashed values. To reduce the variability of this single estimator, the average of K estimators can be used as the approximated cardinality and it can be shown that the standard error of a HLL sketch is $\frac{1.04}{sqrt(K)}$.

The detailed algorithm is documented in the original paper, and its practical implementation and variants are covered in depth by a 2013 paper from Google.

Set operations

One of the nice properties of HLL is that the union of two HLL sketches is very simple to compute as the union of a single estimator with another estimator is just the maximum of the two estimators, i.e. the longest prefix of zeros. This property makes the algorithm trivially parallelizable which makes it well suited for map-reduce style computations.

What about set intersection? There are two ways to compute that for e.g. two sets:

• Using the inclusion-exclusion principle: $|A \cap B| = |A| + |B| - |A \cup B|$.
• Using the MinHash (MH) sketch, which estimates the Jaccard index that measures how similar two sets are: $\frac{|A \cap B|}{|A \cup B|}$. Given the MH sketch one could estimate the intersection with $|A \cap B| \approx MH(A, B) \times |A \cup B|$.

It turns out that both approaches yield bad approximations when the overlap is small, which makes set intersection not particularly attractive for many use-cases, which is why we decided to use only the union operation for HLL sketches for our datasets.

Going back to the original problem of estimating the ER for an arbitrary segment, by computing the HLL sketches for all combinations of values across all dimensions the HLL sketch for any segment can be derived using set union.

Implementation

At Mozilla we use both Spark and Presto for analytics and even though both support HLL  their implementation is not compatible, i.e. Presto can’t read HLL sketches serialized from Spark. To that end we created a Spark package, spark-hyperloglog, and a Presto plugin, presto-hyperloglog, to extend both systems with the same HLL implementation.

Usage

Mozilla employees can access the client_count table from re:dash and run Presto queries against it to compute DAU, MAU or the ER for a certain segment. For example, this is how DAU faceted by channel and activity date can be computed:


SELECT normalizedchannel,
activitydate,
cardinality(merge(cast(hll AS HLL))) AS hll
FROM client_count
GROUP BY activitydate, normalizedchannel
ORDER BY normalizedchannel, activitydate DESC



The merge aggregate function computes the set union while the cardinality function returns the cardinality of a sketch. A complete example with DAU, MAU and ER for the electrolysis engagement ratio can be seen here.

By default the HLL sketches in the client_count table have a standard error of about 1.6%.

# Telemetry meets SQL

In an effort to ease data access at Mozilla we started providing SQL access to our Parquet datasets through Presto. The benefit of SQL is that it’s a commonly understood language among engineers and non-engineers alike and is generally easy to pick up.

Even though we use Redshift for some of our data,  there are datasets that store complex hierarchical data which would require unnatural transformations to fit in a typical SQL store that doesn’t support the flexibility of a nested data model (think of structs, arrays & maps) such as the one provided by Parquet. Furthermore, we were looking to use a single store for our data, i.e. Parquet files on S3, which both Spark and a SQL engine could access directly.

Presto provides the best of both worlds: SQL over Parquet. Presto is an open-source distributed SQL query engine for running interactive analytic queries against various data sources. It allows querying different sources such as Hive and Cassandra, relational databases or even proprietary data stores and a single query can combine data from multiple sources.

Apache Drill offers similar features without requiring up-front schema knowledge, which is a big advantage over Presto given how painful a schema migration can be at times. Overall Drill feels less mature than Presto though and is not supported yet by Amazon EMR unlike Presto, which makes deployment & maintance more involved. For those reasons we picked Presto as our SQL engine.

Presto supports a rich set of data types which map to Parquet ones:

Arrays
The [] operator is used to access an element of an array:

SELECT my_array[1] AS first_element


Maps
The [] operator is used to retrieve the value corresponding to a given key from a map:

SELECT name_to_age_map['Bob'] AS bob_age


Structs
Fields within a structure are accessed with the field reference operator .:

SELECT my_column.my_field


Unnesting maps and structs
The UNNEST clause is used to expand an array or a map into a relation. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument:


SELECT numbers, animals, n, a
FROM (
VALUES
(ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']),
(ARRAY[7, 8, 9], ARRAY['cow', 'pig'])
) AS x (numbers, animals)
CROSS JOIN UNNEST(numbers, animals) AS t (n, a);



which yields:

  numbers  |     animals      |  n   |  a
-----------+------------------+------+------
[2, 5]    | [dog, cat, bird] |    2 | dog
[2, 5]    | [dog, cat, bird] |    5 | cat
[2, 5]    | [dog, cat, bird] | NULL | bird
[7, 8, 9] | [cow, pig]       |    7 | cow
[7, 8, 9] | [cow, pig]       |    8 | pig
[7, 8, 9] | [cow, pig]       |    9 | NULL


The longitudinal view is currently accessible through Presto.  As a concrete example, this is how one could count the number of Telemetry fragments over a time range:


SELECT d, count(*)
FROM (
SELECT substr(ssd, 1, 11) AS d
FROM longitudinal
CROSS JOIN UNNEST(subsession_start_date) AS s(ssd)
)
WHERE d >= '2015-11-15' AND d < '2016-03-15'
GROUP BY d
ORDER BY d ASC



Re:dash
Re:dash allows to query Presto directly from Firefox. After a query is run a table is displayed with the result. Queries can be saved and optionally scheduled to run periodically at a given time.

Different kinds of plots (e.g. bar charts, line charts, boxplots, …) can be built over a table which are updated every time the query is re-run.

Custom dashboards can be built that link to tables and plots. Users can visualize the dashboards and optionally access the SQL code that powers them and fork it.

Mozilla’s re:dash instance can be accessed at sql.telemetry.mozilla.org.

# Longitudinal studies with Telemetry

Unified Telemetry (UT) replaced both Telemetry and FHR at the end of last year. FHR has been historically used to answer longitudinal questions, such as churn, while Telemetry has mainly been used for performance studies.

In UT-land, multiple self-contained submissions are generated for a profile over time in contrast to FHR for which submissions contained all historical records. With the new format, a typical longitudinal query on the raw data requires conceptually a big group-by on the profile ID over all submissions to recreate the history for each profile. To avoid the expensive grouping operation we are providing a longitudinal batch view of our Telemetry dataset.

The longitudinal view is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of chronologically sorted values, one per Telemetry submission received for that profile. Even though each profile could have been represented as an array of structures with Parquet, ultimately we decided to represent it as a structure of arrays to deal with some inefficiencies in reading nested data from Spark.

The dataset is periodically regenerated from scratch, which allows us to apply non backward compatible changes to the schema and not worry about merging procedures.

The  dataset can be accessed from Mozilla’s Spark clusters as a DataFrame:


frame = sqlContext.sql(SELECT * FROM longitudinal)



The view contains several thousand measures, which include all histograms and a large part of our scalar metrics stored in the various sections of Telemetry submissions.

frame.printSchema()

root
|-- profile_id: string
|-- normalized_channel: string
|-- submission_date: array
|    |-- element: string
|-- build: array
|    |-- element: struct
|    |    |-- application_id: string
|    |    |-- application_name: string
|    |    |-- architecture: string
|    |    |-- architectures_in_binary: string
|    |    |-- build_id: string
|    |    |-- version: string
|    |    |-- vendor: string
|    |    |-- platform_version: string
|    |    |-- xpcom_abi: string
|    |    |-- hotfix_version: string
...


A Jupyter notebook with example queries that use the longitudinal dataset is available on Spark clusters launched from a.t.m.o.

# Detecting Talos regressions

This post is about modelling Talos data with a probabilistic model which can be applied to different use-cases, like detecting regressions and/or improvements over time.

Talos is Mozilla’s multiplatform performance testing framework written in python that we use to run and collect statistics of different performance tests after a push.

As a concrete example, this is how the performance data of a test might look like over time:

Even though there is some noise, which is exacerbated in this graph as the vertical axis doesn’t start from 0, we clearly see a shift of the distribution over time. We would like to detect such shifts as soon as possible after they happened.

Talos data has been known for a while to generate in some cases bi-modal data points that can break our current alerting engine.

Possible reasons for bi-modality are documented in Bug 908888. As past efforts to remove the bi-modal behavior at the source have failed we have to deal with it in our model.

The following are some notes originated from conversations with Joel, Kyle, Mauro and Saptarshi.

Mixture of Gaussians

The data can be modelled as a mixture of $K$ Gaussians, where the parameter $K$ could be determined by fitting $K$ models and selecting the best one according to some criteria.

The first obstacle is to estimate the parameters of the mixture from a set of data points. Let’s state this problem formally; if you are not interested in the mathematical derivation it suffices to know that scikit-learn has an efficient implementation of it.

EM Algorithm

We want to find the probability density $f(x)$, where $f$ is a mixture of $K$ Gaussians, that is most likely to have generated a given data point $x$:

$f(x; \theta) = \sum_{k=1}^{K} p_k g(x; \mu_k, \sigma_k)$

where $p_k$ is the mixing coefficient of cluster $K$, i.e. the probability that a generic point belongs to cluster $K$ so that $\sum_{k=1}^{K}p_k = 1$, and $g$ is the probability density function of the normal distribution:

$g(x; \mu_k, \theta_k) = \frac{1}{(\sqrt{2\pi\sigma_k})}e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}}$

Now, given a set of $N$ data points that are independent and identically distributed, we would like to determine the values of $p_k$, $\mu_k$ and $\sigma_k$ that maximize the log-likelihood function. Finding the maximum of a function often involves taking the derivative of a function and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.

$\log\mathcal{L}(\theta | x_1, ..., n_n) = \log\prod_{i=1}^{N} f(x_i; \theta)$

To find a maximum of  $\mathcal{L}$, let’s compute the partial derivative of it wrt $\mu_k$, $\sigma_k$ and $p_k$. Since

$\frac{\partial{g(x_i; \mu_k, \theta_k)}}{\partial{\mu_k}} = g(x_i; \mu_k, \theta_k) \frac{\partial}{\partial{\mu_k}} [-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}] = \frac{g(x_i; \mu_k, \theta_k) (x_i - \mu_k)}{\sigma_k^2}$

then

$\frac{\partial\log\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{1}{\sigma_k^2}\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}(x_i - \mu_k)$

But, by Bayes’ Theorem, $\frac{p_kg(x_i; \mu_k, \sigma_k)}{\sum_{j=1}^{N}p_jg(x_i; \mu_j, \sigma_j)}$ is the conditional probability of selecting cluster $k$ given that the data point $x_i$ was observed, i.e. $p(k|x_i)$, so that:

$\frac{\log\partial\mathcal{L}}{\partial{\mu_k}} = \sum_{i=1}^{N}\frac{p(k|x_i)}{\sigma_k^2}(x_i - \mu_k)$

By applying a similar procedure to compute the partial derivative with respect to $\sigma_k$ and $p_k$ and finally setting the derivatives we just found to zero, we obtain:

$\mu_k = \frac{\sum_{i=1}^{N}p(k|x_i)x_i}{\sum_{i=1}^{N}p(k|x_i)}$

$\sigma_k = \sqrt{\frac{\sum_{i=1}^{N}p(k|x_i)(x_i - \mu_k)^2}{\sum_{n=1}^{N}p(k|x_i)}}$

$p_k = \frac{1}{N}\sum_{i=1}^{N}p(k|x_i)$

The first two equations turn out to be simply the sample mean and standard deviation of the data weighted by the conditional probability that component $k$ generated the data point $x_i$.

Since the terms $p(k|x_i)$ depend on all the terms on the left-hand side of the expressions above, the equations are hard to solve directly and this is where the EM algorithm comes to rescue. It can be proven that the EM algorithm convergences to a local maximum of the likelihood function when the following computations are iterated:

###### E Step

$p^{(n)}(k|x_i) = \frac{p_k^{(n)}g(x_i; \mu_k^{(n)}, \sigma_k^{(n)})}{\sum_{j=1}^{N}p_jg(x_i; \mu_j^{(n)}, \sigma_j^{(n)})}$

###### M Step

$\mu_k^{(n+1)} = \frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)x_i}{\sum_{i=1}^{N}p^{(n)}(k|x_i)}$

$\sigma_k^{(n+1)} = \sqrt{\frac{\sum_{i=1}^{N}p^{(n)}(k|x_i)(x_i - \mu^{(n+1)})^2}{\sum_{n=1}^{N}p^{(n)}(k|x_i)}}$

$p_k^{(n+1)} = \frac{1}{N}\sum_{i=1}^{N}p^{(n)}(k|x_i)$

Intuitively, in the E-step the parameters of the components are assumed to be given and the data points are soft-assigned to the clusters. In the M-step we compute the updated parameters for our clusters given the new assignment.

Determine K

Now that we have a way to fit a mixture of $K$ gaussians to our data, how do we determine $K$? One way to deal with it is to generate $K$ models and select the best one according to their BIC score. Adding more components to a model will fit the data better but doing so may result in overfitting. BIC prevents this problem by introducing a penalty term for the number of parameters in the model.

Regression Detection

A simple approach to detect changes in the series is to use a rolling window and compare the distribution of the first half of the window to the distribution of the second half. Since we are dealing with Gaussians, we can use the z-statistic to compare the mean of each component in the left window to mean of its corresponding component in the right window:

$z = \frac{\mu_l - \mu_r}{\sqrt{\frac{\sigma_l^2}{n_l} + \frac{\sigma_r^2}{n_r}}}$

In the following plots the red dots are points at which the regression detection would have fired. Ideally the system would generate a single alert per cluster for the first point after the distribution shift.

Talos generates hundreds of different time series, some with dominating and peculiar noise patterns. As such it’s hard to come up with a generic model that solves the problem for good and represents the data perfectly.

Since the API to access this data is public, it provides an exciting opportunity for a contributor to come up with better ways of representing it. Feel free to join us on #perf if you are interested. Oh and, did I mentions we are hiring a Senior Data Engineer?