Differential Privacy for Dummies

Technology allows companies to collect more data and with more detail about their users than ever before. Sometimes that data is sold to third parties, other times it’s used to improve products and services.

In order to protect users’ privacy, anonymization techniques can be used to strip away any piece of personally identifiable data and let analysts access only what’s strictly necessary. As the Netflix competition in 2007 has shown though, that can go awry. The richness of data allows to identify users through a sometimes surprising combination of variables like the dates on which an individual watched certain movies. A simple join between an anonymized datasets and one of many publicly available, non-anonymized ones, can re-identify anonymized data.

Aggregated data is not much safer either under some circumstances! For example, say we have two summary statistics: one is the number of users, including Frank, that watch one movie per day and the other is the number of users, without Frank, that watch one movie per day. Then, by comparing the counts, we could tell if Frank watches one movie per day.

Differential Privacy to the rescue

Differential privacy formalizes the idea that a query should not reveal whether any one person is present in a dataset, much less what their data are. Imagine two otherwise identical datasets, one with your information in it, and one without it. Differential Privacy ensures that the probability that a query will produce a given result is nearly the same whether it’s conducted on the first or second dataset. The idea is that if an individual’s data doesn’t significantly affect the outcome of a query, then he might be OK in giving his information up as it is unlikely that the information would be tied back to him. The result of the query can damage an individual regardless of his presence in a dataset though. For example, if an analysis on a medical dataset finds a correlation between lung cancer and smoking, then the health insurance cost for a particular smoker might increase regardless of his presence in the study.

More formally, differential privacy requires that the probability of a query producing any given output changes by at most a multiplicative factor when a record (e.g. an individual) is added or removed from the input. The largest multiplicative factor quantifies the amount of privacy difference. This sounds harder than it actually is and the next sections will iterate on the concept with various examples, but first we need to define a few terms.

Dataset

We will think of a dataset d as being a collection of records from an universe U . One way to represent a dataset d is with a histogram in which each entry d_i represents the number of elements in the dataset equal to u_i \in U . For example, say we collected data about coin flips of three individuals, then given the universe U = \{head, tail\} , our dataset d would have two entries: d_{head} = i and d_{tail} = j , where i + j = 3 . Note that in reality a dataset is likely to be an ordered lists of rows (i.e. a table) but the former representation makes the math a tad easier.

Distance

Given the previous definition of dataset, we can define the distance between two datasets x, y with the l_1 norm as:

\displaystyle ||x - y||_1 = \sum\limits_{i=1}^{|U|}  |x_i - y_i|

Mechanism

A mechanism is an algorithm that takes as input a dataset and returns an output, so it can really be anything, like a number, a statistical model or some aggregate. Using the previous coin-flipping example, if mechanism C counts the number of individuals in the dataset, then C(d) = 3 . In practice though we will specifically considering randomized mechanisms, where the randomization is used to add privacy protection.

Differential Privacy

A mechanism M satisfies \epsilon differential privacy if for every pair of datasets x, y such that ||x - y||_1 \leq 1 , and for every subset S \subseteq \text{Range}(M) :

\displaystyle \frac{Pr[M(x) \in S]}{Pr[M(y) \in S]} \leq e^{\epsilon}

What’s important to understand is that the previous statement is just a definition. The  definition  is  not  an  algorithm,  but  merely  a  condition that must be satisfied by a mechanism to claim that it satisfies \epsilon differential privacy. Differential privacy allows researchers to use a common framework to study algorithms and compare their privacy guarantees.

Let’s check if our mechanism C satisfies 1 differential privacy. Can we find a counter-example for which:

\displaystyle \frac{Pr[C(x) \in S]}{Pr[C(y) \in S]} \leq e

is false? Given x, y such that ||x - y||_1 = 1 and ||x||_1 = k , then:

\displaystyle \frac{Pr[C(x) = k]}{Pr[C(y) = k]} \leq e

i.e. \frac{1}{0} \leq e , which is clearly false, hence this proves that mechanism C doesn’t satisfy 1 differential privacy.

Composition theorems

A powerful property of differential privacy is that mechanisms can easily be composed. These require the key assumption that the mechanisms operate independently given the data.

Let d be a dataset and g an arbitrary function. Then, the sequential composition theorem asserts that if M_I(d) is \epsilon_i differentially private, then M(d) = g(M_1(d), M_2(d), ..., M_N(d)) is \sum_{i=1}^{N}\epsilon_i differentially private. Intuitively this means that given an overall fixed privacy budget, the more mechanisms are applied to the same dataset, the more the available privacy budget for each individual mechanism will decrease.

The parallel composition theorem asserts that given N partitions of a dataset d , if for an arbitrary partition d_i , M_i(d_i) is \epsilon differentially private, then M(d) = g(M_1(d_1), M_2(d_2), ..., M_N(d_N)) is \epsilon differentially private. In other words, if a set of \epsilon differentially private mechanisms is applied to a set of disjoint subsets of a dataset, then the combined mechanism is still \epsilon differentially private.

The randomized response mechanism

The first mechanism we will look into is “randomized response”, a technique developed in the sixties by social scientists to collect data about embarrassing or illegal behavior. The study participants have to answer a yes-no question in secret using the following mechanism M_R(d, \alpha, \beta) :

  1. Flip a biased coin with probability of heads \alpha ;
  2. If heads, then answer truthfully with d ;
  3. If tails, flip a coin with probability of heads \beta and answer “yes” for heads and “no” for tails.

In code:

def randomized_response_mechanism(d, alpha, beta):
    if random() < alpha:
        return d
    elif random() < beta:
        return 1
    else:
        return 0

Privacy is guaranteed by the noise added to the answers. For example, when the question refers to some illegal activity, answering “yes” is not incriminating as the answer occurs with a non-negligible probability whether or not it reflects reality, assuming \alpha and \beta are tuned properly.

Let’s try to estimate the proportion p of participants that have answered “yes”. Each participant can be modeled with a Bernoulli variable X_i which takes a value of 0 for “no” and a value of 1 for “yes”. We know that:

\displaystyle P(X_i = 1) = \alpha p + (1 - \alpha) \beta

Solving for p yields:

\displaystyle p = \frac{P(X_i = 1) - (1 - \alpha) \beta}{\alpha}

Given a sample of size n , we can estimate P(X_i = 1)  with \frac{\sum_{i=1}^{i=n} X_i}{n} .  Then, the estimate \hat{p} of p is:

\displaystyle \hat{p} = \frac{\frac{\sum\limits_{i=1}^{i=n} X_i}{n} - (1 - \alpha) \beta}{\alpha}

To determine how accurate our estimate is we will need to compute its standard deviation. Assuming the individual responses X_i are independent, and using basic properties of the variance,

\displaystyle \mathrm{Var}(\hat{p}) = \mathrm{Var}\biggl({\frac{\sum_{i=1}^{i=n} X_i}{n \alpha}}\biggr) = {\frac{\mathrm{Var}(X_i)}{n \alpha^2}}

By taking the square root of the variance we can determine the standard deviation of \hat{p} . It follows that the standard deviation s is proportional to \frac{1}{\sqrt{n}} , since the other factors are not dependent on the number of participants. Multiplying both \hat{p} and s by n yields the estimate of the number of participants that answered “yes” and its relative accuracy expressed in number of participants, which is proportional to \sqrt{n} .

The next step is to determine the level of privacy that the randomized response method guarantees. Let’s pick an arbitrary participant. The dataset d is represented with either 0 or 1 depending on whether the participant answered truthfully with a “no” or “yes”. Let’s call the two possible configurations of the dataset respectively d_{no} and d_{yes} . We also know that ||d_i - d_j||_1 \leq 1 for any i, j . All that’s left to do is to apply the definition of differential privacy to our randomized response mechanism M_R(d, \alpha, \beta) :

\displaystyle \frac{Pr[M_R(d_{i}, \alpha, \beta) = k]}{Pr[M_R(d_{j}, \alpha, \beta) = k]} \leq e^{\epsilon}

The definition of differential privacy applies to all possible configurations of i, j \text{ and } k , e.g.:

\displaystyle \begin{gathered}\frac{Pr[M_R(d_{yes}, \alpha, \beta) = 1]}{Pr[M_R(d_{no}, \alpha, \beta) = 1]} \leq e^{\epsilon} \\\ln \left( {\frac{\alpha + (1 - \alpha)\beta}{1 - (\alpha + (1 - \alpha)\beta)}} \right) \leq \epsilon \\\ln \left( \frac{\alpha + \beta - \alpha \beta}{1 - (\alpha + \beta - \alpha \beta)} \right) \leq \epsilon\end{gathered}

The privacy parameter \epsilon can be tuned by varying \alpha \text{ and } \beta . For example, it can be shown that the randomized response mechanism with \alpha = \frac{1}{2} and \beta = \frac{1}{2} satisfies \ln{3} differential privacy.

The proof applies to a dataset that contains only the data of a single participant, so how does this mechanism scale with multiple participants? It follows from the parallel composition theorem that the combination of \epsilon differentially private mechanisms applied to the datasets of the individual participants is \epsilon differentially private as well.

The Laplace mechanism

The Laplace mechanism is used to privatize a numeric query. For simplicity we are going to assume that we are only interested in counting queries f , i.e. queries that count individuals, hence we can make the assumption that adding or removing an individual will affect the result of the query by at most 1.

The way the Laplace mechanism works is by perturbing a counting query f with noise distributed according to a Laplace distribution centered at 0 with scale b = \frac{1}{\epsilon} ,

\displaystyle Lap(x | b) = \frac{1}{2b}e^{- \frac{|x|}{b}}

Then, the Laplace mechanism is defined as:

\displaystyle M_L(x, f, \epsilon) = f(x) + Z

where Z is a random variable drawn from Lap(\frac{1}{\epsilon}) .

In code:

def laplace_mechanism(data, f, eps):
    return f(data) + laplace(0, 1.0/eps)

It can be shown that the mechanism preserves \epsilon differential privacy. Given two datasets x, y such that ||x - y||_1 \leq 1 and a function f which returns a real number from a dataset, let p_x denote the probability density function of M_L(x, f, \epsilon) and p_y the probability density function of M_L(y, f, \epsilon) . Given an arbitrary real point k ,

\displaystyle \frac{p_x(k)}{p_y(k)} = \frac{e^{- \epsilon |f(x) - k|}}{e^{- \epsilon |f(y) - k|}} =

\displaystyle e^{\epsilon (|f(x) - k| - |f(y) - k|)} \leq

\displaystyle e^{\epsilon |f(x) - f(y)|}

by the triangle inequality. Then,

\displaystyle e^{\epsilon |f(x) - f(y)|} \leq e^\epsilon

What about the accuracy of the Laplace mechanism? From the cumulative distribution function of the Laplace distribution it follows that if Z \sim Lap(b) , then Pr[|Z| \ge t \times b] = e^{-t} . Hence, let k = M_L(x, f, \epsilon) and \forall \delta \in (0, 1] :

\displaystyle Pr \left[|f(x) - k| \ge \ln{\left (\frac{1}{\delta} \right)} \times \frac{1}{\epsilon} \right] =

\displaystyle Pr \left[|Z| \ge \ln{\left (\frac{1}{\delta} \right)} \times \frac{1}{\epsilon} \right] = \delta

where Z \sim Lap(\frac{1}{\epsilon}) . The previous equation sets a probalistic bound to the accuracy of the Laplace mechanism that, unlike the randomized response, does not depend on the number of participants n .

Counting queries

The same query can be answered by different mechanisms with the same level of differential privacy. Not all mechanisms are born equally though; performance and accuracy have to be taken into account when deciding which mechanism to pick.

As a concrete example, let’s say there are n individuals and we want to implement a query that counts how many possess a certain property p . Each individual can be represented with a Bernoulli random variable:

participants = binomial(1, p, n)

We will implement the query using both the randomized response mechanism M_R(d, \frac{1}{2}, \frac{1}{2}) , which we know by now to satisfy \ln{3} differential privacy, and the Laplace mechanism M_L(d, f, \ln{3}) which satisfies \ln{3} differential privacy as well.

def randomized_response_count(data, alpha, beta):
    randomized_data = randomized_response_mechanism(data, alpha, beta)
    return len(data) * (randomized_data.mean() - (1 - alpha)*beta)/alpha

def laplace_count(data, eps):
    return laplace_mechanism(data, np.sum, eps)

r = randomized_response_count(participants, 0.5, 0.5)
l = laplace_count(participants, log(3))

Note that while that while M_R is applied to each individual response and later combined in a single result, i.e. the estimated count, M_L is applied directly to the count, which is intuitively why M_R is noisier than M_L . How much noisier?  We can easily simulate the distribution of the accuracy for both mechanisms with:

def randomized_response_accuracy_simulation(data, alpha, beta, n_samples=1000):
    return [randomized_response_count(data, alpha, beta) - data.sum()
           for _ in range(n_samples)]

def laplace_accuracy_simulation(data, eps, n_samples=1000):
    return [laplace_count(data, eps) - data.sum()
            for _ in range(n_samples)]

r_d = randomized_response_accuracy_simulation(participants, 0.5, 0.5)
l_d = laplace_accuracy_simulation(participants, log(3))

As mentioned earlier, the accuracy of M_R grows with the square root of the number of participants:

index
Randomized Response Mechanism Accuracy

while the accuracy of M_L is a constant:

index
Laplace Mechanism Accuracy

You might wonder why one would use the randomized response mechanism if it’s worse in terms of accuracy compared to the Laplace one. The thing about the Laplace mechanism is that the private data about the users has to be collected and stored, as the noise is applied to the aggregated data. So even with the best of intentions there is the remote possibility that an attacker might get access to it.  The randomized response mechanism though applies the noise directly to the individual responses of the users and so only the perturbed responses are collected! With the latter mechanism any individual’s information cannot be learned with certainty, but an aggregator can still infer population statistics.

That said, the choice of mechanism is ultimately a question of which entities to trust. In the medical world, one may trust the data collectors (e.g. researchers), but not the general community who will be accessing the data. Thus one collects the private data in the clear, but then derivatives of it are released on request with protections. However, in the online world, the user is generally looking to protect their data from the data collector itself, and so there is a need to prevent the data collector from ever accumulating the full dataset in the clear.

Real world use-cases

The algorithms presented in this post can be used to answer simple counting queries. There are many more mechanisms out there used to implement complex statistical procedures like machine learning models. The concept behind them is the same though: there is a certain function that needs to be computed over a dataset in a privacy preserving manner and noise is used to mask an individual’s original data values.

One such mechanism is RAPPOR, an approach pioneered by Google to collect frequencies of an arbitrary set of strings. The idea behind it is to collect vectors of bits from users where each bit is perturbed with the randomized response mechanism.  The bit-vector might represent a set of binary answers to a group of questions, a value from a known dictionary or, more interestingly, a generic string encoded through a Bloom filter. The bit-vectors are aggregated and the expected count for each bit is computed in a similar way as shown previously in this post. Then, a statistical model is fit to estimate the frequency of a candidate set of known strings. The main drawback with this approach is that it requires a known dictionary.

Later on the approach has been improved to infer the collected strings without the need of a known dictionary at the cost of accuracy and performance. To give you an idea, to estimate a distribution  over  an  unknown  dictionary  of  6-letter  strings without knowing  the  dictionary,  in  the  worst  case,  a sample size in  the  order  of  300  million  is required; the sample size grows quickly as the length of the strings increases. That said, the mechanism consistently finds the most frequent strings which enable to learn the dominant trends of a population.

Even though the theoretical frontier of differential privacy is expanding quickly there are only a handful implementations out there that, by ensuring  privacy without the need for a trusted third party like RAPPOR, suit well the kind of data collection schemes commonly used in the software industry.

References

Best practices for peer-reviewing a data analysis

Writing good code is hard, writing a good analysis is harder. Peer-review is an essential tool to fight repetitive errors, omissions and more generally divulge knowledge. I found the use of a checklist to be invaluable to help me remember the most important things I should watch out for during a review. It’s far too easy to focus on few details and ignore others which might be catched (or not) in a successive round.

I don’t religiously apply every bullet point of the following checklist to every analysis, nor is this list complete; more items would have to be added depending on the language, framework, libraries, models, etc. used.

  • Is the question the analysis should answer clearly stated?
  • Is the best/fastest dataset that can answer the question being used?
  • Do the variables used measure the right thing (e.g. submission date vs activity date)?
  • Is a representative sample being used?
  • Are all data inputs checked (for the correct type, length, format, and range) and encoded?
  • Do outliers need to be filtered or treated differently?
  • Is seasonality being accounted for?
  • Is sufficient data being used to answer the question?
  • Are comparisons performed with hypotheses tests?
  • Are estimates bounded with confidence intervals?
  • Should the results be normalized?
  • If any statistical method is being used, are the assumptions of the model met?
  • Is correlation confused with causation?
  • Does each plot communicate an important piece of information or address a question of interest?
  • Are legends and axes labelled and do the they start from 0?
  • Is the analysis easily reproducible?
  • Does the code work, i.e. does it perform its intended function?
  • Is there a more efficient way to solve the problem, assuming performance matters?
  • Does the code read like prose?
  • Does the code conform to the agreed coding conventions?
  • Is there any redundant or duplicate code?
  • Is the code as modular as possible?
  • Can any global variables be replaced?
  • Is there any commented out code and can it be removed?
  • Is logging missing?
  • Can any of the code be replaced with library functions?
  • Can any debugging code be removed?
  • Where third-party utilities are used, are returning errors being caught?
  • Is any public API commented?
  • Is any unusual behavior or edge-case handling described?
  • Is there any incomplete code? If so, should it be removed or flagged with a suitable marker like ‘TODO’?
  • Is the code easily testable?
  • Do tests exist and do they actually test that the code is performing the intended functionality?