Differential Privacy: Embedding Privacy Into Data Usage

Capgemini Invent · Published in The Startup · Jul 8, 2020 · 10 min read

Introduction

Data is considered the “black gold” of the 21st century. It is everywhere, comes from multiple sources and is constantly being collected and stored. It has become part of companies’ DNA and is central to their strategies. Governments and organizations are also major players in this mass data collection and analysis.

Although this is very valuable to data collectors (curators), large-scale data collection is not harmless for the individuals whose personal information is gathered and analysed. Some datasets contain private and sensitive information, and privacy violations may occur, with individual information revealed either at the data collection stage or later at the data analysis stage. Such violations are severely sanctioned, for example under the GDPR, enforced in the EU since May 2018.

While we can understand that companies seek a competitive advantage through the extensive use of customer data to improve their products and services, the protection of personal information is also a legitimate and essential claim.

Privacy by Design, the idea that IT systems should embed privacy algorithms and techniques as early as possible, starting from the design stage, is becoming a reality.

Apple (iPhone), Google (Chrome browser) and Microsoft are among the big players that have built privacy into their products to protect customers’ personal data while still generating business insights from that data.

Before getting into privacy mechanisms, let’s look at some privacy violation cases that underscore why such mechanisms are needed.

Examples of privacy violations

Many examples of data privacy violations can be found in the literature and in the news.

About three years ago, Bose Corporation was sued for allegedly using an app to track the music, podcasts and other audio its customers listened to through their wireless headphones (see https://www.itnews.com.au/news/bose-sued-over-alleged-privacy-breach-458877 for more information).

The French newspaper “Les Echos” reported in its November 21st, 2019 edition that a subsidiary of Accor Hotels had been hacked. Historical sensitive data such as hotel and transportation bookings, billing records and partial credit card information of more than 130,000 customers had been leaked. The company reported this event to the CNIL, the French agency in charge of regulating data usage, protecting data subjects and issuing sanctions. The CNIL, in the exercise of its duties, has already fined other companies such as Bouygues Telecom, Uber and Marriott Hotels for exposing customers’ personal data.

Everyone may remember what was described as the largest cybersecurity breach ever at the time: personal data associated with more than 500 million Yahoo users was stolen as part of a massive 2014 attack on Yahoo’s network (see https://money.cnn.com/2016/09/22/technology/yahoo-data-breach/ for more information).

In addition to these privacy violation examples, there are the well-known cases of Latanya Sweeney’s de-anonymization of a medical dataset and of the Netflix data science competition. Sweeney linked voter registration records to an anonymized medical dataset and easily re-identified people in it.

According to the author, 87% of the US population can be identified from their ZIP code, date of birth and gender [1].

Netflix released an anonymized dataset as part of a huge competition for data scientists. Using other open online datasets such as IMDb, anonymized users were easily re-identified and their movie preferences revealed [2].

The above cases are not isolated: stored personal data carries a huge potential for harm, whether that harm comes from the data collector itself (the company) or from an external individual or organization (an attacker).

As a consequence, combining a strong privacy mechanism with data storage is key to safe data usage.

Privacy-preserving mechanism

There are numerous techniques designed to help ensure data privacy. A state-of-the-art and appealing technique in this field is the Differential Privacy mechanism (or its variant, the Local Differential Privacy mechanism).

Imagine you are asked to take part in a social science study that wants to know whether you have a particular embarrassing or compromising characteristic P. To report your answer:

  • you are first asked to flip a coin
  • if it comes up tails, then respond truthfully
  • if heads, then flip a second coin and respond Yes if it is heads and No if it comes up tails

This is called a randomized response process: it gives you plausible deniability for your answer.

Even if your answer is Yes, it doesn’t necessarily mean that you have the embarrassing characteristic P.

This process is a simple example of privacy being embedded directly into data collection.
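As a minimal Python sketch of this coin-flip protocol (the function name and the estimation formula in the comment are illustrative, not taken from the article):

```python
import random

def randomized_response(has_property_p: bool) -> bool:
    """Report whether you have property P, with plausible deniability."""
    if random.random() < 0.5:        # first coin comes up tails: answer truthfully
        return has_property_p
    return random.random() < 0.5     # first coin came up heads: second coin decides

# The curator can still estimate the true rate p from many noisy reports, since
# P(report "Yes") = 0.5 * p + 0.25, hence p ≈ 2 * (observed "Yes" rate - 0.25).
```

Even though each individual answer is deniable, an aggregator can still recover the population-level statistic, which is exactly the trade-off Differential Privacy formalizes.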

In their book on the foundations of differential privacy, Cynthia Dwork and Aaron Roth [3] state that the promise of Differential Privacy (DP) is simple:

You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.

From an attacker’s viewpoint, Differential Privacy prevents information leakage by only providing noisy, aggregated information about the individuals in a dataset. For a dataset of n individuals, this means Differential Privacy will not reveal the remaining individual’s information even if an attacker already holds the information of the other n-1 individuals.

Fig 1: Concept of Differential Privacy, from [4]

The mechanism that yields this aggregated result is said to be differentially private.

Mathematically speaking, let D and D′ be two neighboring datasets, meaning that D′ contains the information of all individuals in D except for one, let M be a randomized mechanism and S a set of possible outputs. The mechanism M is ε-differentially private if:
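For every such pair of neighboring datasets and every set of outputs S (restating the standard definition from [3]):

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]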

The factor ε is called the privacy budget. This privacy budget defines the level of privacy embedded in the mechanism M. The more private the mechanism, the smaller ε.

How can one then design the mechanism M in order to protect the individuals’ information in a dataset?

When a user sends a query to a database, the answer is computed through a function f. Before this answer is reported back to the user, noise is added to it. This noise can be drawn from several probability distributions, such as the Laplace distribution. The Laplace mechanism is suitable for numeric queries, while the Exponential mechanism is used for non-numeric queries [4].

Before getting to the formal definition, we need to introduce the notion of sensitivity. The sensitivity of a query function is the maximum amount by which a single individual’s information can change its result. Recalling the neighboring datasets from above, the sensitivity of a count query is 1: removing a single individual from the database changes the result of the count query by at most 1.
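Formally, the sensitivity is the largest change of f over all pairs of neighboring datasets:

Δf = max ‖f(D) − f(D′)‖₁, where the maximum is taken over all neighboring datasets D and D′.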

Then, the formal definition of a Laplace mechanism is:
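M(D) = f(D) + Lap(Δf / ε)

where Lap(b) denotes a random variable drawn from the Laplace distribution with mean 0 and scale b (this is the standard form of the mechanism, as in [3, 4]).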

Δf is the sensitivity of the query f
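As an illustration, here is a minimal Python sketch of a Laplace-noised count query (the helper name, the epsilon value and the patient-record usage are hypothetical, not from the article):

```python
import numpy as np

def private_count(records, predicate, epsilon=0.1, sensitivity=1.0):
    """Answer a count query with Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical usage on a list of patient dictionaries:
# private_count(patients, lambda p: p["disease"] == "diabetes", epsilon=0.1)
```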

The Exponential mechanism is written:
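Pr[M(D) = r] ∝ exp( ε · f(D, r) / (2 · Δf) )

(this is the standard form, as given in [3, 4])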

r is a candidate output of the query (e.g. the output class for a categorical variable) and f is the chosen scoring function.

Let’s consider the following medical record dataset, which contains information on the patients of a district:

Table 1: illustration from [4]

For instance, if we send the query “what is the most common disease in the district?”, we first compute the count of each disease, then pick the highest one. When someone new settles in the district and has his or her medical record saved, sending the same query again would let us infer, for example, whether that person has diabetes. But if we apply an exponential mechanism to this query, we get the following table:

Table 2: the same query under different privacy budgets, from [4]

First, we can notice that whatever privacy budget we choose, we never get the strictly exact result; the output varies with the chosen privacy budget. When ε is very close to 0, all the probabilities are nearly equal, which means the privacy constraint is too strong and the result is no longer informative. When ε is high (equal to 1 here), the privacy constraint is too weak and the output sticks too closely to the true answer. Setting ε to 0.1 gives a fair picture of reality: we obtain nearly the true frequencies while still protecting the personal information of the patients in the database.

The calculation of the values in the previous table is shown below for ε=0.1.

In this case, a = 0.09851033.
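As a sketch of how such probabilities are obtained for any set of disease counts (the counts below are placeholders, not the actual values from Table 1):

```python
import numpy as np

def exponential_mechanism_probabilities(scores, epsilon, sensitivity=1.0):
    """Selection probability of each candidate under the exponential mechanism."""
    weights = {answer: np.exp(epsilon * score / (2 * sensitivity))
               for answer, score in scores.items()}
    total = sum(weights.values())
    return {answer: weight / total for answer, weight in weights.items()}

# Placeholder disease counts (not Table 1's real values):
counts = {"diabetes": 4, "cancer": 2, "heart disease": 3}
print(exponential_mechanism_probabilities(counts, epsilon=0.1))
print(exponential_mechanism_probabilities(counts, epsilon=1.0))  # weaker privacy
```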

A glimpse into Local Differential Privacy (LDP)

Differential Privacy requires users to send their raw data to an aggregator, which in turn has to add noise before publishing or analysing anything. This implies users have to trust the data aggregator.

In the Local Differential Privacy paradigm, noise is added to the data by the users themselves before sending it to the data aggregator [4].

On modern smart devices, this is done by implementing the noising mechanism on the user’s device itself; the collected data therefore already carries plausible deniability for the user.
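A minimal sketch of such on-device perturbation, here an ε-parametrized randomized response on a single bit (a simplified illustration, not the full RAPPOR encoding mentioned below):

```python
import math
import random

def perturb_bit(true_bit: int, epsilon: float) -> int:
    """Randomize a single bit on the user's device before it is sent anywhere.

    Keeping the true value with probability e^eps / (e^eps + 1) makes the
    report epsilon-locally differentially private.
    """
    keep_probability = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if random.random() < keep_probability else 1 - true_bit
```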

This variant brings stronger guarantees on users’ data privacy without the need to trust a third party. For a complete and simple introduction to LDP, see [5, 6]. LDP is widely accepted in the community as a main way of embedding privacy into users’ data. For example, Google’s Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) is such an implementation of LDP, developed for the Google Chrome web browser.

Differential Privacy in Machine Learning

Let’s now look into how one can implement a machine learning algorithm while ensuring privacy.

Let’s consider a simple machine learning algorithm, the decision tree. Geetha Jagannathan et al. [7] proposed building a differentially private classifier, not from a single decision tree but from random decision trees, which they have shown to give good results even on small datasets. The steps of this algorithm are the following:

  • Choose N, the total number of trees to be constructed
  • For every tree Ti, split the data into multiple random partitions
  • Within every leaf node of the tree, add Laplace noise to the count of samples for each value of the label (a minimal sketch of this step follows the list)
Fig 2: Illustration of DP in tree construction
  • To predict a new sample, run it through every tree, collect the partitions in which the sample lands, sum the noisy counts of each label over the collected partitions and compute the probability of the sample belonging to each class
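A minimal sketch of the leaf-noising step mentioned above (the data structures and parameter values are illustrative, not taken from [7]):

```python
import numpy as np

def noisy_leaf_counts(label_counts, epsilon):
    """Add Laplace noise with sensitivity 1 to each label count of a leaf node."""
    return {label: count + np.random.laplace(0.0, 1.0 / epsilon)
            for label, count in label_counts.items()}

# Example: a leaf holding 7 samples of class "yes" and 2 of class "no"
print(noisy_leaf_counts({"yes": 7, "no": 2}, epsilon=0.5))
```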

As for Deep Learning methods, the work by Martin Abadi et al. on deep learning with differential privacy shows how privacy can be ensured while training a neural network. They (re)designed the Stochastic Gradient Descent algorithm and the hyperparameter tuning procedure under a privacy budget constraint. This approach is implemented in the TensorFlow Privacy library; see https://github.com/tensorflow/privacy for more information.
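As a rough sketch, training with the DPKerasSGDOptimizer from the TensorFlow Privacy library could look as follows (the model architecture and hyperparameter values are illustrative assumptions, not the paper’s settings):

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

# Toy classifier; the architecture is a placeholder
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2),
])

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,   # std of the Gaussian noise relative to the clipping norm
    num_microbatches=32,    # must evenly divide the batch size
    learning_rate=0.1,
)

# Per-example (unreduced) losses are required so gradients can be clipped individually
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=5)
```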

Conclusion

Digitalisation has made it easier to collect data from groups of people, especially through modern personal devices. Many privacy violations have weakened trust between customers and companies, and the demand for more privacy in the collection and usage of data is now at its peak.

(Local) Differential Privacy, a technique that adds noise to the data to give users plausible deniability while still allowing data analysis, provides a solution to this problem.

At the time of this article’s publication, there is an ongoing debate about using a mobile app to track people who are asymptomatic carriers of COVID-19 in order to protect public health and the economy. The concern is the level of privacy such an app could embed, as it would track people’s geolocation nearly all the time. Using (L)DP could be an effective way to ensure privacy in such an app and to make its geolocation tracking easier for people to accept.

The privacy field is getting a lot of attention and the community is growing. Such methods may become more widely adopted to protect more people.

About the author

Charles DOKODJO is a Data Scientist at Capgemini Invent France. He has had the opportunity to work on many client projects ranging from computer vision to text analysis and classical machine learning. He has also contributed to student training within partner universities of Capgemini Invent. He recently discovered the field of Ethical AI, which he has become passionate about.

References

[1]: Latanya Sweeney, k-anonymity: a model for protecting privacy, https://epic.org/privacy/reidentification/Sweeney_Article.pdf

[2]: Arvind Narayanan and Vitaly Shmatikov, Robust De-anonymization of Large Sparse Datasets, https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

[3]: C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, vol. 9, nos. 3–4, pp. 211–407, 2014, https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

[4]: Tianqing Zhu, Gang Li, Wanlei Zhou, Philip S. Yu, Differential Privacy and Applications, https://www.springer.com/gp/book/9783319620022

[5]: Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang, Privacy at Scale: Local Differential Privacy in Practice, in SIGMOD’18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, https://doi.org/10.1145/3183713.3197390

[6]: Björn Bebensee, Local Differential Privacy: a tutorial, https://arxiv.org/pdf/1907.11908.pdf

[7]: Geetha Jagannathan, Krishnan Pillaipakkamnatt, Rebecca N. Wright, A Practical Differentially Private Random Decision Tree Classifier, https://columbia.github.io/private-systems-class/papers/Jagannathan2012APractical.pdf
