What is Differential Privacy?

Introduction

One of the notable concepts to emerge from Apple’s Worldwide Developers Conference in San Francisco this year has been the notion of differential privacy. As Wired puts it, differential privacy is the “…statistical science of trying to learn as much as possible about a group while learning as little as possible about any individual in it.”

Let’s say you wanted to count how many of your online friends were dogs, while respecting the maxim that, on the Internet, nobody should know you’re a dog. To do this, you could ask each friend to answer the question “Are you a dog?” in the following way. Each friend should flip a coin in secret, and answer the question truthfully if the coin came up heads; but, if the coin came up tails, that friend should always say “Yes” regardless.

Then you could get a good estimate of the true count from the greater-than-half fraction of your friends that answered “Yes”. However, you still wouldn’t know which of your friends was a dog: each answer “Yes” would most likely be due to that friend’s coin flip coming up tails. (source: Google)
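
To make the mechanism concrete, the following is a minimal Swift sketch of the coin-flip scheme described above, together with the arithmetic for recovering the aggregate estimate; the function names and the simulated population of friends are purely illustrative.

    // The coin-flip scheme from the example above: heads -> answer truthfully,
    // tails -> always answer "yes", so no single report is incriminating.
    func randomizedResponse(truth: Bool) -> Bool {
        return Bool.random() ? truth : true
    }

    // Recover the aggregate: the expected fraction of reported "yes" answers is
    // 0.5 * trueFraction + 0.5, so trueFraction ≈ 2 * (reportedFraction - 0.5).
    func estimateTrueFraction(reports: [Bool]) -> Double {
        let reportedYes = Double(reports.filter { $0 }.count) / Double(reports.count)
        return max(0, 2 * (reportedYes - 0.5))
    }

    // Simulation: 10,000 friends, 30% of whom are actually dogs.
    let truths = (0..<10_000).map { _ in Double.random(in: 0..<1) < 0.3 }
    let reports = truths.map { randomizedResponse(truth: $0) }
    print(estimateTrueFraction(reports: reports))   // ≈ 0.3, yet no individual is exposed

Because a reported “Yes” has at least a fifty percent chance of being a coin artefact, no single report identifies a dog, yet the estimate converges on the true proportion as the group grows.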

The premise of differential privacy is that individual reports are indistinguishable: the random coin flips carry no unique identifiers, yet aggregating the reports lets us surface results shared by many users. With companies like Facebook and Google constantly receiving flak for compromising user privacy in exchange for the value of selling customer insights and user-targeted advertising, Apple is more or less perceived as the beacon when it comes to championing user privacy. Differential privacy is Apple’s answer as the industry embeds itself further in the mobile and wearable renaissance, with location-capable devices making privacy intrusions ever more frequent.

Whilst Apple has never had a great appetite for user data compared to the others, it has slowly and publicly revealed its intention to work with big data to improve Siri’s contextual knowledge. Instead of going the Google way, it has opted for the sliding-scale model of differential privacy.

“People have entrusted us with their most personal information. We owe them nothing less than the best protections that we can possibly provide.” (Tim Cook, White House Cybersecurity Summit, ’15).

As consumers become more security- and privacy-conscious in light of the NSA revelations and other global disclosures, it has never been more imperative that companies and developers recognize the importance of safeguarding user information. Great applications do that through the following criteria:

  1. Transparency in storage and use of personal data;
  2. Consent and Control of what personal data is made available;
  3. Security and Protection of personal data, through encryption;
  4. Use Limitation, to only what is required, at the time that it is required.

This article will dive into why it is vital that developers adopt differential privacy and, subsequently, ways in which you can implement it in your iOS app.

WHAT IS DIFFERENTIAL PRIVACY?

Differential privacy is in fact a fairly obscure branch of mathematics that aims to analyze a group as a whole whilst anonymizing the individual users within it. Its foundational text was co-authored by Professor Aaron Roth of the University of Pennsylvania, and Apple has not only embraced the approach across its own services but is strongly advocating it to its third-party developer community.

“The concept behind differential privacy is the idea of obscuring or introducing ‘noise’ into big data results to mask individual inputs while still getting useful information on larger trends,” according to Roth (as cited in AppleInsider): individual information is never revealed, yet a representative sample is still obtained for statistical purposes.

Apple has already started rolling out differential privacy across its apps and services, starting with iOS 10. In its messaging app, Apple better predicts user input based on aggregated data about which words people use. Apple has also improved its search-suggestion feature; previously it ensured all Spotlight data was stored only locally, with developers deciding what data to expose to Spotlight (through indexing).
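
For developers, that indexing decision is explicit: you choose which attributes of an item go into the on-device Spotlight index. Below is a minimal Swift sketch using Core Spotlight; the identifiers and attribute values are made up for illustration.

    import CoreSpotlight
    import MobileCoreServices

    // Index an item in Spotlight, choosing exactly which attributes to expose.
    // Only the fields set here are added to the on-device index.
    let attributes = CSSearchableItemAttributeSet(itemContentType: kUTTypeText as String)
    attributes.title = "Grilled cheese recipe"
    attributes.contentDescription = "A recipe stored locally in the app"

    let item = CSSearchableItem(uniqueIdentifier: "recipe-42",
                                domainIdentifier: "recipes",
                                attributeSet: attributes)

    CSSearchableIndex.default().indexSearchableItems([item]) { error in
        if let error = error {
            print("Spotlight indexing failed: \(error.localizedDescription)")
        }
    }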

With the latest iOS iteration, users can choose whether to submit data about their activities, and Apple adds random noise so that the data cannot be traced back to individual users.

Differential privacy preserves anonymity by providing information about the crowd without associating it with any individual; with enough answers, the random noise can be calculated out to produce a relatively accurate distribution. There are, however, some problems inherent in differential privacy.

LEVEL OF DATA COLLECTED

Data is partially recoverable, depending on how much data from an individual party is collected. Apple resolves this by sending only a subset of data (a small sample), similar to exit polls in elections, that represents the intended distribution.

Suppose you have access to a database that allows you to compute the total income of all residents in a certain area. If you knew that Mr. White was going to move to another area, simply querying this database before and after his move would allow you to deduce his income. The answer is to release an approximation of the total income, giving accurate aggregate information whilst protecting Mr. White. (source: Neustar)
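
One common way to release such an approximation is the Laplace mechanism: add noise drawn from a Laplace distribution whose scale is the query’s sensitivity divided by the privacy parameter epsilon. A minimal Swift sketch follows; the epsilon value, the income figures, and the assumption that no single income exceeds maxIncome are all illustrative.

    import Foundation

    // Laplace noise: the difference of two exponential samples is Laplace-distributed.
    func laplaceNoise(scale: Double) -> Double {
        let e1 = -scale * log(1 - Double.random(in: 0..<1))
        let e2 = -scale * log(1 - Double.random(in: 0..<1))
        return e1 - e2
    }

    // Release a noisy total: the noise scale is sensitivity / epsilon, where the
    // sensitivity is the most any single resident could change the total.
    func privateTotalIncome(incomes: [Double], epsilon: Double, maxIncome: Double) -> Double {
        let trueTotal = incomes.reduce(0, +)
        return trueTotal + laplaceNoise(scale: maxIncome / epsilon)
    }

    // Comparing this result before and after Mr. White moves no longer pins down
    // his exact income, yet the aggregate remains useful.
    let incomes = [52_000.0, 48_500.0, 61_250.0, 70_000.0]
    print(privateTotalIncome(incomes: incomes, epsilon: 0.5, maxIncome: 100_000))

Smaller values of epsilon mean stronger privacy but noisier totals, which is exactly the trade-off the following sections discuss.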

LEVEL OF NOISE USED TO OBSCURE DATA

Another issue, as we pointed out earlier: what if there are only a few users, or a small sample, because of the size of the distribution? How much noise should be used to obscure the data? When adding noise, Apple also disregards some data on a regular basis whilst uploading the rest, meaning the data is transient rather than historically empirical, which adds more weight to the noise aspect.

“Users can opt to not send any data at all, he says, and Apple will additionally discard IP addresses before storing information on its end to avoid connecting even noisy data with its origin” (source: Macworld)

The level of noise is adjusted depending on the density of the sample: if there are only two or three users, you will want to increase the noise to ensure that the counts can never be traced back to any individual user. If, say, there are only one or two people in a village, Apple will introduce enough noise that you know the frequency but not the who.

LEVEL AND FREQUENCY OF THE SAME QUESTION ASKED

Asking users the same question, or even similar questions, across a specific time period can also lead to the de-anonymization of data.

…asking too many similar questions, matters become more subtle. If you ask someone if they’re a member of the Communist party, and then ask if they admire Joseph Stalin, and then ask the ideal economic and political system, and so on, it’s possible an outside observer would eventually penetrate through the noise and determine an attitude on a given topic…. (source: Macworld). 

This leads us to the subject of privacy budget.

Privacy Budget

The privacy budget is the notion of limiting how much data about the same (or a related) subject is transmitted over a period of time. It is premised on the fact that asking the same question over and over within a short period, and getting the same noisy answers, could lead to determining the truthful answer. Privacy budgets are the scales developers work with to balance the reliability of collected and aggregated data against the prospect of re-identification.

This is a complicated mathematical theory in itself, and it is best illustrated in Anthony Tockar’s research paper. Suffice to say, the privacy budget is a more important control measure than the statistical upper bound of any single query: once a user’s query budget is exceeded, that user will not be able to make any further queries. This is essentially what defines differential privacy and the privacy budget.
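
As a rough illustration of the idea (and not of Apple’s actual implementation), the toy Swift sketch below tracks a per-user epsilon budget and refuses queries once it is spent; the budget size and per-query costs are arbitrary.

    // A toy per-user privacy budget: each query "spends" some epsilon, and once
    // the total budget is exhausted no further queries are answered.
    struct PrivacyBudget {
        let totalEpsilon: Double
        private(set) var spentEpsilon: Double = 0

        init(totalEpsilon: Double) {
            self.totalEpsilon = totalEpsilon
        }

        // Record the query's cost and allow it only if the budget still covers it.
        mutating func charge(epsilon: Double) -> Bool {
            guard spentEpsilon + epsilon <= totalEpsilon else { return false }
            spentEpsilon += epsilon
            return true
        }
    }

    var budget = PrivacyBudget(totalEpsilon: 1.0)
    for query in 1...5 {
        if budget.charge(epsilon: 0.3) {
            print("Query \(query) answered; spent \(budget.spentEpsilon) so far")
        } else {
            print("Query \(query) refused: privacy budget exceeded")
        }
    }

With a total budget of 1.0 and a cost of 0.3 per query, the first three queries are answered and the rest are refused, no matter how they are phrased.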

Other Measures to Secure User Data

At the start of this article I mentioned the four important aspects that, together with differential privacy, amount to the secure guardianship of your users’ data.

TRANSPARENCY

Users demand transparency when it comes to how their information is stored and used, so as a good model-citizen developer, providing a disclosure to that effect is imperative. Users still own their information; giving you access to it means they need to understand how you plan on using their data.

In iOS 10, Apple’s Advertising network has an icon that allows users to see how some of their non-identifiable data made the ad relevant. 

 

CONSENT & CONTROL


This leads on to consent and control: giving users the ability to consent to whether they want to give you information and how much of it, as well as the ability to rescind their permissions. This should be explicit, and iOS for the most part provides an easy way for you to ask permission before you access personal information, be it the address book, photos library, location, and so on, as the sketch below shows.
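
Here is a minimal Swift sketch for the photos library, requesting access only at the moment it is needed and degrading gracefully when the user declines; the helper name is illustrative.

    import Photos

    // Ask for photo library access only when a feature actually needs it, and
    // degrade gracefully if the user declines. An NSPhotoLibraryUsageDescription
    // entry in Info.plist is also required.
    func withPhotoAccess(_ onAuthorized: @escaping () -> Void) {
        switch PHPhotoLibrary.authorizationStatus() {
        case .authorized:
            onAuthorized()
        case .notDetermined:
            PHPhotoLibrary.requestAuthorization { status in
                if status == .authorized { onAuthorized() }
            }
        default:
            // Denied or restricted: offer a flow that works without photos.
            break
        }
    }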

Whilst the system Settings app affords users the right to revoke permissions, you should make it easier by including settings in your app that let them toggle and adjust individual permissions in an easy way. Most importantly, if the user does rescind permissions, they should not get a poorer experience, just a less identifiable one.
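
One lightweight way to support this is to deep link from your own settings screen to your app’s page in the Settings app, where every permission can be reviewed and revoked; a minimal sketch:

    import UIKit

    // From your app's own settings screen, deep link to the app's page in the
    // Settings app, where the user can review and revoke every permission.
    func openAppPermissionSettings() {
        guard let url = URL(string: UIApplication.openSettingsURLString),
              UIApplication.shared.canOpenURL(url) else { return }
        UIApplication.shared.open(url)
    }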

 

SECURITY & PROTECTION

Of course, it goes without saying that the information you take consensually should always be securely stored, through encryption, using best practices. If you are unable to provide strong encryption, don’t store personal information; it’s as simple as that. If there is a breach, as has been the case with numerous companies such as LinkedIn, reset users’ passwords immediately and notify them expeditiously.
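
As a starting point, iOS Data Protection will encrypt files at rest for you; the sketch below writes personal data with complete file protection (the helper and file name are illustrative, and credentials belong in the Keychain rather than in files).

    import Foundation

    // Write personal data with iOS Data Protection so the file is encrypted at
    // rest and only readable while the device is unlocked.
    func savePersonalData(_ data: Data, named fileName: String) throws {
        let documents = try FileManager.default.url(for: .documentDirectory,
                                                    in: .userDomainMask,
                                                    appropriateFor: nil,
                                                    create: true)
        let fileURL = documents.appendingPathComponent(fileName)
        try data.write(to: fileURL, options: .completeFileProtection)
    }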

USE LIMITATION

Finally, use limitation refers to only taking what you need, not what you want, and only when you need it. Understand what statistics you require and take only the attributes that produce them, rather than presenting a blanket Facebook-style permissions box that asks for everything in the hope that some of it may be useful to you.

Conclusion

Differential privacy is a highly theoretical mathematical concept that was devised only a few years ago, but with privacy front and center in people’s minds, Apple has decided to make it part of its platform for iOS 10 and beyond, despite there not being widespread proof that it works in practice. To safeguard user privacy, Apple has nevertheless placed a lot of trust and effort in pushing this initiative.

Through the privacy budget, the addition of noise, and the sampling of smaller subsets of distributions, this methodology promises to gather crowd information without distinguishing individual users.

This is a very interesting area, and with Google also expressing an interest in the theory, we may see it become the future industry standard. It may even reach the arms of government, with future legislation holding companies to stricter compliance. On a final note, I would recommend watching Apple’s Engineering Privacy for Your Users WWDC session, which provides an excellent insight into Apple’s take on the theory.

