Understanding Differential Privacy in Statistical Analysis


Gain insight into the concept of differential privacy in statistical analysis through key terminologies, foundational ideas, and practical examples. Explore the balance between data privacy and statistical quality, and learn how differential privacy serves as a mathematical guarantee to protect individual entities while ensuring statistical accuracy.





Presentation Transcript


  1. A Practitioner's Intro to Differential Privacy. ASA Government Statistics Section Professional Development Webinar, March 12, 2020. Matthew Graham, Center for Economic Studies, U.S. Census Bureau.

  2. Disclaimer: Any opinions and conclusions expressed herein are those of the author and do not represent the views of the U.S. Census Bureau. This presentation does not use any confidential data. Any mistakes herein are those of the author.

  3. Introduction

  4. Today's Webinar. What will we cover? Foundational ideas; terminology; a look at the basic math and its semantics; some solutions and worked examples. What won't we cover? Proofs; specific Census Bureau products.

  5. A Basic Setup. 1. You have a set of data. 2. You are required to protect some aspect(s) of the data. 3. You want/need to publish statistics from that data. 4. You (and customers!) want the statistics to be of good quality. (The slide also shows a small example data table with min/max statistics.)

  6. Key Questions. 1. Exactly what aspects of the data (entities or their characteristics) are you required to protect (by current law/policy/agreement)? 2. What does "good quality" mean to you? To the people who will consume the statistics? These two questions get at the central tension: privacy vs. quality.

  7. Differential Privacy Is... An algorithm? A guarantee? A proof? A work in progress? A useful tool? Counterintuitive? Confusing? To some extent, maybe all of these. Denoted as DP for the rest of this presentation.

  8. DP as a Guarantee. We want: published statistics and privacy for an individual entity. Then: statistics should not change much with/without the entity. And: it should be mathematically provable.

  9. Common Terminology. Database: any set of data (simple or complex) from which we are producing statistics to be protected. Neighbors: two databases that differ by a definable quantity (e.g., 1 record). Query: a statistic (e.g., a sum or a count) to be produced from the database, with the query response being its value. Sensitivity: a measure of the impact on a query resulting from a change in a database's content.

  10. Basic ε-DP

  11. Basic Setup for ε-DP. Two databases, D1 and D2, that are neighbors, i.e., different by 1 record: |D1\D2 ∪ D2\D1| = 1. A query, q. A private mechanism, M: when we apply M to a database, it answers q, as a, in a way that preserves privacy. Apply M to D1 and D2: how likely is it that we can determine which database our answer comes from?
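The neighbor relation above can be sketched in a few lines of Python (a minimal illustration with made-up record names, not code from the deck): two databases are neighbors when their symmetric difference is exactly one record.

```python
# Sketch of the neighbor relation |D1\D2 ∪ D2\D1| = 1, treating a
# database as a set of records (names are illustrative).
def are_neighbors(d1, d2):
    """True if the databases differ by exactly one record."""
    return len(set(d1) ^ set(d2)) == 1

assert are_neighbors({"ann", "bob", "carol"}, {"ann", "bob"})
assert not are_neighbors({"ann", "bob"}, {"ann", "carol"})
```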

  12. ε-Differential Privacy. Guarantee: a mechanism, M, satisfies ε-differential privacy if for all outputs S ⊆ range(M), and for all neighbors D1 and D2: Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]

  13. ε-Differential Privacy. Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]. Semantics: the probability that some output came from D1 should not be very different from the probability that it came from D2. SDL: the inferential disclosure risk is bounded by e^ε for all confidential data items in all possible data sets. Privacy-loss parameter: higher ε means a lower privacy guarantee. ε = 0: perfect privacy (no data!). ε = ∞: no privacy guarantee (risk!).

  14. Sensitivity. By how much can a single entity affect outputs? That "how much" is the ℓ1 sensitivity: Δq = max over neighbors D1, D2 of ||q(D1) − q(D2)||_1. Ensuring differential privacy generally requires adding noise scaled by the sensitivity.
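For a concrete feel for the definition, here is a brute-force sketch (not from the slides) that measures how much a query can change when any one record is dropped:

```python
# Brute-force l1 sensitivity of a query over a list of records:
# compare the database against every neighbor formed by dropping
# exactly one record (illustrative data, not from the deck).
def l1_sensitivity(query, db):
    """Max change in the query's answer when any one record is removed."""
    full = query(db)
    return max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))

records = [4.0, 1.324, 35.0, -2.333, 56.0, 12.091, 1.0, 6.777]

# A count changes by exactly 1 when a record is removed...
count_sens = l1_sensitivity(len, records)
# ...while a sum can change by as much as the largest-magnitude record.
sum_sens = l1_sensitivity(sum, records)
```

This is why a count query gets noise scaled by 1, while an unbounded sum needs noise scaled by the largest possible record.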

  15. Laplace Mechanism. But what is M? An example of an ε-differentially private mechanism is the Laplace mechanism: M(D, q) = q(D) + Lap(Δq/ε), where Lap(b) is a sample from the Laplace distribution with scale = b and location = 0. $10M question: does this meet our needs? Quality!
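The mechanism can be sketched in a few lines of standard-library Python (using the fact that the difference of two iid exponentials is Laplace-distributed; function names are illustrative, not from the dpdemo repository):

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two iid Exponential(1/scale) draws is
    # distributed as Laplace(location=0, scale=scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    # M(D, q) = q(D) + Lap(sensitivity / epsilon)
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Protect a count (sensitivity 1) at epsilon = 0.5.
noisy_count = laplace_mechanism(42, sensitivity=1, epsilon=0.5)
```

Smaller ε means a larger noise scale, which is exactly the privacy/quality tradeoff the slide's "$10M question" is pointing at.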

  16. Data Quality. Judge by its error: a measure of the difference between the confidential true value and a releasable noisy/protected value. One measure is the ℓ1 error: e1 = |x − x̃|, where x is the true cell value and x̃ is the noisy cell value. For a published dataset/table, the total error sums over cells i: e_total = Σ_i |x_i − x̃_i|. We may prefer to look at the relative error: RE = Σ_i |x_i − x̃_i| / Σ_i x_i (for x_i > 0).
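The three error measures above, computed for a small hypothetical four-cell table (the values are made up for illustration):

```python
# Per-cell l1 error, total error, and relative error for a small
# hypothetical table of true vs. noisy counts.
true_cells = [120.0, 45.0, 30.0, 5.0]
noisy_cells = [118.2, 47.1, 29.4, 6.3]

l1_errors = [abs(t - n) for t, n in zip(true_cells, noisy_cells)]
total_error = sum(l1_errors)                    # e_total = sum of |x - x~|
relative_error = total_error / sum(true_cells)  # RE, scaled by true totals
```

Relative error is often more useful than total error because the same amount of noise hurts a cell of 5 far more than a cell of 120.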

  17. Getting a Sense of Privacy Loss. True count: x. Noise: n ~ Lap(Δq/ε). Noisy count: x̃ = x + n. Error: e1 = |x − x̃| = |x − (x + n)| = |n|. Setting Δq = 1 and misusing the notation slightly: e1 = |Lap(1/ε)|. Now sample many times from the Laplace for different values of ε and see what we get.
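The sampling exercise the slide describes might look like the following sketch (this is not the actual notebook code from dpdemo): draw many |Lap(1/ε)| errors per ε and compare their means.

```python
import random

rng = random.Random(2020)

def lap(scale):
    # Difference of two iid exponentials ~ Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

# For a count query (sensitivity 1), the l1 error is |Lap(1/eps)|;
# its mean is 1/eps, so smaller eps (more privacy) means more error.
mean_error = {}
for eps in [0.1, 0.5, 1.0, 2.0]:
    draws = [abs(lap(1 / eps)) for _ in range(10000)]
    mean_error[eps] = sum(draws) / len(draws)
```

Plotting `mean_error` against ε reproduces the shape of the privacy-loss-vs-quality curve discussed on the next slide.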

  18. Privacy Loss vs. Quality. This plot is the output of a simulation of ε-DP with the Laplace mechanism, specifically looking at how much error we get for different values of ε.

  19. Worked Examples. All examples are available at: https://github.com/mwerevu/dpdemo — To the Jupyter Notebook!

  20. Beyond the Basics

  21. Parallel Composition. Can the database (and the queries you pose) be separated into disjoint sets? E.g., from a database D of Sneetches: count of star-bellied Sneetches? Count of plain-bellied Sneetches? Then each query can use the same ε without having an impact on the others. And we can just think of the query as a request for a table that has mutually exclusive cells from the original database: a Belly table with one count per belly type (star, plain).
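A sketch of parallel composition on a hypothetical Sneetch database (the data and function names are made up for illustration): each record lands in exactly one cell of the histogram, so every cell can be released at the same ε.

```python
import random
from collections import Counter

rng = random.Random(0)

def lap(scale):
    # Difference of two iid exponentials ~ Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_histogram(records, epsilon):
    # Removing one record changes exactly one cell by 1, so each
    # cell has sensitivity 1 and the cells compose in parallel.
    return {cell: n + lap(1 / epsilon) for cell, n in Counter(records).items()}

sneetches = ["star"] * 30 + ["plain"] * 20  # hypothetical data
belly_table = noisy_histogram(sneetches, epsilon=1.0)
```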

  22. Sequential Composition. But now we want to publish an Adult/Child count for the Sneetches as well, alongside the Belly table. In this case, each record (Sneetch in the database) contributes to 2 published cells and we must account for that with our privacy parameters, which will sum: 1ε for the Belly table and 1ε for the Age table. Altogether we can think of this approach as having a privacy loss of 2ε.

  23. Privacy-Loss Budgets. Keeping track of how much ε we are using is the job of the privacy-loss budget. The key is to understand the total (possible) privacy loss associated with all the releases that are planned. Additionally, this limit should focus the design of the protection system on optimizing the most important parts of the system. Option A: release each table separately (Initial Bellies, Age, Height, Beak at ε each; Revised Bellies, Age, Height, Beak at ε each), 8ε total. Option B: release the cross-tabulations (Initial Bellies x Age x Height x Beak at 4ε; Revised Bellies x Age x Height x Beak at 4ε), 8ε total. How do we choose between these options and others? Quality!
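A minimal privacy-loss accountant for a release plan like the one above, assuming pure ε-DP where sequential releases simply add their epsilons (the class and release names are illustrative, not from the deck):

```python
class PrivacyBudget:
    """Track epsilon spent across planned releases under pure eps-DP."""

    def __init__(self, total):
        self.total = total
        self.spent = 0.0

    def charge(self, release, epsilon):
        # Sequential composition: epsilons add; refuse to overspend.
        if self.spent + epsilon > self.total:
            raise ValueError(f"release {release!r} would exceed the budget")
        self.spent += epsilon

budget = PrivacyBudget(total=8.0)
for table in ["Bellies", "Age", "Height", "Beak"]:
    budget.charge(f"Initial {table}", 1.0)  # four tables at eps each
for table in ["Bellies", "Age", "Height", "Beak"]:
    budget.charge(f"Revised {table}", 1.0)  # revision round: four more
```

Making the accountant refuse over-budget releases is what forces the "design-time" planning the next slide argues for.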

  24. Tradeoffs. We need to decide where we want to use our limited privacy-loss budget. Which do we prefer? Detailed statistics (Bellies x Age x Height x Beak) or totals and subtotals (Total, Bellies, Age, Height, Beak)? Save a little budget to create interior cells, or rake/publicly impute from the margins. Add up noisy cells to get noisy totals/subtotals. But we also know from the literature (Seuss, 1953) that Sneetches can change (star-bellied to plain-bellied and back), so set aside some budget to be used for later tabulations on the same population? Whatever we choose, we should be considering our full publication strategy as we design the protection system. That way we can evaluate all the tradeoffs at design time instead of getting stuck with an unhappy compromise later.

  25. Wrap-Up

  26. Summary. DP is a guarantee based on the data, the desired queries, the required protections, and the chosen protection mechanism. DP mechanisms use parameters like ε to adjust the tradeoff between the level of privacy loss and data quality. Query sensitivity and composition are important features to understand when developing a DP solution. Start early. Include multiple coordinated reviews (legal, policy, theory, quality, usability). Have a publication strategy that plans for future releases.

  27. References
  Abowd et al. Introductory Readings in Formal Privacy for Economists. https://labordynamicsinstitute.github.io/privacy-bibliography/index.html
  Abowd and Schmutte. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. https://ideas.repec.org/p/cen/wpaper/18-35.html
  D'Orazio et al. Differential Privacy for Social Science Inference. https://pdfs.semanticscholar.org/1975/708226a5b90f9fedc33891b5e43e335fbe95.pdf
  Dwork. A Firm Foundation for Private Data Analysis. https://www.microsoft.com/en-us/research/publication/a-firm-foundation-for-private-data-analysis/
  Dwork and Roth. The Algorithmic Foundations of Differential Privacy. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
  Machanavajjhala et al. Differential Privacy in the Wild: A tutorial on current practices & open challenges. http://www.vldb.org/pvldb/vol9/p1611-machanavajjhala.pdf
  Page et al. Differential privacy: an introduction for statistical agencies. https://gss.civilservice.gov.uk/wp-content/uploads/2018/12/12-12-18_FINAL_Privitar_Kobbi_Nissim_article.pdf
  Wood et al. Differential Privacy: A Primer for a Non-Technical Audience. http://www.jetlaw.org/wp-content/uploads/2018/12/4_Wood_Final.pdf

  28. Thank You. matthew.graham@census.gov. Access to Python examples: https://github.com/mwerevu/dpdemo
