Adapting Surveys for Formal Privacy: Current Trends and Challenges
This presentation by Aref Dajani discusses the importance of managing disclosure risk and protecting respondent confidentiality in surveys. It explores the implementation of formal privacy measures, challenges faced, and the transition process. Key topics include data ethics, data stewardship values, and a SWOT analysis highlighting the need for mathematically provable guarantees in privacy protection. Collaboration with international experts and a dedicated team is emphasized for successful outcomes.
Presentation Transcript
Adapting Surveys for Formal Privacy: Where We Are, Where We Are Going
Aref Dajani, Lead, Innovation and Review Group, Center for Enterprise Dissemination Disclosure Avoidance, U.S. Census Bureau
Federal Committee on Statistical Methodology Research and Methodology Conference, November 3, 2021
This presentation is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed on statistical, technical, or operational issues are those of the author and not necessarily those of the U.S. Census Bureau. Slide 1 of 24
A Clarification and Acknowledgments There are several definitions of the word privacy in the literature. In this presentation, privacy refers to managing disclosure risk and protecting the confidentiality of survey respondents. General information about privacy is documented after the last slide of this slideshow: Page 25 of 24. Sincere thanks to my hard-working colleagues in the Center for Enterprise Dissemination Disclosure Avoidance. Many of them, too many to list, offered invaluable feedback on earlier drafts of this presentation, including my dry run. Continued thanks to a large team of international experts in formal privacy with whom we collaborate. It takes a village; one even flies his own plane to Washington to meet with us! Slide 2 of 24
Roadmap of Presentation
- Context for the need for formal privacy
- Implementation of formal privacy: 2008-present
- Challenges that lie ahead
- Managing the transition to formal privacy for our surveys
- Wrap up
Slide 3 of 24
Data Ethics and Data Stewardship Values are agreed norms that allow teams to achieve a shared vision. Ethics are rules, regulations, and affirmations that allow individuals to work in accordance with established principles and guidelines. Integrity is being compliant with established values and ethics within an organization: doing the right thing when no one is looking. Data Ethics describe a code of behavior to use when collecting, processing, cleaning, wrangling, analyzing, and disseminating data. According to the Data Governance Institute: Data Stewardship is concerned with taking care of data assets that do not belong to the stewards themselves. Data Stewards represent the concerns of others. Some may represent the needs of the entire organization. Slide 4 of 24
Strength, Weakness, Opportunity, and Threat (SWOT) Analysis -- Presented in TSWO Order Threat: Savvy outside actors with high-speed computing power and access to external microdata with personally identifiable information (PII) make it increasingly difficult to protect our data. Strength: The Census Bureau has implemented legacy methods of privacy protection for several decades. All information products undergo disclosure review before release. Weakness: Legacy methods do not quantify disclosure risk. Legacy methods tend to make very strong assumptions about attackers or are ambiguous about the assumptions they've made. Opportunity: Formal privacy offers mathematically provable guarantees that quantify disclosure risk. This allows us to disseminate results with the granularity that our stakeholders seek. This also allows analysts to account for disclosure protection in their data analysis and inference. Slide 5 of 24
Threats to Privacy Protection (Slide 1 of 2) There are many possible ways to uncover protected confidential data from disseminated information products. We do not know all the ways that external intruders conduct privacy attacks against our data now, or ways they may conduct privacy attacks in the future. The following slide lists three common threats we know about that outside agents use to engage in privacy attacks. Slide 6 of 24
Threats to Privacy Protection (Slide 2 of 2) (1) Database reconstruction: Computational efficiency makes it easier to reconstruct protected microdata records from published summaries that were generated from protected microdata. (2) Data re-identification: The advent of big data makes it easier to link public use microdata to external data using non-protected information in common, thereby re-identifying respondents whose identity we are trying to protect. (3) Differencing attacks, aka disclosure by subtraction: From published tables from protected microdata, one can subtract rows or columns to obtain disclosive slivers of protected information, whether the slivers are longitudinal, geographic, or demographic. Slide 7 of 24
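To make the third threat concrete, here is a toy sketch of a differencing attack. The numbers, geography, and attribute are entirely hypothetical; the point is only that subtracting two published counts can isolate a single respondent.

```python
# Toy illustration of disclosure by subtraction (all numbers are hypothetical).
# Two published counts differ only in whether one known person is in scope;
# subtracting them isolates that person's protected attribute.

count_all_ages = 17     # published: persons in a tract with a sensitive attribute, all ages
count_ages_20_64 = 16   # published: same tract and attribute, restricted to ages 20-64

# Suppose an attacker knows the only tract resident outside ages 20-64
# is their 70-year-old neighbor. The difference is a one-person "sliver."
neighbor_has_attribute = (count_all_ages - count_ages_20_64) == 1
print(neighbor_has_attribute)  # True -> the neighbor's attribute is disclosed
```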
Example of a Re-Identification Threat (Slide 1 of 2) Latanya Sweeney currently serves as Director of the Public Interest Tech Lab at Harvard. While a graduate student at MIT, she re-identified medical information about then-Massachusetts Governor William Weld by consolidating four sources of information: 1) A newspaper report that he suffered trauma and was being treated in a local hospital. 2) Publicly available knowledge about his age, race, and sex. 3) Voter information, also publicly available, that confirmed his identity in Cambridge, Massachusetts, where he lived, and that his age, race, and sex were unique to him, and (continued on next slide) Slide 8 of 24
Example of a Re-Identification Threat (Slide 2 of 2) Fourth source of information: 4) A hospital discharge summary, de-identified by not reporting clearly personally identifiable information (PII) such as name or address. His age, race, sex, and ZIP code were not suppressed. The hospital released records under the Safe Harbor provision in the Health Insurance Portability and Accountability Act of 1996 (HIPAA) to protect health records nationwide. Regrettably, it did not work in this case. The voter registration database was critical as he was (obviously) a registered voter. His demographics were unique in the registered voter database for Cambridge, Massachusetts. His demographics might not have been unique across all residents of Cambridge, Massachusetts. Slide 9 of 24
Findings from Data Re-Identification Attacks Rocher, L., Hendrickx, J.M. & de Montjoye, Y.A., Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications 10, 3069 (2019). Assertion: "[W]e find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes." Hayley Tsukayama, Electronic Frontier Foundation, as quoted in an article in the Washington Post dated September 26, 2021. Assertion: "de-identified personal data is an oxymoron." Response: All information products from Census Bureau surveys are released with privacy protections beyond the removal of immediate identifiers. Slide 10 of 24
Findings from Database Reconstruction Attacks State of Alabama v. United States Department of Commerce, Supplemental Declaration of John M. Abowd, April 13, 2021. https://www2.census.gov/about/policies/foia/records/alabama-vs-doc/abowd-supp-declaration.pdf Finding #1: While only 1.12% of persons in large blocks (1,000+ persons) are unique on block, sex, and age, this percentage leaps to 95.06% for the smallest blocks (< 10 persons). [Table 1, Page 17] Finding #2: In a reconstruction-abetted re-identification attack on the 2010 Census using commercial data (combined assets of four suppliers, unduplicated), we confirmed our putative (suspected) re-identifications for the largest blocks at a rate of 20.93%. This rate jumps to 72.24% for the smallest blocks. [Table 2, Page 20] Finding #3: When the source of the above attack is the 2010 Census Edited File instead of commercial data, the rates are 52.59% for the largest blocks and 96.98% for the smallest blocks. Response: All data providers are challenged to responsibly disseminate information products. Slide 11 of 24
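For readers who want to see how a uniqueness rate like the one in Finding #1 might be computed, here is a minimal sketch on synthetic data. The data frame, column names, and values are assumptions for illustration only; this is not the Bureau's measurement code.

```python
# Minimal sketch: share of records unique on quasi-identifiers (block, sex, age).
# Synthetic data for illustration only; not Census data or production code.
import pandas as pd

df = pd.DataFrame({
    "block": ["A", "A", "A", "B", "B", "C"],
    "sex":   ["F", "M", "M", "F", "F", "M"],
    "age":   [34, 41, 41, 67, 23, 15],
})

quasi_identifiers = ["block", "sex", "age"]
cell_sizes = df.groupby(quasi_identifiers).size()      # records per (block, sex, age) cell
unique_rate = (cell_sizes == 1).sum() / len(df)        # share of records unique in their cell
print(f"{unique_rate:.1%} of records are unique on {quasi_identifiers}")
```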
Strengths: Legacy Disclosure Methods that Protect Census Bureau Surveys A sampling of legacy methods:
- Cell Size, Geographic Population, and Economic Data Thresholds
- Controlling for Implicit Samples ("slivers" in data)
- Cell Suppression
- Collapsing and Recoding
- Rounding
- Top- and bottom-coding
- Data swapping
- Data synthesis
- Noise injection
Individually and together, these methods protect against specific types of attackers, but they do not formally manage that risk. "Formal" emphasizes mathematical proofs that hold in very general settings. Slide 12 of 24
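As a quick illustration of what a few of these legacy methods do mechanically, here is a minimal sketch of rounding, top-coding, and threshold-based cell suppression. All cutoffs are illustrative assumptions, not Census Bureau production rules.

```python
# Minimal sketches of three legacy methods named above. Every cutoff here is an
# illustrative assumption, not an official Census Bureau parameter.

def round_to_base(count, base=5):
    """Rounding: publish counts only to the nearest multiple of a base."""
    return base * round(count / base)

def top_code(value, ceiling=500_000):
    """Top-coding: report any value above a ceiling as the ceiling itself."""
    return min(value, ceiling)

def suppress_small_cell(count, threshold=3):
    """Cell suppression: withhold cells below a publication threshold."""
    return count if count >= threshold else None

print(round_to_base(23))        # 25
print(top_code(1_250_000))      # 500000
print(suppress_small_cell(2))   # None (suppressed)
```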
Weaknesses and Opportunities (Recap) Weaknesses, as restated from before: Legacy methods do not quantify disclosure risk. Legacy methods tend to make very strong assumptions about attackers or are ambiguous about the assumptions they've made. Opportunities, restated from before: Formal privacy offers mathematically provable guarantees that quantify disclosure risk. Formal privacy allows us to disseminate results with the granularity that our stakeholders seek. Formal privacy also allows analysts to account for disclosure protection in their data analysis and inference. Slide 13 of 24
Legacy Noise Injection (Slide 1 of 2) Legacy noise injection has been used to hide very unusual characteristics of a person or household at a given point in time that are not caught by population or establishment threshold rules. Consider a married couple where each person is over 90 years old, a person who gave birth to 7 children at one time, or a person who is a practicing physician at the age of 15. All are very unusual circumstances that would probably be in the news. Targeted individuals may have their ages randomly perturbed within pre-specified cutoffs. Slide 14 of 24
Legacy Noise Injection (Slide 2 of 2) Noise is also used in longitudinal files to hide a change in a personal or household circumstance that could be found in publicly available records, for example, a birth, death, marriage, or divorce which would be reflected in a longitudinal microdata file. With repeated attacks and the additional information obtained throughout the panel, the disclosure risk for any sampled individual grows. EZS Noise Addition (Evans, Zayatz, Slanta) applies multiplicative noise to magnitude data, typically for economic information products, to avoid cell suppression in tables. Slide 15 of 24
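Here is a simplified sketch of multiplicative noise for magnitude data, in the spirit of EZS noise addition. The factor distribution and perturbation bounds below are illustrative assumptions, not the production EZS specification.

```python
# Simplified sketch of multiplicative noise for magnitude data. The factor
# distribution and bounds are illustrative assumptions, not the EZS production
# specification (Evans, Zayatz, Slanta).
import random

def multiplicative_noise(value, min_factor=0.90, max_factor=1.10, gap=0.02):
    """Multiply a magnitude by a random factor kept away from 1.

    Drawing factors from [min_factor, 1 - gap] or [1 + gap, max_factor]
    guarantees every published magnitude is perturbed by at least `gap`.
    """
    if random.random() < 0.5:
        factor = random.uniform(min_factor, 1.0 - gap)
    else:
        factor = random.uniform(1.0 + gap, max_factor)
    return value * factor

print(multiplicative_noise(1_200_000.0))  # e.g., receipts for an establishment
```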
Formal and Differential Privacy (Slide 1 of 2) Formal privacy requires a framework for developing a mathematical definition of privacy. Examples are Differential Privacy (DP), Pufferfish, and Blowfish. Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Theory of Cryptography Conference (2006). Kifer, Daniel and Ashwin Machanavajjhala, Pufferfish: A Framework for Mathematical Privacy Definitions, ACM Transactions on Database Systems, 39(1) (2014). He, Xi, Ashwin Machanavajjhala, and Bolin Ding, Blowfish Privacy: Tuning Privacy-Utility Trade-Offs Using Policies, SIGMOD 2014. Again, "formal" emphasizes mathematical proofs that hold in very general settings. New definitions of privacy may be developed to fit into this model. Formal privacy protects data from a wide variety of attackers, now and into the future. DP is a formal privacy framework that requires a privacy parameter or parameters that are related to the bound on privacy loss. It allows a meaningful tradeoff (a "sweet spot") between privacy and accuracy to be considered. Slide 16 of 24
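The Dwork et al. paper cited above calibrates noise to a query's sensitivity. A minimal sketch of that idea for a counting query follows; the epsilon values and the count are illustrative choices, not recommendations.

```python
# Minimal sketch of the Laplace mechanism (Dwork, McSherry, Nissim, Smith 2006).
# For a counting query, adding or removing one record changes the answer by at
# most 1 (sensitivity = 1), so Laplace noise with scale sensitivity/epsilon
# yields epsilon-differential privacy. The epsilon values are illustrative only.
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Return a differentially private count via the Laplace mechanism."""
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

true_count = 132  # e.g., persons with some characteristic in one tabulation cell
print(dp_count(true_count, epsilon=1.0))  # less noise: more accuracy, more privacy loss
print(dp_count(true_count, epsilon=0.1))  # more noise: less accuracy, less privacy loss
```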
Formal and Differential Privacy (Slide 2 of 2) DP was first implemented at the Census Bureau in 2008 for the Longitudinal Employer-Household Dynamics (LEHD) program (residential side only). Machanavajjhala, Ashwin, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber, Privacy: Theory Meets Practice on the Map, IEEE International Conference on Data Engineering (2008). The 2020 Decennial Census used a process called the TopDown Algorithm (TDA), where the global tradeoff was subdivided -- and noise injected -- at several levels of geography sequentially. Relatively more noise could be injected at higher or lower levels of geography. Garfinkel, Simson, Implementing Differential Privacy for the 2020 Census, USENIX Enigma (2021). Slide 17 of 24
DP is a strong guarantee DP bounds the change in the inference made about a person or establishment, whether that person/establishment chooses to participate in the survey and is included in the data or not. The bound is quantified by a privacy loss parameter or parameters. Depending on how the parameter or parameters are set: We can yield more accurate data as we inject a small amount of noise into the data. We can yield more private data as we inject a larger amount of noise into the data. An algorithm operating on a private database of records satisfies differential privacy if the addition or removal of a single record in a database has a bounded impact on the probability distribution of outputs. If the total number of records in the database is known, then DP operates by replacing a single record with an arbitrary replacement record. Slide 18 of 24
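The bound described above is usually written as the standard epsilon-differential-privacy inequality. The formula below is the textbook statement of that guarantee, added here for reference; it is not reproduced from the slides.

```latex
% Standard definition of \varepsilon-differential privacy (pure DP, no \delta):
% a randomized algorithm M satisfies \varepsilon-DP if, for all databases D and D'
% differing in a single record and for every set S of possible outputs,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].
\]
% Smaller \varepsilon forces the two output distributions to be closer
% (more noise, stronger privacy); larger \varepsilon permits greater accuracy.
```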
Obstacles overcome in 2020. For the 2020 Census, we managed invariants at two levels of geography: At the state level: population counts. Demographics for persons were not kept invariant. At the block level: the number of housing units and the number of occupied group quarters by their group quarter type. The number of persons in individual housing units and the number of people in individual group quarters were not kept invariant. The noise injection process distinguished between sampling zeroes and structural zeroes. Many sampling zeroes can occur. Example: Native Hawaiian Hispanics in rural Vermont. Structural zeroes cannot occur, such as three-year-old grandmothers. Sampling zeroes were noise-injected, with a nonzero probability of becoming nonzero. Structural zeroes were not noise-injected, leaving their counts as they were: zeroes. The requirement for a microdata file required post-processing to remove negative and non-integer counts and to enforce hierarchical consistency for certain geographies. For example, noise-injected county population counts summed to invariant state population counts. Slide 19 of 24
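To make the post-processing requirement concrete, here is a minimal sketch that turns noisy county counts into nonnegative integers summing to an invariant state total. The 2020 TopDown Algorithm uses constrained optimization for this step; the greedy adjustment and the numbers below are only an illustration of the consistency requirement, under assumptions of my own.

```python
# Sketch: post-process noise-injected county counts into nonnegative integers
# that sum to an invariant state total. The TopDown Algorithm solves this with
# constrained optimization; this greedy version and its numbers are illustrative.
import numpy as np

def postprocess(noisy_counts, invariant_total):
    counts = np.maximum(np.round(noisy_counts), 0).astype(int)  # nonnegative integers
    gap = invariant_total - counts.sum()
    step = 1 if gap > 0 else -1
    i = 0
    while gap != 0:                       # spread the remaining difference around
        j = i % len(counts)
        if step > 0 or counts[j] > 0:     # never push a count below zero
            counts[j] += step
            gap -= step
        i += 1
    return counts

noisy = np.array([101.7, -2.3, 48.9, 252.4])    # noise-injected county counts
print(postprocess(noisy, invariant_total=400))  # sums exactly to the state total
```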
Where the work continues for our surveys Our collaboration is with international experts in formal privacy. We are researching how to implement formal privacy effectively for sample surveys. Sampling weights are a challenge because one person included in or excluded from a sample represents more than one person. Longitudinal panel data are a challenge because consistency across databases needs to be considered, especially when disseminating microdata that can be bridged across the panel. When one adds or removes a person, this impacts more than a single observation or a single statistic. It will impact household relationships over time: over the length of the panel. Magnitude data are a challenge because magnitude data are unbounded and highly skewed. Slide 20 of 24
We are using the GASP Rule as we develop the science and transition to formal privacy. (Slide 1 of 2) How are we managing the transition to formal privacy for our surveys? On the one hand, we have threats we need to counter. On the other hand, we are continuing to develop the science to counter those threats. GASPs -- Geographic Areas with Small Population -- are sub-national geographies with populations smaller than that of the least populous U.S. Congressional District at the time that the data were collected (the most recent collection when in a time series). These geographic areas can be contiguous or non-contiguous. When disseminating information products, legacy methods can be used when geographies are larger than GASPs. For GASPs, noise injection methods are required, unless affected programs are on an approved exemption list. Slide 21 of 24
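A small sketch of the decision logic described above follows. The population threshold and the exemption list are placeholders of my own, not official figures or Census Bureau policy artifacts.

```python
# Sketch of the GASP Rule decision described above. The threshold is a
# placeholder: in practice it is the population of the least populous U.S.
# Congressional District at the time the data were collected, and exemptions
# come from an approved list. Both values below are hypothetical.
LEAST_POPULOUS_DISTRICT = 500_000             # placeholder value, not official
EXEMPT_PROGRAMS = {"example-exempt-program"}  # hypothetical exemption list

def required_protection(geography_population, program):
    """Return which class of disclosure avoidance the GASP Rule calls for."""
    if geography_population >= LEAST_POPULOUS_DISTRICT:
        return "legacy methods permitted"
    if program in EXEMPT_PROGRAMS:
        return "legacy methods permitted (approved exemption)"
    return "noise injection required (GASP)"

print(required_protection(12_000, "demo-survey"))   # noise injection required (GASP)
print(required_protection(900_000, "demo-survey"))  # legacy methods permitted
```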
We are using the GASP Rule as we develop the science and transition to formal privacy. (Slide 2 of 2) Using the GASP Rule encourages programs disseminating information products with the greatest disclosure risk -- those with small populations -- to be early adopters of noise injection methods. It also gives programs that adopt noise injection methods the opportunity to disseminate information products for small geographies. Slide 22 of 24
Wrap up! It is not possible to responsibly disseminate information products from any survey or census with no disclosure risk. We are transitioning our current surveys to formally manage the risk. This will protect our data against a wide variety of attackers, now and into the future. As we collaboratively develop the science, we are transitioning from legacy methods to formally private methods for all information products disseminated from our surveys. Implementing formally private noise injection for Census Bureau surveys is both a challenge and an opportunity. As the science is still developing, the implementation date has not yet been set. Implementation will magnify our strengths as we address and counter real threats to the privacy of the information products that we disseminate for our surveys at the Census Bureau. The presentations that follow provide more details on current research. Slide 23 of 24
Stay tuned, and thanks! Aref Dajani: (301) 763-1797 aref.n.dajani@census.gov Slide 24 of 24
Privacy vs. Confidentiality The terms "privacy" and "confidentiality" are related, but technically distinct. Generally speaking, protecting privacy entails adherence to the full suite of Fair Information Practice Principles, and includes elements of collection and use limitation, purpose specification, and openness, among others. Confidentiality protection, more specifically, is a component of protecting privacy, and typically refers to the protection of data against unauthorized disclosure, access, or use. In the statistical and technical communities, however, privacy protection often refers specifically to the various statistical disclosure limitation methods used to protect the confidentiality of individuals' data. It is this latter conception of privacy protection, specifically statistical safeguards against disclosure, that I will be using throughout this declaration when using the generic term privacy. And it is this conception of privacy protection which, for the Census Bureau, includes the methods the Bureau implements to protect the confidentiality of the census data covered by the confidentiality provisions of 13 U.S.C. §§ 8(b) and 9. (excerpted from the Second Declaration of John M. Abowd, Fair Lines America Foundation Inc. v. U.S. Department of Commerce, fn8) Slide 25 of 24