Understanding the Trade-off between Data Utility and Disclosure Risk


This presentation explores the trade-off between data utility and disclosure risk using a genetic-algorithm (GA) synthetic data generator. The authors discuss how to measure utility and risk, with a focus on structured categorical data, define synthetic data, and compare the GA output with other synthesis methods.





Presentation Transcript


  1. The Trade-off between Data Utility and Disclosure Risk (using a GA Synthetic Data Generator) Yingrui Chen, Jennifer Taub, Mark Elliot The University of Manchester

  2. Acknowledgements Gillian Raab (Edinburgh), Anne-Sophie Charest (Laval), Cong Chen (Public Health England), Christine M. O'Keefe (CSIRO), Michelle Pistner Nixon (Penn State), Joshua Snoke (RAND), Aleksandra Slavković (Penn State), Duncan Smith (University of Manchester), Joe Sakshaug (IAB)

  3. Outline Measuring Utility; Measuring Risk; The Trade-off. NB: focus on structured categorical data.

  4. What is synthetic data? [Figure: a spectrum running from the Original Data (a fully saturated model, carrying non-negligible identification and attribution risk) to Random Data (pure noise, with negligible identification risk and only empirical attribution risk). Useable synthetic / disclosure-controlled data sits between the two, in the zone of plausible inference.]

  5.-8. [Figure-only slides; the images are not captured in the transcript.]

  9. Utility

  10. Purdam and Elliot (2007): Information Utility "The loss of analytical validity as occurring when a disclosure control method has changed a dataset to the point at which a user reaches a different conclusion from the same analysis" (p. 1102). Every statistical property can be considered as an element of utility. Utility is considered as the objective of the optimising program, and objectives should be measurable and comparable.

  11. Measuring Data Utility - Narrow Measures: Frequency Tables and Cross-Tabulations; Ratio of Counts (ROC); Confidence Interval Overlap (CIO); Regression Models (OLS and logistic regression models compared using CIO).
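As a rough illustration of the narrow measures, the ROC and CIO can be sketched as below. The names `ratio_of_counts` and `ci_overlap` are hypothetical helpers of ours, and exact definitions of both measures vary across the literature; this is a sketch, not the authors' implementation.

```python
import numpy as np

def ratio_of_counts(orig_counts, synth_counts):
    """Ratio of Counts (ROC) sketch: per-cell min/max ratio of original
    vs synthetic frequencies, averaged over all cells."""
    orig = np.asarray(orig_counts, dtype=float)
    synth = np.asarray(synth_counts, dtype=float)
    hi = np.maximum(orig, synth)
    lo = np.minimum(orig, synth)
    # cells that are empty in both datasets count as perfect agreement
    ratios = np.where(hi > 0, lo / np.where(hi > 0, hi, 1.0), 1.0)
    return ratios.mean()

def ci_overlap(ci_orig, ci_synth):
    """Confidence Interval Overlap (CIO) sketch for one estimate: the
    average of the overlap expressed as a share of each interval."""
    lo = max(ci_orig[0], ci_synth[0])
    hi = min(ci_orig[1], ci_synth[1])
    overlap = max(0.0, hi - lo)
    return 0.5 * (overlap / (ci_orig[1] - ci_orig[0])
                  + overlap / (ci_synth[1] - ci_synth[0]))
```

Identical intervals give a CIO of 1, disjoint intervals give 0, so regression results from the synthetic data can be scored against the original fit coefficient by coefficient.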

  12. Data Utility - Broad Measures: Multiple Correspondence Analysis (MCA) - the two maps (synthetic and original) are compared using Euclidean distance. Propensity Score - the original and synthetic datasets are combined into a logistic regression model, which estimates the probability of each record being synthetic.
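The propensity-score idea can be sketched in Python. The `pmse` name is a hypothetical helper of ours, and it fits a plain gradient-descent logistic regression in numpy rather than any particular package; it is a sketch of the technique, not the authors' code.

```python
import numpy as np

def pmse(original, synthetic, steps=500, lr=0.1):
    """Propensity-score mean-squared error (pMSE) sketch: stack the two
    datasets, fit a logistic regression predicting 'record is synthetic',
    and measure how far the fitted propensities stray from the expected
    share c = n_synth / N.  Lower is better; indistinguishable datasets
    give a pMSE close to 0."""
    X = np.vstack([original, synthetic])
    X = np.hstack([np.ones((len(X), 1)), X])      # intercept column
    y = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    c = y.mean()
    w = np.zeros(X.shape[1])
    for _ in range(steps):                        # simple gradient descent
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
    return np.mean((p - c) ** 2)
```

When the synthetic data is a copy of the original the model cannot separate the two stacks, every propensity sits at c, and the pMSE is essentially zero; easily separable data pushes the score towards 0.25.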

  13. Data Utility - Broad Measures Here we use the full contingency table to capture the multivariate structure of all variables in the data. The distance between candidates and the original data is calculated by the Jensen-Shannon divergence. Suppose P and Q are two discrete probability distributions; then D_JS(P‖Q) is defined by D_JS(P‖Q) = (1/2) D_KL(P‖M) + (1/2) D_KL(Q‖M), where M = (1/2)(P + Q) and D_KL is the Kullback-Leibler divergence. The utility objective of the GA is to minimise D_JS between the full contingency tables of the synthetic and original data.
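A minimal numpy sketch of the Jensen-Shannon divergence as defined on this slide (the `js_divergence` name is ours; in the GA the inputs would be the flattened, normalised full contingency tables of the two datasets):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions:
    D_JS(P||Q) = 1/2 D_KL(P||M) + 1/2 D_KL(Q||M), with M = (P + Q) / 2.
    Natural log, so 0 <= D_JS <= ln 2."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                 # 0 * log(0 / x) = 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the raw Kullback-Leibler divergence, D_JS is symmetric and always finite, which is what makes it usable as a GA objective over sparse contingency tables.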

  14. Measuring Risk

  15. Disclosure Risk Identification disclosure risk: a dataset carries identification risk if a data subject can be re-identified from the dataset. Attribute disclosure risk: a dataset carries attribution risk if sensitive information about any population unit can be inferred from the dataset, either deterministically or probabilistically.

  16. Differential Correct Attribution Probability Taub et al. (2018) introduced a measure for the disclosure risk of synthetic data called the Differential Correct Attribution Probability (DCAP), which is built on a Correct Attribution Probability (CAP) score: CAP_{o,j} = Pr(T_{o,j} | K_{o,j}) = ( Σ_{i=1}^{n} [T_{o,i} = T_{o,j}, K_{o,i} = K_{o,j}] ) / ( Σ_{i=1}^{n} [K_{o,i} = K_{o,j}] ), where the subscripts o and s denote the original and synthetic data, [·] are Iverson brackets, n is the number of records, i is the index of a case, d_o is the original data, and K_o and T_o are vectors of key and target information.
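The CAP score above can be sketched as follows. The `cap_score` name is a hypothetical helper of ours, and scoring an empty equivalence class as 0 is an assumption; published variants handle that case differently.

```python
import numpy as np

def cap_score(keys, targets, key_j, target_j):
    """Correct Attribution Probability for one record j: among records
    whose key equals key_j, the fraction whose target also equals
    target_j, i.e. the empirical Pr(T = target_j | K = key_j)."""
    keys = np.asarray(keys)
    targets = np.asarray(targets)
    # multi-variable keys are compared row-wise, scalar keys directly
    match_key = (keys == key_j).all(axis=1) if keys.ndim > 1 else keys == key_j
    denom = match_key.sum()
    if denom == 0:
        return 0.0      # assumed convention: no matching key, score 0
    return ((targets == target_j) & match_key).sum() / denom
```

DCAP then compares this score computed within the synthetic data (with the original record's key and target as the query) against the baseline computed on the original data itself.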

  17. Measuring Disclosure Risk using the Targeted Correct Attribution Probability (TCAP) Intruder scenario: the intruder has some information Ki on an individual that they know to be in the original dataset, and wants to learn the value of some variable Ti. They have access to the synthetic dataset and identify all the records in it that match Ki. If the proportion of matching records in the largest equivalence class on Ts|Ks meets some threshold, they infer that value for Ti; if not, they give up. TCAP captures the proportion of records for the key K that have the same target value as their original equivalent.
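The intruder scenario can be sketched as a small function. The `tcap` name and data layout are hypothetical, and note that published TCAP variants differ in the denominator (all records vs. only those where an inference is made); this sketch uses the latter.

```python
from collections import Counter

def tcap(synth, orig, key_idx, target_idx, threshold=0.9):
    """TCAP sketch: for each original record, find the synthetic records
    matching its key; if the modal target value among them reaches the
    threshold proportion, the intruder infers that value.  Returns the
    fraction of correct inferences among the inferences made."""
    # index synthetic target values by key
    by_key = {}
    for row in synth:
        k = tuple(row[i] for i in key_idx)
        by_key.setdefault(k, []).append(row[target_idx])
    made, correct = 0, 0
    for row in orig:
        k = tuple(row[i] for i in key_idx)
        matches = by_key.get(k)
        if not matches:
            continue                      # no matching key: intruder gives up
        value, count = Counter(matches).most_common(1)[0]
        if count / len(matches) >= threshold:
            made += 1
            correct += row[target_idx] == value
    return correct / made if made else 0.0
```

A high TCAP means the synthetic data still lets an intruder with the key variables attribute target values correctly, which is exactly the attribution risk the measure is after.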

  18. Using GAs to capture the Trade Off

  19. Genetic Algorithms (GAs) Natural Computing > Evolutionary Computing > Genetic Algorithms. Natural computing comprises computational systems inspired by biological systems, simulating the process of organisms surviving limited resources and predators. Evolutionary computing uses the principles of natural evolution and genotypic variation to solve complicated optimisation problems. A genetic algorithm simulates the process of natural evolution, including natural selection, crossover and mutation.

  20. Genetic Algorithms (GAs) GAs can cope with multiple objectives that may conflict with each other. GAs can explore the solution space of complex, high-dimensional problems like data synthesis.

  21. Model Design Initial Population: 100 candidates that are mutated from the original data, so they are high in utility (and risk). Selection Operator: deterministic tournament selection with tournament size t = 2, i.e. 2 candidates are randomly selected into a tournament (with replacement) and only the winner enters the crossover operator.
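Deterministic tournament selection with t = 2 can be sketched as below. The `tournament_select` name is ours, and the convention that a lower objective value wins is an assumption matching the minimise-D_JS objective.

```python
import random

def tournament_select(population, fitness, t=2):
    """Deterministic tournament selection with replacement: draw t
    candidate indices at random and return the candidate with the best
    (here: lowest) objective value."""
    contenders = random.choices(range(len(population)), k=t)
    winner = min(contenders, key=lambda i: fitness[i])
    return population[winner]
```

With t = 2 the selection pressure is mild: the better of two random candidates wins, so weaker candidates still reproduce occasionally, which preserves diversity in the population.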

  22. Model Design Crossover Operator Whole-case parallelised crossover: it operates on every case in the candidate; each case is selected with a predetermined crossover rate (0.1 in this paper) and is then swapped with the corresponding case in the paired candidate. Mutation Operator Uniform mutation: it gives every single element/cell in the candidate a chance (0.001 in this paper) to mutate.
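The two operators can be sketched as below. The function names and the data layout (a candidate as a list of rows, with per-variable category lists) are assumptions of ours, not the authors' implementation.

```python
import random

def whole_case_crossover(cand_a, cand_b, rate=0.1):
    """Whole-case parallelised crossover sketch: each case (row) is
    selected with probability `rate` (0.1 in the paper) and swapped with
    the corresponding case of the paired candidate."""
    a = [row[:] for row in cand_a]
    b = [row[:] for row in cand_b]
    for i in range(len(a)):
        if random.random() < rate:
            a[i], b[i] = b[i], a[i]
    return a, b

def uniform_mutation(candidate, categories, rate=0.001):
    """Uniform mutation sketch: every cell gets a chance `rate` (0.001 in
    the paper) to be replaced by a random category for its variable."""
    out = [row[:] for row in candidate]
    for row in out:
        for j in range(len(row)):
            if random.random() < rate:
                row[j] = random.choice(categories[j])
    return out
```

Swapping whole cases keeps each record's within-row variable combinations intact, so crossover reshuffles which records appear in a candidate without inventing new attribute combinations; only mutation introduces genuinely new cell values.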

  23. Objective Design Utility Objective We use the full contingency table to capture the multivariate structure of all variables in the data. The distance between candidates and the original data is calculated by the Jensen-Shannon divergence. Suppose P and Q are two discrete probability distributions; then D_JS(P‖Q) is defined by D_JS(P‖Q) = (1/2) D_KL(P‖M) + (1/2) D_KL(Q‖M), where M = (1/2)(P + Q) and D_KL is the Kullback-Leibler divergence. The utility objective of the GA is to minimise D_JS between the full contingency tables of the synthetic and original data.

  24. Experiment Data: the dataset is from the 1901 Scottish Census1 and consists of 82,851 records. Process: as an output we synthesised a single synthetic dataset with a risk score of 0.3964 and a utility (D_JS) score of 0.1257. The process took 57 generations. The figure on the right shows how the risk and utility of the best candidate in each generation changed during the optimisation process. 1. National Records of Scotland (1901), 1901 Scottish Census.

  25. Comparisons to Other Synthesis Methods We compared the output (GA synthetic data) with CART- and parametrically-generated synthetic data. Utility comparison: histogram comparison (figure on the right).

  26. Comparisons to Other Synthesis Methods Propensity mean square error (pMSE) score comparison: the closer the pMSE is to 0, the better the data performs. In this instance all synthetic datasets have quite low pMSE scores; however, the GA performs better than the CART and parametric datasets.

  Dataset                     pMSE        Standardized pMSE   pMSE ratio
  GA-synthetic data           5.44e-06    -3.9221             0.1638
  CART-synthetic data         3.397e-05   0.1106              1.0236
  Parametric synthetic data   3.17e-05    7.007e-06           0.9553

  27. Comparisons to Other Synthesis Methods Risk comparison: given that the baseline DCAP score for the univariate is 0.4154, the GA and parametric datasets would be considered no risk, since they are below the baseline, and the CART synthetic dataset would have minimal risk, since it is very close to the baseline.

  Dataset                     Risk
  GA-synthetic data           0.3964
  CART-synthetic data         0.4168
  Parametric synthetic data   0.3278

  28. Concluding remarks GAs are a viable alternative to standard synthesisers. GAs are able to produce synthetic data that allows disclosure risk and information utility to be included in the same generation framework. Current and future work: a synthetic data challenge (watch out for a call! keen to have some DP synthetic datasets this time); bringing attribute and identification disclosure into a common framework; better general measures (earth mover's distance, sample size equivalence).

  29. References Taub, J., Elliot, M. and Sakshaug, J. (2020, accepted) The Impact of Synthetic Data Generation on Data Utility. Transactions on Data Privacy. Taub, J., Elliot, M., Pampaka, M. and Smith, D. (2018) Differential Correct Attribution Probability for Synthetic Data: An Exploration. In J. Domingo-Ferrer and F. Montes (eds), Privacy in Statistical Databases (LNCS, volume 11126), 122-137. Chen, Y., Elliot, M. and Smith, D. (2018) The Application of Genetic Algorithms to Data Synthesis: A Comparison of Three Crossover Methods. In J. Domingo-Ferrer and F. Montes (eds), Privacy in Statistical Databases (LNCS, volume 11126), 160-171. Taub, J., Elliot, M., Raab, G., Charest, A.-S., Chen, C., Pistner, M., Snoke, J. and Slavković, A. (2019) Creating the Best Risk-Utility Profile: The Synthetic Data Challenge.
