Ensuring Privacy in Microdata Publishing

Publishing Microdata with a
Robust Privacy Guarantee
Jianneng Cao,
 
National University of Singapore, now at I
2
R
Panagiotis Karras,
 Rutgers University
 
Table 2. 
Voter registration list
 
Quasi-identifier (QI):
 
Non-sensitive
 attribute set like {
Age, Sex,
Zipcode
}, linkable to external data to 
re-identify
  individuals
 
Sensitive attribute (SA):
  
Sensitive
 attribute like 
Disease
, 
undesirable to
be linked to an individual
Background: QI & SA
 
QI space
 
An EC
Minimum bounding box
(MBR)
Smaller MBR; less
distortion
Background: EC & information loss
Background: 
k
-anonymity & 
l
-diversity
 
 k
-anonymity
: An EC should contain at least 
k
 tuples
 Table 3 is 3-anonymous
 Prone to homogeneity attack
 
 
l
-diversity
: … at least 
l  
“well represented”
 SA values
Background: limitations of 
l
-diversity
 
l
-diversity does 
not
 consider 
unavoidable
 background
knowledge: 
SA distribution in whole table
 
(High diversity!)
 
t
-closeness (the most recent privacy model) [1] :
SA = {v
1
, v
2
, …, v
m
}
P
=(p
1
, p
2
, …, p
m
): SA distribution in the whole table
Prior knowledge
Q
=(q
1
, q
2
, …, q
m
): SA distribution in an EC
Posterior knowledge
Distance (P, Q) ≤ 
t
Information gain after seeing an EC
Background:
 t
-closenesss and EMD
 
[1] Li et al. 
t
-closeness: Privacy beyond 
k
-anonymity and 
l
-diversity. 
ICDE
, 2007
 
Earth Mover’s Distance (EMD):
P
, set of “
holes
Q
, piles of “
earth
EMD is the 
minimum work to fill 
P
 by 
Q
Limitations of 
t
-closeness
 
t
-closeness cannot translate 
t
 into clear privacy guarantee
Relative individual distances between 
p
j
 and 
q
j
 are not clear.
t
-closeness instantiation, EMD 
[1]
 
By EMD, both cases assume the same privacy
 
[1] Li et al. t-closeness: Privacy beyond k-anonymity and 
l
-diversity. 
ICDE
, 2007.
β
-likeness
 
q
i
p
i
 
Lowers correlation between a person and 
p
i
 
Privacy enhanced
 
We focus on 
q
i
 > 
p
i
Distance function
 
Attempt 2:
 
Attempt 3:
An observation
 
0
-likeness: 
1 EC with all tuples
Low information quality
 
1-likeness: 2
 ECs
H
igher information quality
Higher
 p
rivacy loss for 
β
 ≥ 1
BUREL
4 Gastric ulcer
4 Intestinal cancer
β
 = 2
 
2/19 +3/19<f(2/19)≈0.31
 
3/19 +3/19<f(3/19)≈0.45
 
4/19 +4/19<f(4/19)≈0.54
Step 1: Bucketization
 
Step 2: Reallocation
 
Step 3: Populate ECs
 
Tuples drawn proportionally
to bucket sizes
 
Determines # of tuples each EC gets from each bucket in top-down splitting process
approximately obeying proportionality; terminates when eligibility violated
 
Process guided by information loss considerations
 
More material in paper
 
Perturbation-based scheme.
Arguments about resistance to attacks.
 
CENSUS data set:
Real, 500,000 tuples, 5 QI attributes, 1 SA
SABRE
 & 
tMondrian [1]
:
Under same 
t
-closeness (info loss)
BUREL: higher privacy in terms of 
β
-likeness
Benchmarks
Extended from [2]
BUREL: best info quality & fastest
 
Summary of experiments
 
[1] Li et al. Closeness: A new privacy measure for data publishing. 
TKDE
, 2010
[2] LeFevre et al. Mondrian Multidimensional K-Anonymity. 
ICDE
  2006
Figure.
 Comparison to 
t
-closeness
 
(a) Given 
β
 and dataset DB
BUREL(DB, 
β
)=DB
β
, following t
β
-closeness
All schemes are 
t
β
-closeness
Comparison
 in terms of 
β
-likeness
 
(b) Given t and DB
BUREL finds 
β
t
 by binary search
BUREL(DB, 
β
t
) follows t-closeness
All schemes are t-closeness
Comparison
 in terms of 
β
-likeness
 
(c) Given AIL (average information loss) and DB
All schemes have same AIL
Comparison
 in terms of 
β
-likeness
 
LMondrian: 
extension of Mondrian for 
β
-likeness
DMondrian: 
extension of 
δ
-disclosure to support 
β
-likeness
BUREL 
clearly outperforms the others
 
Conclusion
 
Robust model for microdata anonymization.
Comprehensible privacy guarantee.
Can withstand attacks proposed in previous
research.
 
Thank you!  Questions?
Thank you!  Questions?
t
-closeness instantiation, KL/JS-divergence
 
[1] D. Rebollo-Monedero et al. From t-closeness-like privacy to postrandomization via information theory. TKDE 2010.
 
[2] N. Li et al. Closeness: A new privacy measure for data publishing. TKDE 2010.
 
Privacy: Case 2 is 
higher
 than Case 1
δ
-disclosure [1]
 
But:
 
Clear privacy guarantee defined on 
individual
 SA values
[1] J. Brickell et al. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In KDD, 2008.
Slide Note
Embed
Share

This research explores various privacy protection techniques for publishing microdata, focusing on k-anonymity, l-diversity, t-closeness, and EMD models. It discusses the challenges and limitations of each method, highlighting the need for robust privacy guarantees in the handling of sensitive information. The study emphasizes the importance of safeguarding individual data while still allowing for meaningful data analysis.

  • Privacy
  • Microdata Publishing
  • K-anonymity
  • L-diversity
  • T-closeness

Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Publishing Microdata with a Robust Privacy Guarantee Jianneng Cao, National University of Singapore, now at I2R Panagiotis Karras, Rutgers University

  2. Background: QI & SA Table 1. Microdata about patients Table 2. Voter registration list Quasi-identifier (QI): Non-sensitive attribute set like {Age, Sex, Zipcode}, linkable to external data to re-identify individuals Sensitive attribute (SA):Sensitive attribute like Disease, undesirable to be linked to an individual

  3. Background: EC & information loss Equivalence class (EC): A group of records with the same QI values Table 3. Anonymized data in Table 1 QI space Sex An EC Minimum bounding box (MBR) Smaller MBR; less distortion EC 2 Female Male 25 28 53711 Age 53712 Zipcode

  4. Background: k-anonymity & -diversity Equivalence class (EC): A group of records with the same QI values Table 3. Anonymized data in Table 1 k-anonymity: An EC should contain at least k tuples Table 3 is 3-anonymous Prone to homogeneity attack -diversity: at least well represented SA values

  5. Background: limitations of -diversity (High diversity!) -diversity does not consider unavoidable background knowledge: SA distribution in whole table Table 4. A 3-diverse table

  6. Background: t-closenesss and EMD t-closeness (the most recent privacy model) [1] : SA = {v1, v2, , vm} P=(p1, p2, , pm): SA distribution in the whole table Prior knowledge Q=(q1, q2, , qm): SA distribution in an EC Posterior knowledge Distance (P, Q) t Information gain after seeing an EC Earth Mover s Distance (EMD): P, set of holes Q, piles of earth EMD is the minimum work to fill P by Q [1] Li et al. t-closeness: Privacy beyond k-anonymity and -diversity. ICDE, 2007

  7. Limitations of t-closeness Relative individual distances between pj and qj are not clear. t-closeness cannot translate t into clear privacy guarantee

  8. t-closeness instantiation, EMD [1] Case 1: Case 2: By EMD, both cases assume the same privacy However [1] Li et al. t-closeness: Privacy beyond k-anonymity and -diversity. ICDE, 2007.

  9. -likeness qi pi Lowers correlation between a person and pi Privacy enhanced We focus on qi > pi

  10. Distance function Attempt 2: Attempt 3: Attempt 1:

  11. An observation B1 B2 B3 0-likeness: 1 EC with all tuples Low information quality 1-likeness: 2 ECs Higher information quality Higher privacy loss for 1

  12. BUREL = 2 3/19 +3/19<f(3/19) 0.45 B1 B2 B3 2 SARS 3 Pneumonia 3 Bronchitis 3 Hepatitis 4 Gastric ulcer 4 Intestinal cancer x1 x2 x3 2/19 +3/19<f(2/19) 0.31 4/19 +4/19<f(4/19) 0.54 Tuples drawn proportionally to bucket sizes Step 1: Bucketization Step 2: Reallocation Determines # of tuples each EC gets from each bucket in top-down splitting process approximately obeying proportionality; terminates when eligibility violated Step 3: Populate ECs Process guided by information loss considerations Build partition satisfying this condition by DP

  13. More material in paper Perturbation-based scheme. Arguments about resistance to attacks.

  14. Summary of experiments CENSUS data set: Real, 500,000 tuples, 5 QI attributes, 1 SA SABRE & tMondrian [1]: Under same t-closeness (info loss) BUREL: higher privacy in terms of -likeness Benchmarks Extended from [2] BUREL: best info quality & fastest [1] Li et al. Closeness: A new privacy measure for data publishing. TKDE, 2010 [2] LeFevre et al. Mondrian Multidimensional K-Anonymity. ICDE 2006

  15. Figure. Comparison to t-closeness (c) Given AIL (average information loss) and DB All schemes have same AIL (a) Given and dataset DB BUREL(DB, )=DB , following t -closeness All schemes are t -closeness Comparison in terms of -likeness All schemes are t-closeness Comparison in terms of -likeness (b) Given t and DB BUREL finds t by binary search BUREL(DB, t) follows t-closeness Comparison in terms of -likeness

  16. LMondrian: extension of Mondrian for -likeness DMondrian: extension of -disclosure to support -likeness BUREL clearly outperforms the others

  17. Conclusion Robust model for microdata anonymization. Comprehensible privacy guarantee. Can withstand attacks proposed in previous research.

  18. Thank you! Questions?

  19. t-closeness instantiation, KL/JS-divergence Case 1: Case 2: Case 1: 0.0290 (0.0073) Case 2: 0.0133 (0.0038) Privacy: Case 2 is higher than Case 1 But [1] D. Rebollo-Monedero et al. From t-closeness-like privacy to postrandomization via information theory. TKDE 2010. [2] N. Li et al. Closeness: A new privacy measure for data publishing. TKDE 2010.

  20. -disclosure [1] Clear privacy guarantee defined on individual SA values But: [1] J. Brickell et al. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In KDD, 2008.

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#