Ensuring Privacy in Microdata Publishing

Publishing Microdata with a

Robust Privacy Guarantee

Jianneng Cao,

National University of Singapore, now at I

Panagiotis Karras,

 Rutgers University

Table 2.

Voter registration list

Quasi-identifier (QI):

Non-sensitive

 attribute set like {

Age, Sex,

Zipcode

}, linkable to external data to

re-identify

  individuals

Sensitive attribute (SA):

Sensitive

 attribute like

Disease

undesirable to

be linked to an individual

Background: QI & SA

QI space

•

An EC

–

Minimum bounding box

(MBR)

–

Smaller MBR; less

distortion

Background: EC & information loss

Background:

-anonymity &

-diversity

•

-anonymity

: An EC should contain at least

 tuples

‐

 Table 3 is 3-anonymous

‐

 Prone to homogeneity attack

•

-diversity

: … at least

“well represented”

 SA values

Background: limitations of

-diversity

-diversity does

not

 consider

unavoidable

 background

knowledge:

SA distribution in whole table

(High diversity!)

•

-closeness (the most recent privacy model) [1] :

–

SA = {v

, v

, …, v

–

=(p

, p

, …, p

): SA distribution in the whole table

•

Prior knowledge

–

=(q

, q

, …, q

): SA distribution in an EC

•

Posterior knowledge

–

Distance (P, Q) ≤

•

Information gain after seeing an EC

Background:

-closenesss and EMD

[1] Li et al.

-closeness: Privacy beyond

-anonymity and

-diversity.

ICDE

, 2007

•

Earth Mover’s Distance (EMD):

–

, set of “

holes

”

–

, piles of “

earth

”

–

EMD is the

minimum work to fill

by

Limitations of

-closeness

-closeness cannot translate

 into clear privacy guarantee

Relative individual distances between

and

 are not clear.

-closeness instantiation, EMD

[1]

By EMD, both cases assume the same privacy

[1] Li et al. t-closeness: Privacy beyond k-anonymity and

-diversity.

ICDE

, 2007.

β

-likeness

≤

Lowers correlation between a person and

Privacy enhanced

We focus on

Distance function

Attempt 2:

Attempt 3:

An observation

•

-likeness:

1 EC with all tuples

–

Low information quality

•

1-likeness: 2

ECs

–

igher information quality

–

Higher

rivacy loss for

β

 ≥ 1

BUREL

4 Gastric ulcer

4 Intestinal cancer

β

= 2

2/19 +3/19<f(2/19)≈0.31

3/19 +3/19<f(3/19)≈0.45

4/19 +4/19<f(4/19)≈0.54

Step 1: Bucketization

Step 2: Reallocation

Step 3: Populate ECs

Tuples drawn proportionally

to bucket sizes

Determines # of tuples each EC gets from each bucket in top-down splitting process

approximately obeying proportionality; terminates when eligibility violated

Process guided by information loss considerations

More material in paper

•

Perturbation-based scheme.

•

Arguments about resistance to attacks.

•

CENSUS data set:

–

Real, 500,000 tuples, 5 QI attributes, 1 SA

•

SABRE

tMondrian [1]

–

Under same

-closeness (info loss)

–

BUREL: higher privacy in terms of

β

-likeness

•

Benchmarks

–

Extended from [2]

–

BUREL: best info quality & fastest

Summary of experiments

[1] Li et al. Closeness: A new privacy measure for data publishing.

TKDE

, 2010

[2] LeFevre et al. Mondrian Multidimensional K-Anonymity.

ICDE

Figure.

 Comparison to

-closeness

•

(a) Given

β

 and dataset DB

–

BUREL(DB,

β

)=DB

β

, following t

β

-closeness

–

All schemes are

β

-closeness

–

Comparison

 in terms of

β

-likeness

•

(b) Given t and DB

–

BUREL finds

β

 by binary search

–

BUREL(DB,

β

) follows t-closeness

–

All schemes are t-closeness

–

Comparison

 in terms of

β

-likeness

•

(c) Given AIL (average information loss) and DB

–

All schemes have same AIL

–

Comparison

 in terms of

β

-likeness

LMondrian:

extension of Mondrian for

β

-likeness

DMondrian:

extension of

δ

-disclosure to support

β

-likeness

BUREL

clearly outperforms the others

Conclusion

•

Robust model for microdata anonymization.

•

Comprehensible privacy guarantee.

•

Can withstand attacks proposed in previous

research.

Thank you!  Questions?

Thank you!  Questions?

-closeness instantiation, KL/JS-divergence

[1] D. Rebollo-Monedero et al. From t-closeness-like privacy to postrandomization via information theory. TKDE 2010.

[2] N. Li et al. Closeness: A new privacy measure for data publishing. TKDE 2010.

Privacy: Case 2 is

higher

 than Case 1

δ

-disclosure [1]

But:

Clear privacy guarantee defined on

individual

 SA values

[1] J. Brickell et al. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In KDD, 2008.

Slide Note

Embed Share

Download

This research explores various privacy protection techniques for publishing microdata, focusing on k-anonymity, l-diversity, t-closeness, and EMD models. It discusses the challenges and limitations of each method, highlighting the need for robust privacy guarantees in the handling of sensitive information. The study emphasizes the importance of safeguarding individual data while still allowing for meaningful data analysis.

mala_9 Follow

Uploaded on Sep 24, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Publishing Microdata with a Robust Privacy Guarantee Jianneng Cao, National University of Singapore, now at I2R Panagiotis Karras, Rutgers University

Background: QI & SA Table 1. Microdata about patients Table 2. Voter registration list Quasi-identifier (QI): Non-sensitive attribute set like {Age, Sex, Zipcode}, linkable to external data to re-identify individuals Sensitive attribute (SA):Sensitive attribute like Disease, undesirable to be linked to an individual

Background: EC & information loss Equivalence class (EC): A group of records with the same QI values Table 3. Anonymized data in Table 1 QI space Sex An EC Minimum bounding box (MBR) Smaller MBR; less distortion EC 2 Female Male 25 28 53711 Age 53712 Zipcode

Background: k-anonymity & -diversity Equivalence class (EC): A group of records with the same QI values Table 3. Anonymized data in Table 1 k-anonymity: An EC should contain at least k tuples Table 3 is 3-anonymous Prone to homogeneity attack -diversity: at least well represented SA values

Background: limitations of -diversity (High diversity!) -diversity does not consider unavoidable background knowledge: SA distribution in whole table Table 4. A 3-diverse table

Background: t-closenesss and EMD t-closeness (the most recent privacy model) [1] : SA = {v1, v2, , vm} P=(p1, p2, , pm): SA distribution in the whole table Prior knowledge Q=(q1, q2, , qm): SA distribution in an EC Posterior knowledge Distance (P, Q) t Information gain after seeing an EC Earth Mover s Distance (EMD): P, set of holes Q, piles of earth EMD is the minimum work to fill P by Q [1] Li et al. t-closeness: Privacy beyond k-anonymity and -diversity. ICDE, 2007

Limitations of t-closeness Relative individual distances between pj and qj are not clear. t-closeness cannot translate t into clear privacy guarantee

t-closeness instantiation, EMD [1] Case 1: Case 2: By EMD, both cases assume the same privacy However [1] Li et al. t-closeness: Privacy beyond k-anonymity and -diversity. ICDE, 2007.

-likeness qi pi Lowers correlation between a person and pi Privacy enhanced We focus on qi > pi

Distance function Attempt 2: Attempt 3: Attempt 1:

An observation B1 B2 B3 0-likeness: 1 EC with all tuples Low information quality 1-likeness: 2 ECs Higher information quality Higher privacy loss for 1

BUREL = 2 3/19 +3/19<f(3/19) 0.45 B1 B2 B3 2 SARS 3 Pneumonia 3 Bronchitis 3 Hepatitis 4 Gastric ulcer 4 Intestinal cancer x1 x2 x3 2/19 +3/19<f(2/19) 0.31 4/19 +4/19<f(4/19) 0.54 Tuples drawn proportionally to bucket sizes Step 1: Bucketization Step 2: Reallocation Determines # of tuples each EC gets from each bucket in top-down splitting process approximately obeying proportionality; terminates when eligibility violated Step 3: Populate ECs Process guided by information loss considerations Build partition satisfying this condition by DP

More material in paper Perturbation-based scheme. Arguments about resistance to attacks.

Summary of experiments CENSUS data set: Real, 500,000 tuples, 5 QI attributes, 1 SA SABRE & tMondrian [1]: Under same t-closeness (info loss) BUREL: higher privacy in terms of -likeness Benchmarks Extended from [2] BUREL: best info quality & fastest [1] Li et al. Closeness: A new privacy measure for data publishing. TKDE, 2010 [2] LeFevre et al. Mondrian Multidimensional K-Anonymity. ICDE 2006

Figure. Comparison to t-closeness (c) Given AIL (average information loss) and DB All schemes have same AIL (a) Given and dataset DB BUREL(DB, )=DB , following t -closeness All schemes are t -closeness Comparison in terms of -likeness All schemes are t-closeness Comparison in terms of -likeness (b) Given t and DB BUREL finds t by binary search BUREL(DB, t) follows t-closeness Comparison in terms of -likeness

LMondrian: extension of Mondrian for -likeness DMondrian: extension of -disclosure to support -likeness BUREL clearly outperforms the others

Conclusion Robust model for microdata anonymization. Comprehensible privacy guarantee. Can withstand attacks proposed in previous research.

Thank you! Questions?

t-closeness instantiation, KL/JS-divergence Case 1: Case 2: Case 1: 0.0290 (0.0073) Case 2: 0.0133 (0.0038) Privacy: Case 2 is higher than Case 1 But [1] D. Rebollo-Monedero et al. From t-closeness-like privacy to postrandomization via information theory. TKDE 2010. [2] N. Li et al. Closeness: A new privacy measure for data publishing. TKDE 2010.

-disclosure [1] Clear privacy guarantee defined on individual SA values But: [1] J. Brickell et al. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In KDD, 2008.

Ensuring Privacy in Microdata Publishing

Download Presentation

Presentation Transcript

Related

More Related Content