Optimizing Privacy in High-Dimensional Data Mining

Differential Privacy is a cutting-edge approach to safeguarding data privacy, especially in high-dimensional datasets. This proposal focuses on reducing computing complexity and improving signal-to-noise ratio by approximating the dataset distribution with low-dimensional marginals. Key research areas include constructing Bayesian networks, generating synthetic data, and exploring differential privacy mechanisms.

  • Privacy
  • Data Mining
  • High-Dimensional Data
  • Synthetic Data
  • Bayesian Networks


Presentation Transcript


  1. The Virginia (Vigenère) cipher. http://star.aust.edu.cn/~xjfang  Email: xjfang@aliyun.com

  2. Outline: a C implementation that reads the text from a *.txt file and performs Virginia encryption and decryption.

  3. Virginia

  4. Virginia alphabet: Charset[26] = {a, b, ..., z} (26 letters); Coding[26] = {0, 1, ..., 25}.

  5. Virginia notation: plaintext M = m1 m2 ... mn, each mi in the charset, length n; key K = k1 k2 ... kd, each ki in the charset, length d; ciphertext C = c1 c2 ... cn, each ci in the charset, length n.

  6. Encryption: c(j+td) = (m(j+td) + kj) mod 26, for j = 1..d and t = 0..ceiling(n/d) - 1, where ceiling(x) is the smallest integer not less than x. Decryption: m(j+td) = (c(j+td) - kj) mod 26 over the same ranges of j and t.
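These formulas translate directly into code. Below is a minimal C sketch (the function name is illustrative, and the text is assumed to be already reduced to the lowercase letters a..z); walking the text position by position and using i mod d is equivalent to the (j, t) indexing above.

```c
#include <stdio.h>
#include <string.h>

/* Virginia cipher over a..z:
 *   encrypt: c[i] = (m[i] + k[i mod d]) mod 26
 *   decrypt: m[i] = (c[i] - k[i mod d] + 26) mod 26 */
static void virginia(char *text, const char *key, int decrypt)
{
    size_t n = strlen(text), d = strlen(key);
    for (size_t i = 0; i < n; i++) {
        int m = text[i] - 'a';
        int k = key[i % d] - 'a';
        text[i] = (char)('a' + (decrypt ? (m - k + 26) % 26 : (m + k) % 26));
    }
}

int main(void)
{
    char msg[] = "nothingisto";   /* the example on the next slide */
    virginia(msg, "joy", 0);      /* encrypt: prints wcrqwlpwqcc   */
    printf("%s\n", msg);
    virginia(msg, "joy", 1);      /* decrypt back to nothingisto   */
    printf("%s\n", msg);
    return 0;
}
```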

  7. Example: M = nothingisto (n = 11), K = joy (d = 3), so t runs from 0 to ceiling(11/3) - 1 = 3.
     Plaintext  M: n(13) o(14) t(19) h(7)  i(8)  n(13) g(6)  i(8)  s(18) t(19) o(14)
     Key        K: j(9)  o(14) y(24) j(9)  o(14) y(24) j(9)  o(14) y(24) j(9)  o(14)
     (j, t)      : (1,0) (2,0) (3,0) (1,1) (2,1) (3,1) (1,2) (2,2) (3,2) (1,3) (2,3)
     Ciphertext C: w(22) c(2)  r(17) q(16) w(22) l(11) p(15) w(22) q(16) c(2)  c(2)

  8. Differential Privacy is the state-of-the-art goal for the problem of privacy-preserving data release and privacy-preserving data mining. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods incur higher computing complexity and lower information to noise ratio, which renders the published data next to useless. This proposal aims to reduce computing complexity and signal to noise ratio. The starting point is to approximate the full distribution of high-dimensional dataset with a set of low-dimensional marginal distributions via optimizing score function and reducing sensitivity, in which generation of noisy conditional distributions with differential privacy is computed in a set of low-dimensional subspaces, and then, the sample tuples from the noisy approximation distribution are used to generate and release the synthetic dataset. Some crucial science problems would be investigated below: (i) constructing a low k-degree Bayesian network over the high-dimensional dataset via exponential mechanism in differential privacy, where the score function is optimized to reduce the sensitivity using mutual information, equivalence classes in maximum joint distribution and dynamic programming; (ii)studying the algorithm to compute a set of noisy conditional distributions from joint distributions in the subspace of Bayesian network, via the Laplace mechanism of differential privacy. (iii)exploring how to generate synthetic data from the differentially private Bayesian network and conditional distributions, without explicitly materializing the noisy global distribution. The proposed solution may have theoretical and technical significance for synthetic data generation with differential privacy on business prospects.

  9. ( ) differentialprivacyisthestateoftheartgoalfortheproblemofprivacypreservingdatarelease andprivacypreservingdataminingexistingtechniquesusingdifferentialprivacyhoweverca nnoteffectivelyhandlethepublicationofhighdimensionaldatainparticularwhentheinputd atasetcontainsalargenumberofattributesexistingmethodsincurhighercomputingcomple xityandlowerinformationtonoiseratiowhichrendersthepublisheddatanexttouselessthisp roposalaimstoreducecomputingcomplexityandsignaltonoiseratiothestartingpointistoap proximatethefulldistributionofhighdimensionaldatasetwithasetoflowdimensionalmargi naldistributionsviaoptimizingscorefunctionandreducingsensitivityinwhichgenerationof noisyconditionaldistributionswithdifferentialprivacyiscomputedinasetoflowdimensiona lsubspacesandthenthesampletuplesfromthenoisyapproximationdistributionareusedto generateandreleasethesyntheticdatasetsomecrucialscienceproblemswouldbeinvestiga tedbelowiconstructingalowkdegreebayesiannetworkoverthehighdimensionaldatasetvi aexponentialmechanismindifferentialprivacywherethescorefunctionisoptimizedtoredu cethesensitivityusingmutualinformationequivalenceclassesinmaximumjointdistributio nanddynamicprogrammingiistudyingthealgorithmtocomputeasetofnoisyconditionaldis tributionsfromjointdistributionsinthesubspaceofbayesiannetworkviathelaplacemechan ismofdifferentialprivacyiiiexploringhowtogeneratesyntheticdatafromthedifferentiallypr ivatebayesiannetworkandconditionaldistributionswithoutexplicitlymaterializingthenois yglobaldistributiontheproposedsolutionmayhavetheoreticalandtechnicalsignificancefo rsyntheticdatagenerationwithdifferentialprivacyonbusinessprospects
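The letter-only string above is simply the abstract with everything except letters dropped and the rest lowercased. A minimal C sketch of that preprocessing step, reading from a *.txt file as slide 2 suggests (the file name is illustrative):

```c
#include <stdio.h>
#include <ctype.h>

/* Read a .txt file, keep only letters, lowercase them, print the result. */
int main(void)
{
    FILE *fp = fopen("plaintext.txt", "r");   /* illustrative file name */
    if (!fp) { perror("fopen"); return 1; }
    int ch;
    while ((ch = fgetc(fp)) != EOF)
        if (isalpha(ch))
            putchar(tolower(ch));
    putchar('\n');
    fclose(fp);
    return 0;
}
```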

  10. virginia key=infosec lvktwvgvgnodttqifqqmubujglevmbkhziczglcsphweyvwttwoqseshxenjsgaxejgwvxqalrsxczrqsswgiaid jmxipddjiumeawfkfigfaarkvtjlawvqalhwgjvvviwwwavsuvmhnrwsfxkiyufazcklmcoixmehofrqbrktwg vqijzqlcvqqsllgxhgzagcbvtbgjjqtmraqgvfncfenlnyoarrieywuyniebvwrvprnbhyvlnyokivkbshsmpanqo jkgvhrpwvqnnyhjmdcgjgwbkagnbyqgbutrkmpkhwvakjmehcetwbvsuusoxyjlaxaiaizgagzvstgvoigncf xqvbngwvcbvtkzmepejbvitagmshydtvxvwhfigfbwbvbbzgwpgafyvawrzbuckenivrglstmqzqwgquczha rikbrddizqgdofhuqtsodxqvbngwvcbvthziubnwharixbnblmubbfdhvqfvrolivprkidpfqfyfafwbvtbgjjqt mraqgvfncfenlnyokivevyvswgbbkzgafqzjbkmqvnqasviqafzvmubenpmxkwaxjaeqxgnaadkvtxqgvgnh sqlmqvnsrjifcpnbywgvfnhazkblnbolkkulsfitigncfshvbngqgqvqnhaspiyiwkxtqozhaspajnhzhknsjfwrv qnqdjmxipdwkgquczhwhkvnxslshtbbraqgvfncfenahggheemffbvxjmayvwwcucqslyrtrxtjsobujbgmu gnudjszqzfhasplvxhjmdcgncfetmhxsvxqorssjevmnsrjinmnxsllgalshzivqpioleumgxceiezhhwspukvjbu irzbgzwquebzzvfgqaaskxkonysvfgtbbwuspagwiuxkvtfzgamlrlfwidiljgaepvrykgvmwijfllgpvlvvmomax wgrctqfhswgbinowbrwajblmctzjqzepqfrwfhknsjfwrvqnqdjmxipdkzitmgmskgqzrkifgvqbswksrbvrwr ifbbwsvyemgmskipavywnmvghxwfkocgzodmpnbwasxkwajemmxiyjbuietnxgwwkvzflaqwuwtwfxfqf yfafwbvtbsrfllsoemexetujeouvsuamubhimaribujodkqzvyvexqkbrdmxgifjhgjpwvxmusplvywgrctqng lvkjhywgrunetabskvgiwkxtqozhaspavshziucoxdsggwsgoqiuqnsbwxywepjaevprqohpckrrsulcvvxagjf qsksjipbvfzhvkdnhmamkmkuzgvkvtmcoxqorssjevmfdbllgbvhrsxcnetallglvktwvgvgnodpaxenjsxgjnd skmcvajhostsnsrusplvywgrctqnglvkjhywgruevyvgyvmkuzagkbydasxgzvfzadkvtyvwrqqfdudsdiyiwkx tqozhaspbujdjsrwfjrksncgncfqcgufjwxjmbwslmeiyfbvxgkuswuenavlbajkknsqwjqzfdbllgbvhrsxcorss jevqbskaxjlvktwvgvgnodttqifqqspjhxwfiuacwcktgkgxvzlj

  11. Virginia

  12. Index of coincidence (IC). For an alphabet of n letters in which letter i occurs with probability pi (1 <= i <= n), IC = sum over i = 1..n of pi^2. For a string of length L in which letter i occurs xi times, the IC is estimated by IC' = sum over i = 1..n of xi(xi - 1) / (L(L - 1)).
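A minimal C sketch of the estimate IC' (assuming plain ASCII input; the helper name is illustrative):

```c
#include <stdio.h>
#include <ctype.h>

/* IC' = sum over the 26 letters of x_i*(x_i - 1) / (L*(L - 1)),
 * where x_i is the count of letter i and L the number of letters. */
static double index_of_coincidence(const char *text)
{
    long x[26] = {0}, L = 0;
    for (const unsigned char *p = (const unsigned char *)text; *p; p++)
        if (isalpha(*p)) { x[tolower(*p) - 'a']++; L++; }
    if (L < 2) return 0.0;
    double s = 0.0;
    for (int i = 0; i < 26; i++) s += (double)x[i] * (x[i] - 1);
    return s / ((double)L * (L - 1));
}

int main(void)
{
    printf("IC = %.4f\n", index_of_coincidence("nothing is to"));
    return 0;
}
```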

  13. Properties of the IC: 1. For a uniformly random string of letters the IC is about 0.038. 2. For ordinary English text the IC is about 0.065. 3. A monoalphabetic substitution (a fixed shift applied to the whole text) does not change the IC.

  14. Example 1: a text whose IC is 0.0388, close to the value 0.038 expected for random text.

  15. After encryption with key = 17 the IC is still 0.0388.

  16. Example 2: text Differential Privacy is the state-of-the-art goal for the problem of privacy-preserving data release and privacy-preserving data mining. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods incur higher computing complexity and lower information to noise ratio, which renders the published data next to useless. This proposal aims to reduce computing complexity and signal to noise ratio. The starting point is to approximate the full distribution of high-dimensional dataset with a set of low-dimensional marginal distributions via optimizing score function and reducing sensitivity, in which generation of noisy conditional distributions with differential privacy is computed in a set of low-dimensional subspaces, and then, the sample tuples from the noisy approximation distribution are used to generate and release the synthetic dataset. Some crucial science problems would be investigated below: (i) constructing a low k-degree Bayesian network over the high-dimensional dataset via exponential mechanism in differential privacy, where the score function is optimized to reduce the sensitivity using mutual information, equivalence classes in maximum joint distribution and dynamic programming; (ii)studying the algorithm to compute a set of noisy conditional distributions from joint distributions in the subspace of Bayesian network, via the Laplace mechanism of differential privacy. (iii)exploring how to generate synthetic data from the differentially private Bayesian network and conditional distributions, without explicitly materializing the noisy global distribution. The proposed solution may have theoretical and technical significance for synthetic data generation with differential privacy on business prospects. IC 0.0659

  17. Breaking the Virginia cipher takes two steps: Step 1, determine the key length d; Step 2, recover the d key letters.

  18. ( ) differentialprivacyisthestateoftheartgoalfortheproblemofprivacypreservingdatarelease andprivacypreservingdataminingexistingtechniquesusingdifferentialprivacyhoweverca nnoteffectivelyhandlethepublicationofhighdimensionaldatainparticularwhentheinputd atasetcontainsalargenumberofattributesexistingmethodsincurhighercomputingcomple xityandlowerinformationtonoiseratiowhichrendersthepublisheddatanexttouselessthisp roposalaimstoreducecomputingcomplexityandsignaltonoiseratiothestartingpointistoap proximatethefulldistributionofhighdimensionaldatasetwithasetoflowdimensionalmargi naldistributionsviaoptimizingscorefunctionandreducingsensitivityinwhichgenerationof noisyconditionaldistributionswithdifferentialprivacyiscomputedinasetoflowdimensiona lsubspacesandthenthesampletuplesfromthenoisyapproximationdistributionareusedto generateandreleasethesyntheticdatasetsomecrucialscienceproblemswouldbeinvestiga tedbelowiconstructingalowkdegreebayesiannetworkoverthehighdimensionaldatasetvi aexponentialmechanismindifferentialprivacywherethescorefunctionisoptimizedtoredu cethesensitivityusingmutualinformationequivalenceclassesinmaximumjointdistributio nanddynamicprogrammingiistudyingthealgorithmtocomputeasetofnoisyconditionaldis tributionsfromjointdistributionsinthesubspaceofbayesiannetworkviathelaplacemechan ismofdifferentialprivacyiiiexploringhowtogeneratesyntheticdatafromthedifferentiallypr ivatebayesiannetworkandconditionaldistributionswithoutexplicitlymaterializingthenois yglobaldistributiontheproposedsolutionmayhavetheoreticalandtechnicalsignificancefo rsyntheticdatagenerationwithdifferentialprivacyonbusinessprospects

  19. virginia key=infosec lvktwvgvgnodttqifqqmubujglevmbkhziczglcsphweyvwttwoqseshxenjsgaxejgwvxqalrsxczrqsswgiaid jmxipddjiumeawfkfigfaarkvtjlawvqalhwgjvvviwwwavsuvmhnrwsfxkiyufazcklmcoixmehofrqbrktwg vqijzqlcvqqsllgxhgzagcbvtbgjjqtmraqgvfncfenlnyoarrieywuyniebvwrvprnbhyvlnyokivkbshsmpanqo jkgvhrpwvqnnyhjmdcgjgwbkagnbyqgbutrkmpkhwvakjmehcetwbvsuusoxyjlaxaiaizgagzvstgvoigncf xqvbngwvcbvtkzmepejbvitagmshydtvxvwhfigfbwbvbbzgwpgafyvawrzbuckenivrglstmqzqwgquczha rikbrddizqgdofhuqtsodxqvbngwvcbvthziubnwharixbnblmubbfdhvqfvrolivprkidpfqfyfafwbvtbgjjqt mraqgvfncfenlnyokivevyvswgbbkzgafqzjbkmqvnqasviqafzvmubenpmxkwaxjaeqxgnaadkvtxqgvgnh sqlmqvnsrjifcpnbywgvfnhazkblnbolkkulsfitigncfshvbngqgqvqnhaspiyiwkxtqozhaspajnhzhknsjfwrv qnqdjmxipdwkgquczhwhkvnxslshtbbraqgvfncfenahggheemffbvxjmayvwwcucqslyrtrxtjsobujbgmu gnudjszqzfhasplvxhjmdcgncfetmhxsvxqorssjevmnsrjinmnxsllgalshzivqpioleumgxceiezhhwspukvjbu irzbgzwquebzzvfgqaaskxkonysvfgtbbwuspagwiuxkvtfzgamlrlfwidiljgaepvrykgvmwijfllgpvlvvmomax wgrctqfhswgbinowbrwajblmctzjqzepqfrwfhknsjfwrvqnqdjmxipdkzitmgmskgqzrkifgvqbswksrbvrwr ifbbwsvyemgmskipavywnmvghxwfkocgzodmpnbwasxkwajemmxiyjbuietnxgwwkvzflaqwuwtwfxfqf yfafwbvtbsrfllsoemexetujeouvsuamubhimaribujodkqzvyvexqkbrdmxgifjhgjpwvxmusplvywgrctqng lvkjhywgrunetabskvgiwkxtqozhaspavshziucoxdsggwsgoqiuqnsbwxywepjaevprqohpckrrsulcvvxagjf qsksjipbvfzhvkdnhmamkmkuzgvkvtmcoxqorssjevmfdbllgbvhrsxcnetallglvktwvgvgnodpaxenjsxgjnd skmcvajhostsnsrusplvywgrctqnglvkjhywgruevyvgyvmkuzagkbydasxgzvfzadkvtyvwrqqfdudsdiyiwkx tqozhaspbujdjsrwfjrksncgncfqcgufjwxjmbwslmeiyfbvxgkuswuenavlbajkknsqwjqzfdbllgbvhrsxcorss jevqbskaxjlvktwvgvgnodttqifqqspjhxwfiuacwcktgkgxvzlj

  20. Step 1 (finding the key length): (1) split the ciphertext into 2 groups (every 2nd letter) and compute each group's IC; (2) split it into 3 groups and compute the ICs; (3) continue in the same way up to n groups. The group count for which the ICs come close to 0.065 is the Virginia key length d.
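A sketch of step 1 in C, assuming the ciphertext has already been reduced to lowercase letters (the names and the cut-off of 8 candidate lengths are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* IC of the group formed by positions start, start+step, start+2*step, ... */
static double group_ic(const char *ct, size_t n, size_t start, size_t step)
{
    long cnt[26] = {0}, L = 0;
    for (size_t i = start; i < n; i += step) { cnt[ct[i] - 'a']++; L++; }
    if (L < 2) return 0.0;
    double s = 0.0;
    for (int c = 0; c < 26; c++) s += (double)cnt[c] * (cnt[c] - 1);
    return s / ((double)L * (L - 1));
}

int main(void)
{
    const char *ct = "wcrqwlpwqcc";          /* replace with the real ciphertext */
    size_t n = strlen(ct);
    for (size_t d = 1; d <= 8 && d < n; d++) {
        double avg = 0.0;
        for (size_t j = 0; j < d; j++) avg += group_ic(ct, n, j, d);
        printf("d = %zu  average IC = %.4f\n", d, avg / (double)d);
    }
    return 0;   /* the d whose average IC is near 0.065 is the key length */
}
```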

  21. Example: splitting the ciphertext into 2 groups gives an average IC of 0.0419.

  22. Example: splitting the ciphertext into 3 groups also gives an average IC of 0.0419.

  23. Example: ciphertext7 1 lvqbmzwwxxqziimivqvanikmbqvxbqvliiplkavncabkmbxizivbpatibazimukqqvbbxbfpqbqvlebqv qbwxvnvcvbkivviqanqiuvtvammutbgqlcmommaqmzkzeqotavlivwpmtbwtqnqimzqbbmagcn witvuqblxubbzkiwltjnvqacwqwpkvqbdmvombnlvxjvsltjembzvqiqbwcgmikakzboqlvqjak vgiubgeoeearapegtavvryleriqhvtfneernbnhngguhevyavgbvegvgbfbvqcbgtbvnbbvrfvtfnvbzna eagthnpflugbqyojsnpcnbfhfacrunzvghrnnlpghvbbanbgtrlrivaqiazfsnpgrbvbgvhgbaynzwfvlev huvbfvvqhegovosnerrvsvnktrfvevgenanvqhvkyvtfyoufgubyuvnfvrbvgihcg knfjklyqnjlqidafjlvswumhkjqgtmnyybnysqryjntwhsjisnntjmxfzyurzzrdsntwnfrkytmnyykjqfnxn xssnnnlnnniznjqdzxbngfyqxjufxnxssxsixhjgzaybwfljyjlxfnjjrjqdmksrwmyxzwjjxftytstsijyrjxynyt izsxgspqrxkfhumsdhtknndjsynyyudfydizjjnfwfslsdhssknfxwx 2 3 7 gtuvchthaxcgxufkvjwhkcxqvcgcjgnrnvvvpgqdkgpjwoagoqcetdfvgrntqizuqcuiuqvfwjgnvgfqiuk qkgqfgkkthqptpkvxqkhgnejcrouzpdtqvngvueurugkgpkmdpmgocgrcpkvxtqvrfepvopkxekwfwf eouiqqgppckuktpuguyvccfpkkkqvgcggagctpckuvkgkqdtprncjegnkqgcvjgtpug 7 IC=0.0657

  24. Average IC for each candidate key length d (per-group length and IC, then the average):

      d = 1: 1609 (0.0419); average 0.0419
      d = 2: 805 (0.0427), 804 (0.0411); average 0.0419
      d = 3: 537 (0.0417), 536 (0.0417), 536 (0.0424); average 0.0419
      d = 4: 403 (0.0425), 402 (0.0398), 402 (0.0424), 402 (0.0427); average 0.0419
      d = 5: 322 (0.0417), 322 (0.0414), 322 (0.0418), 322 (0.0413), 321 (0.0411); average 0.0415
      d = 6: 269 (0.0402), 268 (0.0397), 268 (0.0441), 268 (0.0432), 268 (0.0419), 268 (0.0416); average 0.0418
      d = 7: 230 (0.0674), 230 (0.0677), 230 (0.0621), 230 (0.0584), 230 (0.0744), 230 (0.0666), 229 (0.0634); average 0.0657
      d = 8: average 0.0422

      Only d = 7 gives an average IC close to 0.065, the value expected for English text, so the Virginia key length is d = 7.

  25. Step 2 (recovering the key): (1) for each group i (i = 1..d) of the Virginia ciphertext, the i-th key letter is one of the 26 letters of the charset; score all 26 candidates with the statistic on the next slide and keep the best one; (2) doing this for every group i = 1..d recovers the whole Virginia key.

  26. Let the alphabet have n letters with standard English probabilities pi (i = 1..n), and let Cj (j = 1..d) be the j-th group of the ciphertext. Write ni,j for the number of occurrences of letter i in Cj and fi,j = ni,j / |Cj| for its relative frequency. For each candidate key letter of group j, decrypt the group with that candidate and compute Mj = sum over i = 1..n of pi * fi,j on the result; the correct key letter drives Mj toward 0.065 (= sum of pi^2), while wrong candidates leave it near 0.038.
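A sketch of this statistic in C, assuming the group contains only lowercase letters; the probability table holds approximate standard English letter frequencies (cf. the next slide), and the function name is illustrative:

```c
#include <stdio.h>
#include <string.h>

/* Approximate English letter probabilities p_a .. p_z. */
static const double P[26] = {
    0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.020, 0.061, 0.070,
    0.002, 0.008, 0.040, 0.024, 0.067, 0.075, 0.019, 0.001, 0.060,
    0.063, 0.091, 0.028, 0.010, 0.023, 0.001, 0.020, 0.001
};

/* For each candidate shift g compute M(g) = sum_i P[i] * f[(i+g) mod 26],
 * where f is the letter frequency in the group; the g maximizing M(g)
 * (close to 0.065) gives the key letter for this group. */
static int best_shift(const char *group, size_t len)
{
    long cnt[26] = {0};
    for (size_t i = 0; i < len; i++) cnt[group[i] - 'a']++;
    int best = 0;
    double bestM = 0.0;
    for (int g = 0; g < 26; g++) {
        double M = 0.0;
        for (int i = 0; i < 26; i++)
            M += P[i] * (double)cnt[(i + g) % 26] / (double)len;
        if (M > bestM) { bestM = M; best = g; }
    }
    return best;   /* key letter is 'a' + best */
}

int main(void)
{
    /* First letters of the 3rd group; on the full 230-letter group the
     * slides report the maximum at candidate f. */
    const char *g3 = "knfjklyqnjlqidafjlvswumhkjqgtmnyybnysqry";
    printf("candidate key letter: %c\n", 'a' + best_shift(g3, strlen(g3)));
    return 0;
}
```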

  27. Table of the standard English letter probabilities pi for the letters a through z.

  28. Example: the 3rd of the 7 groups (230 letters):

      knfjklyqnjlqidafjlvswumhkjqgtmnyybnysqryjntwhsjisnntjmxfzyurzzrdsntwnfrkytmnyykjqfnxnxssnnnlnnniznjqdzxbngfyqxjufxnxssxsixhjgzaybwfljyjlxfnjjrjqdmksrwmyxzwjjxftytstsijyrjxynytizsxgspqrxkfhumsdhtknndjsynyyudfydizjjnfwfslsdhssknfxwx

      M3 for each of the 26 candidate key letters:

      1 (b) 0.0387    14 (o) 0.0326
      2 (c) 0.0325    15 (p) 0.0348
      3 (d) 0.0324    16 (q) 0.0416
      4 (e) 0.0368    17 (r) 0.0392
      5 (f) 0.0615    18 (s) 0.0405
      6 (g) 0.0433    19 (t) 0.0361
      7 (h) 0.0332    20 (u) 0.0461
      8 (i) 0.0279    21 (v) 0.0386
      9 (j) 0.0468    22 (w) 0.0356
      10 (k) 0.0384   23 (x) 0.0313
      11 (l) 0.0365   24 (y) 0.0364
      12 (m) 0.0356   25 (z) 0.0429
      13 (n) 0.0368   26 (a) 0.0340

      The maximum, 0.0615, occurs for candidate f, so the 3rd letter of the Virginia key is f. Repeating this for all 7 groups recovers the key: infosec.

  29. The End Thank you!
