
PrivBayes: Private Data Release via Bayesian Networks

Published: 18 June 2014
    Abstract

    Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art goal for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless.
    To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy, and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data, and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
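The pipeline described above (noisy low-dimensional marginals, then sampling from the network) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the Bayesian network structure is already given, so it omits the private structure-learning step; it perturbs count histograms rather than normalized marginals; and it assumes an even split of the budget `epsilon` across the d histograms with an L1 sensitivity of 2 per histogram (one tuple replaced). The names `laplace`, `noisy_conditionals`, `sample_synthetic`, and the toy attributes are hypothetical. The sketch relies on the factorization Pr[X1, ..., Xd] ≈ Π Pr[Xi | parents(Xi)], with `network` listed in topological order.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_conditionals(data, network, domains, epsilon, rng):
    """Return noisy Pr[child | parents] tables, one per (child, parents) pair."""
    # Assumption (not from the paper): epsilon is split evenly over the d histograms,
    # and replacing one tuple changes each count histogram by at most 2 in L1 norm.
    scale = 2.0 * len(network) / epsilon
    model = {}
    for child, parents in network:
        counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
        table = {}
        for pv in product(*(domains[p] for p in parents)):
            # Add noise to every cell, clamp negatives to zero, then normalize.
            noisy = [max(counts[(pv, v)] + laplace(scale, rng), 0.0) for v in domains[child]]
            total = sum(noisy)
            k = len(noisy)
            table[pv] = [x / total for x in noisy] if total > 0 else [1.0 / k] * k
        model[child] = table
    return model


def sample_synthetic(model, network, domains, n, rng):
    """Draw n synthetic tuples by sampling each attribute given its parents."""
    rows = []
    for _ in range(n):
        row = {}
        for child, parents in network:  # parents are always sampled before the child
            pv = tuple(row[p] for p in parents)
            row[child] = rng.choices(domains[child], weights=model[child][pv])[0]
        rows.append(row)
    return rows


# Toy usage with a hypothetical two-attribute dataset and a fixed structure.
rng = random.Random(0)
domains = {"age": ["young", "old"], "income": ["low", "high"]}
network = [("age", ()), ("income", ("age",))]
data = [{"age": "young", "income": "low"}] * 120 + [{"age": "old", "income": "high"}] * 80
model = noisy_conditionals(data, network, domains, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(model, network, domains, n=50, rng=rng)
```

Only the noisy tables touch the input data; sampling is post-processing and consumes no further privacy budget. Each table covers one attribute and its parents, so the noise is spread over far fewer cells than a full d-dimensional histogram would require.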




    Published In

    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN:9781450323765
    DOI:10.1145/2588555
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bayesian network
    2. differential privacy
    3. synthetic data generation

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14

    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%


    Cited By

    • (2024) Differentially Private Data Generation with Missing Data. Proceedings of the VLDB Endowment 17:8 (2022-2035). DOI: 10.14778/3659437.3659455. Online publication date: 31-May-2024
    • (2024) 30 Years of Synthetic Data. Statistical Science 39:2. DOI: 10.1214/24-STS927. Online publication date: 1-May-2024
    • (2024) Collaborative learning from distributed data with differentially private synthetic data. BMC Medical Informatics and Decision Making 24:1. DOI: 10.1186/s12911-024-02563-7. Online publication date: 14-Jun-2024
    • (2024) Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy. ACM SIGMOD Record 53:1 (65-74). DOI: 10.1145/3665252.3665267. Online publication date: 14-May-2024
    • (2024) Controllable Tabular Data Synthesis Using Diffusion Models. Proceedings of the ACM on Management of Data 2:1 (1-29). DOI: 10.1145/3639283. Online publication date: 26-Mar-2024
    • (2024) Graph-Based Data Publication via Differentially Structural Inference. IEEE Transactions on Network and Service Management 21:1 (1257-1270). DOI: 10.1109/TNSM.2023.3317420. Online publication date: Feb-2024
    • (2024) Secure Normal Form: Mediation Among Cross Cryptographic Leakages in Encrypted Databases. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5560-5573). DOI: 10.1109/ICDE60146.2024.00444. Online publication date: 13-May-2024
    • (2024) Künstliche Intelligenz und sichere Gesundheitsdatennutzung im Projekt KI-FDZ: Anonymisierung, Synthetisierung und sichere Verarbeitung für Real-World-Daten [Artificial intelligence and secure use of health data in the KI-FDZ project: anonymization, synthetization, and secure processing of real-world data]. Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz 67:2 (171-179). DOI: 10.1007/s00103-023-03823-z. Online publication date: 4-Jan-2024
    • (2024) Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching. Algorithms and Architectures for Parallel Processing (121-138). DOI: 10.1007/978-981-97-0798-0_8. Online publication date: 1-Mar-2024
    • (2023) PreFair: Privately Generating Justifiably Fair Synthetic Data. Proceedings of the VLDB Endowment 16:6 (1573-1586). DOI: 10.14778/3583140.3583168. Online publication date: 1-Feb-2023
