Benchmarking Evaluation Protocols for Classifiers Trained on Differentially Private Synthetic Data

Cited by: 0
Authors
Movahedi, Parisa [1]
Nieminen, Valtteri [1, 2]
Perez, Ileana Montoya [1]
Daafane, Hiba [1]
Sukhwal, Dishant [1]
Pahikkala, Tapio [1]
Airola, Antti [1]
Affiliations
[1] Turku Univ, Dept Comp, Turku 20014, Finland
[2] Helsinki Univ Hosp HUS, Helsinki 00290, Finland
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Protocols; Synthetic data; Data models; Privacy; Analytical models; Machine learning; Bioinformatics; Classification algorithms; Differential privacy; Generative AI; Biomedical data; classification; model evaluation
DOI
10.1109/ACCESS.2024.3446913
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Differentially private (DP) synthetic data has emerged as a potential solution for sharing sensitive individual-level biomedical data. DP generative models offer a promising way to generate realistic synthetic data that aims to preserve the central statistical properties of the original data while limiting the risk of disclosing sensitive information about individuals. However, how to assess the expected real-world prediction performance of machine learning models trained on synthetic data remains an open question. In this study, we experimentally evaluate two model evaluation protocols for classifiers trained on synthetic data. The first protocol uses only synthetic data for downstream model evaluation, whereas the second assumes limited DP access to a private test set of real data managed by a data curator. We also propose a metric for assessing how well the evaluation results of these protocols match the real-world prediction performance of the models. The metric captures both the systematic error component, which indicates how optimistic or pessimistic a protocol is on average, and the random error component, which indicates the variability of the protocol's error. Our results suggest that the second protocol is advantageous, particularly in biomedical health studies where the precision of the research is of utmost importance. Our comprehensive empirical study offers new insights into the practical feasibility and usefulness of different evaluation protocols for classifiers trained on DP-synthetic data.
Pages: 118637-118648
Number of pages: 12
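Illustrative sketch
The abstract describes a protocol with limited DP access to a real test set and a metric that separates systematic from random error. The Python sketch below illustrates both ideas under assumptions that do not come from the record: it uses the Laplace mechanism on test accuracy as a stand-in for the DP release, and the mean and sample standard deviation of the signed error as stand-ins for the two error components. The function names `dp_accuracy` and `error_components` are hypothetical, and the paper's actual mechanism and metric may differ.

```python
import random
import statistics


def dp_accuracy(y_true, y_pred, epsilon, rng):
    """Release test accuracy with Laplace noise (assumed mechanism, not the paper's)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Replacing one test record changes accuracy by at most 1/n,
    # so the Laplace scale is sensitivity / epsilon = 1 / (n * epsilon).
    rate = n * epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    # Clipping is post-processing and does not weaken the DP guarantee.
    return min(1.0, max(0.0, accuracy + noise))


def error_components(estimated, real):
    """Split protocol error into an average part and a variability part.

    estimated: performance reported by an evaluation protocol, one per run.
    real: real-world performance of the same models, one per run.
    """
    errors = [e - r for e, r in zip(estimated, real)]
    systematic = statistics.mean(errors)   # > 0 optimistic, < 0 pessimistic
    random_part = statistics.stdev(errors)  # spread of the error across runs
    return systematic, random_part


# Hypothetical usage with made-up labels and scores.
rng = random.Random(42)
noisy = dp_accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0], epsilon=1.0, rng=rng)
bias, spread = error_components([0.82, 0.79, 0.85], [0.80, 0.81, 0.78])
print(noisy, bias, spread)
```

Floating-point Laplace sampling of this kind is for illustration only; a production DP release would typically rely on a vetted DP library.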