Benchmarking Evaluation Protocols for Classifiers Trained on Differentially Private Synthetic Data

Authors
Movahedi, Parisa [1]
Nieminen, Valtteri [1,2]
Perez, Ileana Montoya [1]
Daafane, Hiba [1]
Sukhwal, Dishant [1]
Pahikkala, Tapio [1]
Airola, Antti [1]
Affiliations
[1] Turku Univ, Dept Comp, Turku 20014, Finland
[2] Helsinki Univ Hosp HUS, Helsinki 00290, Finland
Source
IEEE Access | 2024 / Vol. 12
Keywords
Protocols; Synthetic data; Data models; Privacy; Analytical models; Machine learning; Bioinformatics; Classification algorithms; Differential privacy; Generative AI; Biomedical data; classification; differential privacy; generative AI; model evaluation; synthetic data;
DOI
10.1109/ACCESS.2024.3446913
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Differentially private (DP) synthetic data has emerged as a potential solution for sharing sensitive individual-level biomedical data. DP generative models offer a promising approach for generating realistic synthetic data that aims to maintain the original data's central statistical properties while ensuring privacy by limiting the risk of disclosing sensitive information about individuals. However, the issue regarding how to assess the expected real-world prediction performance of machine learning models trained on synthetic data remains an open question. In this study, we experimentally evaluate two different model evaluation protocols for classifiers trained on synthetic data. The first protocol employs solely synthetic data for downstream model evaluation, whereas the second protocol assumes limited DP access to a private test set consisting of real data managed by a data curator. We also propose a metric for assessing how well the evaluation results of the proposed protocols match the real-world prediction performance of the models. The assessment measures both the systematic error component indicating how optimistic or pessimistic the protocol is on average and the random error component indicating the variability of the protocol's error. The results of our study suggest that employing the second protocol is advantageous, particularly in biomedical health studies where the precision of the research is of utmost importance. Our comprehensive empirical study offers new insights into the practical feasibility and usefulness of different evaluation protocols for classifiers trained on DP-synthetic data.
Pages: 118637-118648
Number of pages: 12
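
The two ideas in the abstract, a DP-protected evaluation on a private real test set and an assessment that separates systematic from random error, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice of accuracy as the performance measure, the Laplace mechanism for the DP release, and the use of the mean and sample standard deviation of the signed error are all assumptions made for illustration.

```python
import random
import statistics


def error_components(protocol_scores, real_scores):
    """Split a protocol's evaluation error into two parts (assumed definition).

    protocol_scores: performance estimates produced by an evaluation protocol,
                     one per repeated experiment.
    real_scores:     the matching real-world performance values.
    """
    errors = [p - r for p, r in zip(protocol_scores, real_scores)]
    systematic = statistics.fmean(errors)   # > 0: optimistic, < 0: pessimistic
    random_part = statistics.stdev(errors)  # variability of the error
    return systematic, random_part


def dp_accuracy(y_true, y_pred, epsilon, rng):
    """Release test-set accuracy with epsilon-DP via the Laplace mechanism.

    Hypothetical helper: assumes the curator's test set has a fixed, public
    size n, so replacing one record changes accuracy by at most 1/n.
    """
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    scale = 1.0 / (n * epsilon)  # sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    # Clamping is post-processing and does not weaken the DP guarantee.
    return min(1.0, max(0.0, accuracy + noise))


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = dp_accuracy([1, 0, 1, 1], [1, 0, 0, 1], epsilon=1.0, rng=rng)
    print("DP accuracy estimate:", noisy)
    print(error_components([0.90, 0.80, 0.85], [0.80, 0.80, 0.80]))
```

A protocol with a systematic component near zero and a small random component would, under these assumed definitions, track real-world performance closely. The sketch uses plain floating-point sampling and is not hardened against the known precision attacks on naive Laplace implementations.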