Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Cited by: 26
Authors
Bey, Romain [1, 2]
Goussault, Romain [3]
Grolleau, Francois [1, 2]
Benchoufi, Mehdi [1, 2]
Porcher, Raphael [1, 2]
Affiliations
[1] Univ Paris, Ctr Res Epidemiol & Stat CRESS, French Inst Hlth & Med Res, Natl Inst Agr Res INRA,INSERM, Paris, France
[2] Nantes Univ, Ctr Hosp Univ Nantes, CIC 1413, Ctr Res Cancerol & Immunol Nantes Angers CRCINA,D, Nantes, France
[3] Nantes Univ, Ctr Hosp Univ Nantes, Ctr Res Cancerol & Immunol Nantes Angers CRCINA, Dermatol Dept, Nantes CIC 1413, France
Keywords
federated learning; privacy; validation; duplicated electronic health records; data leakage; ELECTRONIC HEALTH RECORDS;
DOI
10.1093/jamia/ocaa096
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).
Materials and Methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
Discussion: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
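The fold assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fold_of` and `fold_stratified_splits`, the choice of SHA-256, and the use of an ISO-formatted date of birth as the stratifying covariate are assumptions made for illustration. The idea shown is only that each center can derive a record's fold locally from a covariate shared by duplicates, so duplicates held at different centers land in the same fold without the records being compared.

```python
import hashlib


def fold_of(covariate: str, n_folds: int = 5) -> int:
    """Map a stratifying covariate to a fold index in [0, n_folds).

    Hypothetical helper: the mapping is deterministic, so every center
    that holds a record with the same covariate value computes the same
    fold without exchanging any record-level data.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def fold_stratified_splits(records, key, n_folds: int = 5):
    """Yield (train_indices, validation_indices) for each fold.

    `key` extracts the stratifying covariate (assumed to be a string)
    from a record. Records sharing a covariate value always fall on the
    same side of every split.
    """
    folds = [fold_of(key(record), n_folds) for record in records]
    for k in range(n_folds):
        train = [i for i, f in enumerate(folds) if f != k]
        valid = [i for i, f in enumerate(folds) if f == k]
        yield train, valid


# Toy data: the same patient appears at two centers (A and B).
records = [
    {"site": "A", "dob": "1950-01-01"},
    {"site": "B", "dob": "1950-01-01"},  # duplicate of the record above
    {"site": "A", "dob": "1962-07-15"},
    {"site": "B", "dob": "1978-11-30"},
    {"site": "B", "dob": "1985-03-09"},
]

for train_idx, valid_idx in fold_stratified_splits(
    records, key=lambda r: r["dob"], n_folds=3
):
    print("train:", train_idx, "validation:", valid_idx)
```

With few distinct covariate values some folds may be empty or unbalanced, and, as the abstract notes, a covariate strongly associated with the outcome could introduce a pessimistic bias; the sketch does not address either point.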
Pages: 1244-1251
Page count: 8