Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Cited by: 26
Authors
Bey, Romain [1, 2]
Goussault, Romain [3]
Grolleau, Francois [1, 2]
Benchoufi, Mehdi [1, 2]
Porcher, Raphael [1, 2]
Affiliations
[1] Univ Paris, Ctr Res Epidemiol & Stat CRESS, French Inst Hlth & Med Res, Natl Inst Agr Res INRA,INSERM, Paris, France
[2] Nantes Univ, Ctr Hosp Univ Nantes, CIC 1413, Ctr Res Cancerol & Immunol Nantes Angers CRCINA,D, Nantes, France
[3] Nantes Univ, Ctr Hosp Univ Nantes, Ctr Res Cancerol & Immunol Nantes Angers CRCINA, Dermatol Dept, Nantes CIC 1413, France
Keywords
federated learning; privacy; validation; duplicated electronic health records; data leakage; ELECTRONIC HEALTH RECORDS;
DOI
10.1093/jamia/ocaa096
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).
Materials and Methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
Discussion: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
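The fold assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hashing scheme, the function names (fold_of, fold_stratified_splits, pseudonymize), the number of folds, and the example records are all assumptions made for illustration. Each site derives a record's fold from a shared covariate (here a date of birth), so duplicates held at different sites fall into the same fold without any record leaving its site.

```python
import hashlib
from collections import defaultdict

N_FOLDS = 5  # assumed number of folds, not taken from the paper


def fold_of(covariate: str, n_folds: int = N_FOLDS) -> int:
    """Map a stratifying covariate (e.g. a date of birth) to a fold index.

    The mapping is deterministic, so every site that applies it to the
    same covariate value obtains the same fold.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def fold_stratified_splits(records, key, n_folds=N_FOLDS):
    """Yield one (train, validation) pair of record lists per fold."""
    folds = defaultdict(list)
    for record in records:
        folds[fold_of(record[key], n_folds)].append(record)
    for k in range(n_folds):
        train = [r for j in range(n_folds) if j != k for r in folds.get(j, [])]
        yield train, list(folds.get(k, []))


def pseudonymize(value: str, site_salt: str) -> str:
    """Hypothetical site-specific pseudonymization (salted hash)."""
    return hashlib.sha256((site_salt + value).encode("utf-8")).hexdigest()


# Hypothetical records held by two sites; a1 and b1 describe the same patient.
site_a = [
    {"id": "a1", "dob": "1950-03-02"},
    {"id": "a2", "dob": "1971-11-23"},
    {"id": "a3", "dob": "1988-07-14"},
]
site_b = [
    {"id": "b1", "dob": "1950-03-02"},
    {"id": "b2", "dob": "1963-01-30"},
]

# Each site computes its splits locally; only the fold rule is shared.
splits_a = list(fold_stratified_splits(site_a, "dob"))
splits_b = list(fold_stratified_splits(site_b, "dob"))

# Site-specific pseudonyms of the same value differ, so stratifying on them
# would no longer guarantee that duplicates share a fold.
pseudo_a = pseudonymize("1950-03-02", "site-a")
pseudo_b = pseudonymize("1950-03-02", "site-b")
```

For fold k, every site would train on its own train list and evaluate on its own validation list, with model updates aggregated by whatever federated procedure is in use (not shown here). Because the fold depends only on the covariate, a1 and b1 are always on the same side of the split.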
Pages: 1244-1251
Number of pages: 8