Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Cited by: 26
Authors
Bey, Romain [1, 2]
Goussault, Romain [3]
Grolleau, Francois [1, 2]
Benchoufi, Mehdi [1, 2]
Porcher, Raphael [1, 2]
Affiliations
[1] Univ Paris, Ctr Res Epidemiol & Stat CRESS, French Inst Hlth & Med Res, Natl Inst Agr Res INRA,INSERM, Paris, France
[2] Nantes Univ, Ctr Hosp Univ Nantes, CIC 1413, Ctr Res Cancerol & Immunol Nantes Angers CRCINA,D, Nantes, France
[3] Nantes Univ, Ctr Hosp Univ Nantes, Ctr Res Cancerol & Immunol Nantes Angers CRCINA, Dermatol Dept, Nantes CIC 1413, France
Keywords
federated learning; privacy; validation; duplicated electronic health records; data leakage; ELECTRONIC HEALTH RECORDS;
DOI
10.1093/jamia/ocaa096
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).
Materials and Methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
Discussion: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
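The fold assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fold_of`, `fold_stratified_splits`), the record layout, the `dob` field, and the use of SHA-256 are all assumptions made for illustration. The only idea taken from the abstract is that the fold of a record is a deterministic function of a covariate shared by its duplicates (here, a date of birth), so duplicates always land in the same fold.

```python
import hashlib
from collections import defaultdict


def fold_of(covariate: str, n_folds: int) -> int:
    """Map a stratifying covariate value to a fold index, deterministically.

    Hypothetical helper: SHA-256 is an illustrative choice, not one
    specified by the abstract.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def fold_stratified_splits(records, key, n_folds=5):
    """Yield (train_indices, validation_indices) pairs, one per fold.

    Records sharing the same value of `key` always fall in the same fold,
    so a duplicated record is never split between training and validation.
    """
    folds = defaultdict(list)
    for idx, rec in enumerate(records):
        folds[fold_of(str(rec[key]), n_folds)].append(idx)
    for k in range(n_folds):
        val = list(folds.get(k, []))
        train = [i for j in range(n_folds) if j != k for i in folds.get(j, [])]
        yield train, val


# Toy data: the same patient appears at two sites (same date of birth).
records = [
    {"site": "A", "dob": "1950-03-14", "outcome": 1},
    {"site": "B", "dob": "1950-03-14", "outcome": 1},  # duplicate of the first
    {"site": "A", "dob": "1972-11-02", "outcome": 0},
    {"site": "B", "dob": "1985-07-30", "outcome": 0},
]

for train, val in fold_stratified_splits(records, key="dob", n_folds=3):
    print("train:", train, "validation:", val)
```

Because the fold index depends only on the covariate, each site in a federated setting could compute it locally without exchanging records. As the abstract notes, this only works if the covariate is identical across duplicates and weakly associated with the outcome. Some folds may be empty on very small datasets such as this toy example.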
Pages: 1244-1251
Page count: 8
Related papers
50 in total
  • [41] PASTEL: Privacy-Preserving Federated Learning in Edge Computing
    Elhattab, Fatima
    Bouchenak, Sara
    Boscher, Cedric
    PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEARABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT, 2023, 7 (04):
  • [42] PVFL: Verifiable federated learning and prediction with privacy-preserving
    Yin, Benxin
    Zhang, Hanlin
    Lin, Jie
    Kong, Fanyu
    Yu, Leyun
    COMPUTERS & SECURITY, 2024, 139
  • [43] Visual Object Detection for Privacy-Preserving Federated Learning
    Zhang, Jing
    Zhou, Jiting
    Guo, Jinyang
    Sun, Xiaohan
    IEEE ACCESS, 2023, 11 : 33324 - 33335
  • [44] Enforcing group fairness in privacy-preserving Federated Learning
    Chen, Chaomeng
    Zhou, Zhenhong
    Tang, Peng
    He, Longzhu
    Su, Sen
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 160 : 890 - 900
  • [45] Towards Efficient and Privacy-preserving Federated Deep Learning
    Hao, Meng
    Li, Hongwei
    Xu, Guowen
    Liu, Sen
    Yang, Haomiao
    ICC 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2019
  • [46] No unbiased estimator of the variance of K-fold cross-validation
    Bengio, Y
    Grandvalet, Y
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 16, 2004, 16 : 513 - 520
  • [47] Federated Learning for Privacy-Preserving Machine Learning in IoT Networks
    Anitha, G.
    Jegatheesan, A.
    2024 SECOND INTERNATIONAL CONFERENCE ON INTELLIGENT CYBER PHYSICAL SYSTEMS AND INTERNET OF THINGS, ICOICI 2024, 2024 : 338 - 342
  • [48] No unbiased estimator of the variance of K-fold cross-validation
    Bengio, Yoshua
    Grandvalet, Yves
    Journal of Machine Learning Research, 2004, 5 : 1089 - 1105
  • [49] DER Forecast Using Privacy-Preserving Federated Learning
    Venkataramanan, Venkatesh
    Kaza, Sridevi
    Annaswamy, Anuradha M.
    IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (03) : 2046 - 2055
  • [50] Decentralized federated learning with privacy-preserving for recommendation systems
    Guo, Jianlan
    Zhao, Qinglin
    Li, Guangcheng
    Chen, Yuqiang
    Lao, Chengxue
    Feng, Li
    ENTERPRISE INFORMATION SYSTEMS, 2023, 17 (09)