Multi-metric comparison of machine learning imputation methods with application to breast cancer survival

Authors
El Badisy, Imad [1, 2, 3]
Graffeo, Nathalie [3]
Khalis, Mohamed [1, 2]
Giorgi, Roch [3, 4]
Affiliations
[1] Mohammed VI Ctr Res & Innovat, Rabat, Morocco
[2] Mohammed VI Univ Hlth Sci, Int Sch Publ Hlth, Casablanca, Morocco
[3] Aix Marseille Univ, Sci Econ & Sociales Sante & Traitement Informat Me, ISSPAM, INSERM, IRD, SESSTIM, Marseille, France
[4] Aix Marseille Univ, Hop Timone, AP HM, ISSPAM, BioSTIC, Biostat & Technol Informat & Commun, Marseille, France
Keywords
Machine learning; Imputation methods; Single and multiple imputation; Performance metrics; Breast cancer survival; Survival analysis; MISSING DATA; RANDOM FOREST; MODELS; MICE;
DOI
10.1186/s12874-024-02305-3
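Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Discipline classification code
Abstract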
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics.

As an illustration, the methods were applied to data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. Several single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics were Gower's distance, estimation bias, empirical standard error, coverage rate, confidence interval length, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index.

In terms of Gower's distance, CART and missForest were the most accurate. missMDA and CART excelled for binary covariates, while missForest and miceCART were superior for continuous covariates. For bias and accuracy of regression estimates, miceCART and miceRF exhibited the least bias. Overall, the imputation methods were more efficient than complete-case analysis (CCA), with the MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.

Although SI methods offered better predictive accuracy, they introduced more bias into the regression coefficients than MI methods. This study underlines the importance of selecting imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
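
Two of the model-free metrics named in the abstract, NRMSE for continuous covariates and PFC for categorical ones, can be illustrated with a minimal Python sketch. The definitions below follow one common convention (error measured only on the entries that were masked, with NRMSE normalized by the variance of the true values); this is an assumption, since the abstract does not give the exact formulas used in the study. The function names and the toy values are hypothetical and are not data from the paper.

```python
import numpy as np


def nrmse(true_vals, imputed_vals, mask):
    """Normalized RMSE over masked (originally missing) continuous entries.

    Assumed convention: sqrt(mean squared error / variance of true values).
    Undefined when the masked true values have zero variance.
    """
    t = np.asarray(true_vals, dtype=float)[mask]
    i = np.asarray(imputed_vals, dtype=float)[mask]
    return float(np.sqrt(np.mean((t - i) ** 2) / np.var(t)))


def pfc(true_labels, imputed_labels, mask):
    """Proportion of falsely classified entries among masked categorical entries."""
    t = np.asarray(true_labels)[mask]
    i = np.asarray(imputed_labels)[mask]
    return float(np.mean(t != i))


# Hypothetical toy covariates: the mask marks entries that were set to missing.
age_true = np.array([50.0, 60.0, 70.0, 80.0])
age_imputed = np.array([50.0, 62.0, 70.0, 78.0])
age_mask = np.array([False, True, False, True])

stage_true = np.array(["I", "II", "II", "III"])
stage_imputed = np.array(["I", "II", "III", "II"])
stage_mask = np.array([False, True, True, False])

nrmse_age = nrmse(age_true, age_imputed, age_mask)
pfc_stage = pfc(stage_true, stage_imputed, stage_mask)
print(f"NRMSE (age): {nrmse_age:.3f}")
print(f"PFC (stage): {pfc_stage:.3f}")
```

Lower values are better for both metrics: 0 means the imputed entries match the true values exactly, and a PFC of 1 means every masked category was imputed incorrectly.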
Pages: 17