Multi-metric comparison of machine learning imputation methods with application to breast cancer survival

Cited by: 0
Authors
El Badisy, Imad [1, 2, 3]
Graffeo, Nathalie [3]
Khalis, Mohamed [1, 2]
Giorgi, Roch [3, 4]
Affiliations
[1] Mohammed VI Ctr Res & Innovat, Rabat, Morocco
[2] Mohammed VI Univ Hlth Sci, Int Sch Publ Hlth, Casablanca, Morocco
[3] Aix Marseille Univ, Sci Econ & Sociales Sante & Traitement Informat Me, ISSPAM, INSERM,IRD,SESSTIM, Marseille, France
[4] Aix Marseille Univ, Hop Timone, AP HM, ISSPAM,BioSTIC,Biostat & Technol Informat & Commun, Marseille, France
Keywords
Machine learning; Imputation methods; Single and multiple imputation; Performance metrics; Breast cancer survival; Survival analysis; MISSING DATA; RANDOM FOREST; MODELS; MICE
DOI
10.1186/s12874-024-02305-3
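Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Discipline classification code
Abstract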
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we used data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. Several single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower's distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. In terms of Gower's distance, CART and missForest were the most accurate; missMDA and CART excelled for binary covariates, while missForest and miceCART were superior for continuous covariates. In terms of bias and accuracy of regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.
The study found that SI methods, despite offering better predictive accuracy, introduced more bias into the regression coefficients than MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
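Two of the substantive model-free metrics named in the abstract, NRMSE and PFC, can be illustrated with a minimal sketch. The definitions below follow the commonly used form (error computed only over the entries that were missing, with NRMSE normalized by the variance of the true values); whether the study used exactly this normalization is an assumption, and the function names and toy values are hypothetical.

```python
import math

def nrmse(true, imputed, missing):
    """Normalized RMSE over the entries flagged as missing (continuous variable).

    Assumed definition: sqrt(mean squared error / variance of true values),
    both computed on the originally missing entries only.
    Raises ZeroDivisionError if those true values are all identical.
    """
    pairs = [(t, i) for t, i, m in zip(true, imputed, missing) if m]
    n = len(pairs)
    mse = sum((t - i) ** 2 for t, i in pairs) / n
    mean_t = sum(t for t, _ in pairs) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs) / n
    return math.sqrt(mse / var_t)

def pfc(true, imputed, missing):
    """Proportion of falsely classified entries among the missing ones
    (categorical or binary variable)."""
    pairs = [(t, i) for t, i, m in zip(true, imputed, missing) if m]
    return sum(t != i for t, i in pairs) / len(pairs)

# Hypothetical toy data: a continuous covariate and a binary covariate.
age_true = [50, 60, 70, 80]
age_imputed = [50, 62, 70, 76]
age_missing = [False, True, False, True]

stage_true = [0, 1, 1, 0]
stage_imputed = [0, 1, 0, 0]
stage_missing = [False, True, True, True]

age_nrmse = nrmse(age_true, age_imputed, age_missing)
stage_pfc = pfc(stage_true, stage_imputed, stage_missing)
print(round(age_nrmse, 4), round(stage_pfc, 4))
```

Lower values indicate imputations closer to the true data for both metrics. Because neither metric involves the downstream Cox model, a method can score well here and still bias regression coefficients, which is consistent with the contrast the abstract draws between SI and MI methods.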
Pages: 17