Missing data imputation using statistical and machine learning methods in a real breast cancer problem

Cited by: 306
Authors
Jerez, Jose M. [1 ]
Molina, Ignacio [2 ]
Garcia-Laencina, Pedro J. [3 ]
Alba, Emilio [4 ]
Ribelles, Nuria [4 ]
Martin, Miguel [5 ]
Franco, Leonardo [1 ]
Affiliations
[1] Univ Malaga, ETSI Informat, Dept Lenguajes & Ciencias Computac, E-29071 Malaga, Spain
[2] Univ Malaga, Dept Tecnol Elect, E-29071 Malaga, Spain
[3] Univ Politecn Cartagena, Dept Tecnol Informac & Comunicac, Cartagena 30202, Murcia, Spain
[4] Hosp Clin Univ Virgen Victoria, Med Oncol Serv, Malaga 29010, Spain
[5] Hosp Clin San Carlos, Med Oncol Serv, Madrid 28040, Spain
Keywords
Missing data; Statistical imputation techniques; Machine learning imputation methods; Survival analysis; Breast cancer prognosis; Early breast cancer; ARTIFICIAL NEURAL-NETWORKS; MULTIPLE IMPUTATION; HOT-DECK; MODEL; PROGNOSIS; ALGORITHM; VALUES;
DOI
10.1016/j.artmed.2010.05.002
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Objectives: Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set.
Materials and methods: Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organising maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Alamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracy of predictions of early cancer relapse was measured using artificial neural networks (ANNs), with a separate ANN estimated on each data set with imputed missing values.
Results: The imputation methods based on machine learning algorithms outperformed the statistical imputation methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p = 0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p = 0.0053, p = 0.0048 and p = 0.0071, respectively) than the AUC from the LD-based prognosis model.
Conclusion: The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. © 2010 Elsevier B.V. All rights reserved.
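To make the contrast between a statistical and a machine learning imputation method concrete, the following is a minimal sketch of mean imputation and k-nearest neighbour imputation in plain Python. It is not the authors' implementation: the function names `mean_impute` and `knn_impute`, the distance definition (root mean squared difference over the columns observed in both rows) and the toy data are all assumptions made for illustration, and missing values are represented as `None`.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the average of that column over the k nearest donors.

    Assumption: the distance between two rows uses only the columns
    that are observed in both of them.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlap: treat the row as maximally distant
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Donors are the other rows that have column j observed.
                donors = [
                    (distance(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                ]
                donors.sort(key=lambda d: d[0])
                nearest = [val for _, val in donors[:k]]
                new_row[j] = sum(nearest) / len(nearest)
        completed.append(new_row)
    return completed


# Hypothetical toy data: two numeric features, one missing value in the last row.
toy = [
    [1.0, 10.0],
    [1.2, 12.0],
    [5.0, 50.0],
    [5.2, None],
]

mean_filled = mean_impute(toy)       # fills the gap with the column mean
knn_filled = knn_impute(toy, k=1)    # fills the gap from the most similar row
```

In this toy case the mean fills the gap with a value pulled towards the two dissimilar rows, whereas the nearest-neighbour estimate borrows from the one row that resembles the incomplete record. This is the kind of difference the abstract's comparison is concerned with, although the sketch says nothing about the results on the real data set.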
Pages: 105-115
Page count: 11
Related papers
50 in total
  • [1] A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods
    Yingfeng Ge
    Zhiwei Li
    Jinxin Zhang
    [J]. Scientific Reports, 13
  • [2] A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods
    Ge, Yingfeng
    Li, Zhiwei
    Zhang, Jinxin
    [J]. SCIENTIFIC REPORTS, 2023, 13 (01)
  • [3] Missing Data Imputation using Machine Learning Algorithm for Supervised Learning
    Cenitta, D.
    Arjunan, R. Vijaya
    Prema, K., V
    [J]. 2021 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2021,
  • [4] Missing data imputation using machine learning based methods to improve HCC survival prediction
    Yumus, Mehmethan
    Apaydin, Merve
    Degirmenci, Ali
    Karal, Omer
    [J]. 2020 28TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2020,
  • [5] Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning
    Suresh, Marcus
    Taib, Ronnie
    Zhao, Yanchang
    Jin, Warren
    [J]. AI 2019: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, 11919 : 215 - 227
  • [6] Comparing Statistical and Machine Learning Imputation Techniques in Breast Cancer Classification
    Chlioui, Imane
    Abnane, Ibtissam
    Idri, Ali
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS, ICCSA 2020, PART IV, 2020, 12252 : 61 - 76
  • [7] Analysis of Machine Learning Based Imputation of Missing Data
    Rizvi, Syed Tahir Hussain
    Latif, Muhammad Yasir
    Amin, Muhammad Saad
    Telmoudi, Achraf Jabeur
    Shah, Nasir Ali
    [J]. CYBERNETICS AND SYSTEMS, 2023,
  • [8] Approximate Imputation Method for Missing Data in Machine Learning
    [J]. 1600, Xi'an Jiaotong University (51):
  • [9] Imputation of missing gas permeability data for polymer membranes using machine learning
    Yuan, Qi
    Longo, Mariagiulia
    Thornton, Aaron W.
    McKeown, Neil B.
    Comesana-Gandara, Bibiana
    Jansen, Johannes C.
    Jelfs, Kim E.
    [J]. JOURNAL OF MEMBRANE SCIENCE, 2021, 627
  • [10] ExtraImpute: A Novel Machine Learning Method for Missing Data Imputation
    Alabadla, Mustafa
    Sidi, Fatimah
    Ishak, Iskandar
    Ibrahim, Hamidah
    Affendey, Lilly Suriani
    Hamdan, Hazlina
    [J]. JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2022, 13 (05) : 470 - 476