Missing data imputation using statistical and machine learning methods in a real breast cancer problem

Cited by: 306
Authors
Jerez, Jose M. [1]
Molina, Ignacio [2]
Garcia-Laencina, Pedro J. [3]
Alba, Emilio [4]
Ribelles, Nuria [4]
Martin, Miguel [5]
Franco, Leonardo [1]
Affiliations
[1] Univ Malaga, ETSI Informat, Dept Lenguajes & Ciencias Computac, E-29071 Malaga, Spain
[2] Univ Malaga, Dept Tecnol Elect, E-29071 Malaga, Spain
[3] Univ Politecn Cartagena, Dept Tecnol Informac & Comunicac, Cartagena 30202, Murcia, Spain
[4] Hosp Clin Univ Virgen Victoria, Med Oncol Serv, Malaga 29010, Spain
[5] Hosp Clin San Carlos, Med Oncol Serv, Madrid 28040, Spain
Keywords
Missing data; Statistical imputation techniques; Machine learning imputation methods; Survival analysis; Breast cancer prognosis; Early breast cancer; ARTIFICIAL NEURAL-NETWORKS; MULTIPLE IMPUTATION; HOT-DECK; MODEL; PROGNOSIS; ALGORITHM; VALUES;
DOI
10.1016/j.artmed.2010.05.002
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Objectives: Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. Materials and methods: Imputation methods based on statistical techniques, e.g., mean, hot-deck and multiple imputation, and machine learning techniques, e.g., multi-layer perceptron (MLP), self-organisation maps (SOM) and k-nearest neighbour (KNN), were applied to data collected through the "El Alamo-I" project, and the results were then compared to those obtained from the listwise deletion (LD) imputation method. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracies of predictions on early cancer relapse were measured using artificial neural networks (ANNs), in which different ANNs were estimated using the data sets with imputed missing values. Results: The imputation methods based on machine learning algorithms outperformed imputation statistical methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p = 0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p = 0.0053, p = 0.0048 and p = 0.0071, respectively) than the AUC from the LD-based prognosis model. Conclusion: The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared to imputation methods based on statistical procedures. © 2010 Elsevier B.V. All rights reserved.
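Illustrative sketch: the abstract contrasts listwise deletion, mean imputation and k-nearest-neighbour imputation. The Python below shows, on a small made-up array, what each of those three strategies does to rows with missing entries. It is not the authors' implementation and does not use the El Alamo-I data; the function names (`listwise_deletion`, `mean_impute`, `knn_impute`), the toy matrix `X`, the distance measure and the choice `k=2` are all assumptions made for illustration.

```python
import numpy as np


def listwise_deletion(X):
    """Drop every row that contains at least one missing value."""
    return X[~np.isnan(X).any(axis=1)]


def mean_impute(X):
    """Replace each missing value with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def knn_impute(X, k=2):
    """Fill each missing value with the mean of that feature over the k nearest donor rows.

    Distance between two rows is the root-mean-square difference over the
    features observed in both rows (an assumed, simple choice). A value stays
    NaN if no donor row shares an observed feature with the incomplete row.
    """
    missing = np.isnan(X)
    X_out = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        for j in np.where(missing[i])[0]:
            candidates = []
            for r in range(X.shape[0]):
                # A donor must be a different row with feature j observed.
                if r == i or missing[r, j]:
                    continue
                shared = ~missing[i] & ~missing[r]
                if not shared.any():
                    continue
                dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((dist, X[r, j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                X_out[i, j] = np.mean([value for _, value in candidates[:k]])
    return X_out


# Toy data (made up): two loose clusters, one missing entry in each.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.1],
    [5.0, 6.0, 7.0],
    [np.nan, 6.2, 7.1],
    [0.9, 2.1, 2.9],
    [5.2, 5.9, 6.9],
])

complete_cases = listwise_deletion(X)   # keeps 4 of the 6 rows
mean_filled = mean_impute(X)            # fills with column means
knn_filled = knn_impute(X, k=2)         # fills from the two most similar rows
```

On this toy array, listwise deletion discards a third of the records, mean imputation fills both gaps with a value that sits between the two clusters, and the nearest-neighbour fill stays close to the cluster the incomplete row belongs to. This only illustrates the mechanics; it says nothing about the prognosis results reported in the abstract.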
Pages: 105-115
Number of pages: 11