The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model

Cited by: 12
Authors
Guo, Chao-Yu [1, 2]
Yang, Ying-Chen [1, 2]
Chen, Yi-Hau [3 ]
Affiliations
[1] Natl Yang Ming Univ, Inst Publ Hlth, Sch Med, Taipei, Taiwan
[2] Natl Yang Ming Chiao Tung Univ, Inst Publ Hlth, Sch Med, Hsinchu, Taiwan
[3] Acad Sinica, Inst Stat Sci, Taipei, Taiwan
Keywords
machine learning; k-nearest neighbors imputation; random forest imputation; survival data simulation; Cox proportional hazard model
DOI
10.3389/fpubh.2021.680054
Chinese Library Classification (CLC) number
R1 [Preventive Medicine, Hygiene];
Discipline classification code
1004 ; 120402 ;
Abstract
Adequate imputation of missing data preserves statistical power and helps avoid erroneous conclusions. In the era of big data, machine learning is a useful tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the behavior of the Cox proportional hazards model under various types of imputation requires careful study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios, varying parameters such as sample size, missing rate, and missing mechanism. The results reveal the type-I errors of the different imputation techniques in survival data. The simulations show that the non-parametric "missForest" with unsupervised imputation is the only robust method, with no inflated type-I errors under any missing mechanism. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Improperly conducted statistical analysis of data with missing values may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
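The two accuracy statistics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: RMSE is computed over the imputed entries of a continuous variable, and PFC is the share of imputed categorical entries that differ from the true value. The function names, the `was_missing` flags, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def rmse(true_values, imputed_values, was_missing):
    """Root mean square error over the entries that were imputed."""
    squared = [
        (imp - true) ** 2
        for true, imp, miss in zip(true_values, imputed_values, was_missing)
        if miss
    ]
    if not squared:
        raise ValueError("no imputed entries to evaluate")
    return math.sqrt(sum(squared) / len(squared))


def pfc(true_labels, imputed_labels, was_missing):
    """Proportion of falsely classified entries among the imputed ones."""
    wrong = [
        imp != true
        for true, imp, miss in zip(true_labels, imputed_labels, was_missing)
        if miss
    ]
    if not wrong:
        raise ValueError("no imputed entries to evaluate")
    return sum(wrong) / len(wrong)


# Toy data (hypothetical): entries at positions 1 and 3 were missing.
was_missing = [False, True, False, True]
true_x, imputed_x = [1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 3.5]
true_c, imputed_c = ["a", "b", "a", "b"], ["a", "a", "a", "b"]

print(rmse(true_x, imputed_x, was_missing))
print(pfc(true_c, imputed_c, was_missing))
```

Both statistics measure only how closely the imputed values match the truth; they say nothing about the type-I error of a Cox model fitted afterwards, which is the question the abstract raises.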
Pages: 8