A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods

Cited by: 4
Authors
Ge, Yingfeng [1 ]
Li, Zhiwei [1 ]
Zhang, Jinxin [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Publ Hlth, Dept Med Stat, Guangzhou 510080, Peoples R China
Keywords
MULTIPLE IMPUTATION;
DOI
10.1038/s41598-023-36509-2
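Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];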
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Missing data, particularly in dichotomous variables, are common in medical research. However, few studies have examined imputation methods for dichotomous data, their applicability, and the factors that may affect their performance. The application scenarios varied the missing mechanism, sample size, missing rate, correlation between variables, value distribution, and number of variables with missing values. We used data simulation to construct a range of compound scenarios for missing dichotomous variables and validated the findings on two real-world medical datasets. We compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario, using accuracy and mean absolute error (MAE) as performance measures. The results showed that the missing mechanism, the value distribution, and the correlation between variables were the main factors affecting imputation performance. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and showed potential for practical use. When encountering missing dichotomous data, researchers should explore the correlation between variables and their distribution patterns in advance and prioritize machine learning-based methods in practical applications.
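The kind of comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation design or code: the function names, the 0.8 agreement rate, the 20% missing rate, and the "conditional mode" imputer (a simple stand-in for a predictor-based method) are all assumptions chosen for illustration. It simulates one dichotomous variable correlated with a fully observed covariate, deletes values completely at random (MCAR), and scores two imputers by accuracy and MAE.

```python
import numpy as np

def simulate(n=2000, p_x=0.5, agree=0.8, seed=0):
    """Draw a binary covariate x and a binary target y that equals x
    with probability `agree` (hypothetical data-generating process)."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, p_x, size=n)
    flip = rng.random(n) > agree
    y = np.where(flip, 1 - x, x)
    return x, y

def mcar_mask(n, rate=0.2, seed=1):
    """Boolean mask of missing entries under MCAR: each value is
    deleted independently with probability `rate`."""
    rng = np.random.default_rng(seed)
    return rng.random(n) < rate

def mode_impute(y, missing):
    """Fill every missing value with the mode of the observed values."""
    mode = int(y[~missing].mean() >= 0.5)
    return np.full(missing.sum(), mode)

def conditional_mode_impute(x, y, missing):
    """Fill each missing value with the observed mode of y among cases
    sharing the same x (a simple predictor-based imputer)."""
    overall = int(y[~missing].mean() >= 0.5)
    x_miss = x[missing]
    imputed = np.empty(missing.sum(), dtype=int)
    for level in (0, 1):
        observed = y[~missing & (x == level)]
        fill = int(observed.mean() >= 0.5) if observed.size else overall
        imputed[x_miss == level] = fill
    return imputed

def accuracy(true, pred):
    return float(np.mean(true == pred))

def mae(true, pred):
    return float(np.mean(np.abs(true - pred)))

x, y = simulate()
missing = mcar_mask(len(y))
truth = y[missing]  # known here only because the data are simulated

for name, pred in [
    ("mode", mode_impute(y, missing)),
    ("conditional mode", conditional_mode_impute(x, y, missing)),
]:
    print(f"{name}: accuracy={accuracy(truth, pred):.3f}, MAE={mae(truth, pred):.3f}")
```

For a single imputed value coded 0/1, MAE is simply one minus accuracy, so in this sketch the two measures carry the same information; they would differ if, for example, imputed values were averaged over several imputations before scoring.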
Pages: 13
Related papers
50 in total
  • [31] Missing Data Analysis Using Statistical and Machine Learning Methods in Facility-Based Maternal Health Records
    Memon S.M.Z.
    Wamala R.
    Kabano I.H.
    [J]. SN Computer Science, 3 (5)
  • [32] Missing Data Imputation for Supervised Learning
    Poulos, Jason
    Valle, Rafael
    [J]. APPLIED ARTIFICIAL INTELLIGENCE, 2018, 32 (02) : 186 - 196
  • [33] A Comparative Study of Missing Value Imputation Methods for Education Data
    Keerin, Phimmarin
    [J]. 29TH INTERNATIONAL CONFERENCE ON COMPUTERS IN EDUCATION (ICCE 2021), VOL II, 2021, : 109 - 117
  • [34] A comparison of imputation methods using machine learning models
    Suh, Heajung
    Song, Jongwoo
    [J]. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS, 2023, 30 (03) : 331 - 341
  • [35] Missing Data Imputation Using Ensemble Learning Technique: A Review
    Jegadeeswari, K.
    Ragunath, R.
    Rathipriya, R.
    [J]. SOFT COMPUTING FOR SECURITY APPLICATIONS, ICSCS 2022, 2023, 1428 : 223 - 236
  • [36] Missing Value Imputation for RNA-Sequencing Data Using Statistical Models: A Comparative Study
    Taban Baghfalaki
    Mojtaba Ganjali
    Damon Berridge
    [J]. Journal of Statistical Theory and Applications, 2016, 15 (3) : 221 - 236
  • [37] Evaluation of machine learning methods for covariate data imputation in pharmacometrics
    Braem, Dominic Stefan
    Nahum, Uri
    Atkinson, Andrew
    Koch, Gilbert
    Pfister, Marc
    [J]. CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY, 2022, 11 (12) : 1638 - 1648
  • [38] Variable selection with missing data in both covariates and outcomes: Imputation and machine learning
    Hu, Liangyuan
    Lin, Jung-Yi Joyce
    Ji, Jiayi
    [J]. STATISTICAL METHODS IN MEDICAL RESEARCH, 2021, 30 (12) : 2651 - 2671
  • [39] Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?
    Chen, Zhi
    Tan, Sarah
    Chajewska, Urszula
    Rudin, Cynthia
    Caruana, Rich
    [J]. CONFERENCE ON HEALTH, INFERENCE, AND LEARNING, VOL 209, 2023, 209 : 86 - 99