Missing data imputation using classification and regression trees

Authors
Chen, Cheng-Yang [1]
Chang, Yu-Wei [1]
Affiliations
[1] National Chengchi University, Department of Statistics, Taipei, Taiwan
Keywords
Classification and regression trees; Missing data; Missing data imputation; Resampling; Multiple imputation; Decision trees; BART
DOI
10.7717/peerj-cs.2119
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Background. Missing data are common in the analysis of real data. One popular solution is to impute the missing values so that a complete dataset is available for subsequent analysis. The present study focuses on missing data imputation using classification and regression trees (CART).
Methods. We consider a new perspective on missing data in a CART imputation problem and realize it through several resampling algorithms. Several existing CART-based imputation methods are compared through simulation studies to identify which methods impute more accurately under various conditions. Some systematic findings are demonstrated and presented. The imputation methods are further applied, for illustration, to two real datasets: the Hepatitis data and the Credit approval data.
Results. The best-performing method depends strongly on the correlation between variables. For imputing missing ordinal categorical variables, the rpart package with surrogate variables is recommended when correlations are larger than 0 under missing completely at random (MCAR) and missing at random (MAR) conditions. Under missing not at random (MNAR), chi-squared test methods and the rpart package with surrogate variables are suggested. For imputing missing quantitative variables, the iterative imputation method is most recommended under moderate correlation conditions.
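As a rough illustration of the general idea behind CART-based imputation of a quantitative variable, the sketch below fits a small regression tree on the rows where the variable is observed and uses it to predict the rows where it is missing. This is not the authors' method, nor the rpart surrogate-variable or iterative procedures named in the abstract; it is a single deterministic imputation under simplifying assumptions. The function names (`fit_tree`, `predict`, `impute`), the toy data, and the parameters `min_leaf` and `depth` are all hypothetical and chosen only for this example.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(rows, target, features, min_leaf=2, depth=3):
    """Grow a small regression tree by greedy splits that minimize SSE."""
    ys = [r[target] for r in rows]
    leaf = {"leaf": mean(ys)}
    if depth == 0 or len(rows) < 2 * min_leaf:
        return leaf
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate cut points are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [r for r in rows if r[f] <= cut]
            right = [r for r in rows if r[f] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([r[target] for r in left]) + sse([r[target] for r in right])
            if best is None or cost < best[0]:
                best = (cost, f, cut, left, right)
    # Stop if no admissible split reduces the error.
    if best is None or best[0] >= sse(ys):
        return leaf
    _, f, cut, left, right = best
    return {
        "feature": f,
        "cut": cut,
        "left": fit_tree(left, target, features, min_leaf, depth - 1),
        "right": fit_tree(right, target, features, min_leaf, depth - 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

def impute(rows, target, features):
    """Return copies of rows with missing target values (None) filled in."""
    observed = [r for r in rows if r[target] is not None]
    tree = fit_tree(observed, target, features)
    return [
        dict(r, **{target: predict(tree, r)}) if r[target] is None else dict(r)
        for r in rows
    ]

# Toy data (made up for illustration): y is missing in the last two rows.
data = [
    {"x": 1.0, "y": 2.1}, {"x": 2.0, "y": 1.9}, {"x": 3.0, "y": 2.0},
    {"x": 7.0, "y": 8.2}, {"x": 8.0, "y": 7.8}, {"x": 9.0, "y": 8.0},
    {"x": 2.5, "y": None}, {"x": 8.5, "y": None},
]
completed = impute(data, target="y", features=["x"])
```

Because the fill-in value is a leaf mean, this sketch understates the variability of the imputed values and assumes the predictor `x` is fully observed; handling missing predictors or drawing imputed values at random would need additional steps that are not shown here.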
Pages: 29