Tree-based prediction on incomplete data using imputation or surrogate decisions

Cited by: 50
Authors
Valdiviezo, H. Cevallos [1]
Van Aelst, S. [1, 2]
Affiliations
[1] Univ Ghent, Dept Appl Math Comp Sci & Stat, B-9000 Ghent, Belgium
[2] Katholieke Univ Leuven, Dept Math, Sect Stat, B-3001 Louvain, Belgium
Keywords
Prediction; Missing data; Surrogate decision; Multiple imputation; Conditional inference tree; MICE
DOI
10.1016/j.ins.2015.03.018
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The goal is to investigate the prediction performance of tree-based techniques when the available training data contain features with missing values. Future test cases may also contain missing values, so the methods must be able to generate predictions for such cases. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Missing values generated according to missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) mechanisms are considered, with various fractions of missing data. Imputation models are built in the learning phase and do not use the response variable, so the resulting procedures can predict individual incomplete test cases. In the empirical comparison, both classification and regression problems are considered using simulated and real-life datasets. Performance is evaluated by the misclassification rate of the predictions and by the mean squared prediction error, respectively. Overall, our results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles. © 2015 Elsevier Inc. All rights reserved.
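The idea of an imputation model that is built in the learning phase without the response, and then reused on incomplete test cases inside an ensemble, can be sketched as follows. This is only an illustrative toy under stated assumptions, not the procedure evaluated in the paper: random draws from observed training values stand in for MICE, one-split regression stumps stand in for conditional inference trees, and all function names (`fit_imputer`, `impute`, `fit_mi_ensemble`, and so on) and the data are hypothetical.

```python
import random
from statistics import mean

MISSING = None  # marker for a missing feature value (assumption of this sketch)


def fit_imputer(X):
    """Collect the observed values of each feature. The response is never used."""
    n_features = len(X[0])
    return [[row[j] for row in X if row[j] is not MISSING] for j in range(n_features)]


def impute(row, donors, rng):
    """Fill each missing entry with a random draw from the training values."""
    return [rng.choice(donors[j]) if v is MISSING else v for j, v in enumerate(row)]


def fit_stump(X, y):
    """Fit a one-split regression tree by minimising the squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split found: always predict the overall mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def fit_mi_ensemble(X, y, n_imputations=5, seed=0):
    """Fit one stump per randomly completed copy of the training data."""
    rng = random.Random(seed)
    donors = fit_imputer(X)
    stumps = []
    for _ in range(n_imputations):
        X_completed = [impute(row, donors, rng) for row in X]
        stumps.append(fit_stump(X_completed, y))
    return donors, stumps, rng


def predict_mi_ensemble(model, row):
    """Impute the test case once per ensemble member and average the predictions."""
    donors, stumps, rng = model
    return mean([predict_stump(s, impute(row, donors, rng)) for s in stumps])


# Toy regression data with missing entries in both features.
X_train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 40.0], [5.0, 50.0], [6.0, 60.0]]
y_train = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

model = fit_mi_ensemble(X_train, y_train)
prediction = predict_mi_ensemble(model, [None, 55.0])  # incomplete test case
print(prediction)
```

Because the imputer only stores feature values from the training set, a single incomplete test case can be completed and scored without access to its response, which is the property the abstract emphasises.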
Pages: 163-181
Number of pages: 19