Tree-based prediction on incomplete data using imputation or surrogate decisions

Cited by: 50
Authors
Valdiviezo, H. Cevallos [1]
Van Aelst, S. [1, 2]
Affiliations
[1] Univ Ghent, Dept Appl Math Comp Sci & Stat, B-9000 Ghent, Belgium
[2] Katholieke Univ Leuven, Dept Math, Sect Stat, B-3001 Louvain, Belgium
Keywords
Prediction; Missing data; Surrogate decision; Multiple imputation; Conditional inference tree; MICE
DOI
10.1016/j.ins.2015.03.018
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The goal is to investigate the prediction performance of tree-based techniques when the available training data contain features with missing values. Future test cases may also contain missing values, so the methods must be able to generate predictions for such cases. The missing values are handled either by using surrogate decisions within the trees or by combining an imputation method with a tree-based method. Missing values generated according to missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) mechanisms are considered, with various fractions of missing data. Imputation models are built in the learning phase and do not make use of the response variable, so that the resulting procedures can predict individual incomplete test cases. In the empirical comparison, both classification and regression problems are considered, using simulated and real-life datasets. Performance is evaluated by the misclassification rate of predictions and the mean squared prediction error, respectively. Overall, the results show that for smaller fractions of missing data an ensemble method combined with surrogates or single imputation suffices. For moderate to large fractions of missing values, ensemble methods based on conditional inference trees combined with multiple imputation show the best performance, while conditional bagging using surrogates is a good alternative for high-dimensional prediction problems. Theoretical results confirm the potentially better prediction performance of multiple imputation ensembles. (c) 2015 Elsevier Inc. All rights reserved.
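The workflow described in the abstract (learn an imputation model from the training features only, never from the response, then reuse it to complete an individual incomplete test case before prediction) can be illustrated with a minimal sketch. This is not the authors' procedure: per-feature mean imputation stands in for the imputation methods studied in the paper, bagged regression stumps stand in for the conditional inference tree ensembles, and all function names and the toy data are assumptions made for illustration.

```python
import random
from statistics import mean

def fit_mean_imputer(X):
    """Learn per-feature means from training features only (response unused).
    Assumes every feature has at least one observed value."""
    n_features = len(X[0])
    return [mean(row[j] for row in X if row[j] is not None)
            for j in range(n_features)]

def impute(row, means):
    """Fill missing entries (None) of a single case with the learned means."""
    return [means[j] if v is None else v for j, v in enumerate(row)]

def fit_stump(X, y):
    """Regression stump: the single split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_bagged(X, y, n_trees=25, seed=0):
    """Bagging: fit one stump per bootstrap sample of the completed data."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagged(stumps, row):
    """Average the predictions of all stumps in the ensemble."""
    return mean(predict_stump(s, row) for s in stumps)

# Toy training data (made up); None marks a missing feature value.
X_train = [[1.0, 5.0], [2.0, None], [3.0, 7.0],
           [None, 8.0], [5.0, 9.0], [6.0, 10.0]]
y_train = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]

means = fit_mean_imputer(X_train)             # learning phase, features only
X_completed = [impute(r, means) for r in X_train]
model = fit_bagged(X_completed, y_train)

x_new = [None, 9.5]                           # one incomplete test case
pred = predict_bagged(model, impute(x_new, means))
```

Because the imputer never sees the response, a single new case can be completed and predicted without knowing its outcome. A multiple-imputation variant would draw several completed versions of each case and average the resulting predictions; that step is omitted here for brevity.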
Pages: 163-181
Page count: 19
Related papers
50 in total
  • [21] Tree-based boosting with functional data
    Ju, Xiaomeng
    Salibian-Barrera, Matias
    COMPUTATIONAL STATISTICS, 2024, 39 (03) : 1587 - 1620
  • [22] Prediction and forecast of surface wind using ML tree-based algorithms
    M. H. ElTaweel
    S. C. Alfaro
    G. Siour
    A. Coman
    S. M. Robaa
    M. M. Abdel Wahab
    Meteorology and Atmospheric Physics, 2024, 136
  • [23] Tree-based boosting with functional data
    Xiaomeng Ju
    Matías Salibián-Barrera
    Computational Statistics, 2024, 39 : 1587 - 1620
  • [24] Tree-Based Models for Correlated Data
    Rabinowicz, Assaf
    Rosset, Saharon
    JOURNAL OF MACHINE LEARNING RESEARCH, 2022, 23
  • [25] Tree-Based Models for Correlated Data
    Rabinowicz, Assaf
    Rosset, Saharon
    Journal of Machine Learning Research, 2022, 23
  • [26] Prediction and forecast of surface wind using ML tree-based algorithms
    Eltaweel, M. H.
    Alfaro, S. C.
    Siour, G.
    Coman, A.
    Robaa, S. M.
    Wahab, M. M. Abdel
    METEOROLOGY AND ATMOSPHERIC PHYSICS, 2024, 136 (01)
  • [27] Handling Incomplete Data Using Evolution of Imputation Methods
    Zawistowski, Pawel
    Grzenda, Maciej
    ADAPTIVE AND NATURAL COMPUTING ALGORITHMS, 2009, 5495 : 22 - +
  • [28] Classification of repeated measurements data using tree-based ensemble methods
    Werner Adler
    Sergej Potapov
    Berthold Lausen
    Computational Statistics, 2011, 26
  • [29] A fair comparison of tree-based and parametric methods in multiple imputation by chained equations
    Slade, Emily
    Naylor, Melissa G.
    STATISTICS IN MEDICINE, 2020, 39 (08) : 1156 - 1166
  • [30] Feature Scoring using Tree-Based Ensembles for Evolving Data Streams
    Gomes, Heitor Murilo
    de Mello, Rodrigo Fernandes
    Pfahringer, Bernhard
    Bifet, Albert
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 761 - 769