A comparison of random forest variable selection methods for classification prediction modeling

Cited by: 616
Authors
Speiser, Jaime Lynn [1]
Miller, Michael E. [1]
Tooze, Janet [1]
Ip, Edward [1]
Affiliations
[1] Wake Forest Sch Med, Dept Biostat Sci, Med Ctr Blvd, Winston Salem, NC 27157 USA
Funding
National Institutes of Health (USA);
Keywords
Random forest; Variable selection; Feature reduction; Classification;
DOI
10.1016/j.eswa.2019.05.028
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test-based versus performance-based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems. (C) 2019 Elsevier Ltd. All rights reserved.
Pages: 93-101
Number of pages: 9
Related papers
50 in total
  • [2] Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology
    Fox, Eric W.
    Hill, Ryan A.
    Leibowitz, Scott G.
    Olsen, Anthony R.
    Thornbrugh, Darren J.
    Weber, Marc H.
    [J]. ENVIRONMENTAL MONITORING AND ASSESSMENT, 2017, 189 (07)
  • [3] Random forest for ordinal responses: Prediction and variable selection
    Janitza, Silke
    Tutz, Gerhard
    Boulesteix, Anne-Laure
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2016, 96 : 57 - 73
  • [4] Comparison of variable selection methods for clinical predictive modeling
    Sanchez-Pinto, L. Nelson
    Venable, Laura Ruth
    Fahrenbach, John
    Churpek, Matthew M.
    [J]. INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2018, 116 : 10 - 17
  • [5] Identification of Geographic Specific SARS-Cov-2 Mutations by Random Forest Classification and Variable Selection Methods
    Kandpal, Manoj
    Davuluri, Ramana V.
    [J]. STATISTICS AND APPLICATIONS, 2020, 18 (01) : 253 - 268
  • [6] Comparison of Sampling Methods for Imbalanced Data Classification in Random Forest
    Paing, May Phu
    Pintavirooj, C.
    Tungjitkusolmun, Supan
    Choomchuay, Somsak
    Hamamoto, Kazuhiko
    [J]. 2018 11TH BIOMEDICAL ENGINEERING INTERNATIONAL CONFERENCE (BMEICON 2018), 2018
  • [7] Variable selection and prediction of uniaxial compressive strength and modulus of elasticity by random forest
    Matin, S. S.
    Farahzadi, L.
    Makaremi, S.
    Chelgani, S. Chehreh
    Sattari, Gh.
    [J]. APPLIED SOFT COMPUTING, 2018, 70 : 980 - 987
  • [8] Comparison of Variable Selection Methods in Random Forests for Genomic Data Sets
    Szymczak, Silke
    Malley, James
    Franke, Andre
    [J]. HUMAN HEREDITY, 2013, 76 (02) : 88 - 89
  • [9] A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data
    Menze, Bjoern H.
    Kelm, B. Michael
    Masuch, Ralf
    Himmelreich, Uwe
    Bachert, Peter
    Petrich, Wolfgang
    Hamprecht, Fred A.
    [J]. BMC BIOINFORMATICS, 2009, 10