A comparison of random forest variable selection methods for classification prediction modeling

Cited by: 616

Authors:

Speiser, Jaime Lynn [1]

Miller, Michael E. [1]

Tooze, Janet [1]

Ip, Edward [1]

Affiliations:

[1] Wake Forest Sch Med, Dept Biostat Sci, Med Ctr Blvd, Winston Salem, NC 27157 USA

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2019 / Vol. 134

Funding:

US National Institutes of Health;

Keywords:

Random forest; Variable selection; Feature reduction; Classification;

DOI:

10.1016/j.eswa.2019.05.028

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times, and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test-based versus performance-based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems. (C) 2019 Elsevier Ltd. All rights reserved.

Pages: 93-101

Number of pages: 9
