Empirical characterization of random forest variable importance measures

Cited by: 746
Authors
Archer, Kellie J. [1]
Kimes, Ryan V. [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Biostat, Richmond, VA 23298 USA
Keywords
random forest; classification tree; variable importance; bootstrap aggregating
DOI
10.1016/j.csda.2007.08.015
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, often the goals are both to produce an accurate classifier and to uncover the predictive structure of the problem. Most machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight regarding the covariates that best contribute to the predictive structure. Other methods, such as linear discriminant analysis, require that the predictor space be substantially reduced prior to deriving the classifier. A recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yield variable importance measures for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. Such goals are common among microarray studies, and therefore application of the RF methodology for the purpose of obtaining variable importance measures is demonstrated on a microarray data set. (c) 2007 Elsevier B.V. All rights reserved.
Pages: 2249 - 2260
Number of pages: 12
Related papers
50 in total
  • [31] Random Forest Variable Importance Spectral Indices Scheme for Burnt Forest Recovery Monitoring - Multilevel RF-VIMP
    Boonprong, Sornkitja
    Cao, Chunxiang
    Chen, Wei
    Bao, Shanning
    REMOTE SENSING, 2018, 10 (06)
  • [32] Correlation and variable importance in random forests
    Gregorutti, Baptiste
    Michel, Bertrand
    Saint-Pierre, Philippe
    STATISTICS AND COMPUTING, 2017, 27 (03) : 659 - 678
  • [33] Correlation and variable importance in random forests
    Baptiste Gregorutti
    Bertrand Michel
    Philippe Saint-Pierre
    Statistics and Computing, 2017, 27 : 659 - 678
  • [34] Conditional variable importance for random forests
    Strobl, Carolin
    Boulesteix, Anne-Laure
    Kneib, Thomas
    Augustin, Thomas
    Zeileis, Achim
    BMC BIOINFORMATICS, 2008, 9 (1)
  • [35] Correlated variable importance for random forests
    Shin, Seung Beom
    Cho, Hyung Jun
    KOREAN JOURNAL OF APPLIED STATISTICS, 2021, 34 (02) : 177 - 190
  • [36] Unbiased variable importance for random forests
    Loecher, Markus
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2022, 51 (05) : 1413 - 1425
  • [37] Conditional variable importance for random forests
    Carolin Strobl
    Anne-Laure Boulesteix
    Thomas Kneib
    Thomas Augustin
    Achim Zeileis
    BMC Bioinformatics, 9
  • [38] Consistent and unbiased variable selection under independent features using Random Forest permutation importance
    Ramosaj, Burim
    Pauly, Markus
    BERNOULLI, 2023, 29 (03) : 2101 - 2118
  • [39] Assessing agreement between permutation and dropout variable importance methods for regression and random forest models
    Bladen, Kelvyn
    Cutler, Richard
    ELECTRONIC RESEARCH ARCHIVE, 2024, 32 (07) : 4495 - 4514
  • [40] An empirical test of six stated importance measures
    Chrzan, Keith
    Golovashkina, Natalia
    INTERNATIONAL JOURNAL OF MARKET RESEARCH, 2006, 48 (06) : 717 - 740