Bias in random forest variable importance measures: Illustrations, sources and a solution

Cited by: 2348
Authors
Strobl, Carolin
Boulesteix, Anne-Laure
Zeileis, Achim
Hothorn, Torsten
Affiliations
[1] Univ Munich, Inst Stat, D-80539 Munich, Germany
[2] Tech Univ Munich, Inst Med Stat & Epidemiol, D-81675 Munich, Germany
[3] Vienna Univ Econ & Business Adm, Dept Math & Stat, A-1090 Vienna, Austria
[4] Univ Erlangen Nurnberg, Inst Med Informat Biometrie & Epidemiol, D-91054 Erlangen, Germany
DOI
10.1186/1471-2105-8-25
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. Two mechanisms underlie this deficiency: biased variable selection in the individual classification trees used to build the random forest, and effects induced by bootstrap sampling with replacement.

Conclusion: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
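The first mechanism named in the abstract, biased variable selection in individual classification trees, can be illustrated with a small null simulation. The sketch below is not the authors' simulation design or their R code. It is a minimal standard-library Python illustration under stated assumptions: the response is binary and independent of every predictor, predictors differ only in their number of categories, categories are treated as ordered so that only cutpoint splits are searched, and the function names (`gini`, `best_gini_gain`, `selection_counts`) and all parameter values are hypothetical. Because no predictor is informative, an unbiased criterion would select each one about equally often.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest Gini impurity decrease over all cutpoint splits x <= c.

    Assumption: categories are treated as ordered, so a predictor with
    k observed levels offers k - 1 candidate splits.
    """
    n = len(y)
    parent = gini(y)
    best = 0.0
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        gain = (parent
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        best = max(best, gain)
    return best


def selection_counts(n_levels=(2, 4, 10, 20), n_obs=120, n_reps=300, seed=1):
    """Count how often each uninformative predictor wins the root split."""
    rng = random.Random(seed)
    counts = {k: 0 for k in n_levels}
    for _ in range(n_reps):
        # Response drawn independently of all predictors (null case).
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = {}
        for k in n_levels:
            x = [rng.randrange(k) for _ in range(n_obs)]
            gains[k] = best_gini_gain(x, y)
        winner = max(gains, key=gains.get)
        counts[winner] += 1
    return counts


counts = selection_counts()
for k, c in counts.items():
    print(f"{k:>2} categories: selected in {c} of {sum(counts.values())} runs")
```

In this sketch the predictors with more categories tend to win the split more often even though none carries any information, simply because they offer more candidate cutpoints to maximise over. The second mechanism, bootstrap sampling with replacement, is not covered by this example.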
Pages: 21