Bias in random forest variable importance measures: Illustrations, sources and a solution

Cited by: 2348

Authors:

Strobl, Carolin

Boulesteix, Anne-Laure

Zeileis, Achim

Hothorn, Torsten

Affiliations:

[1] Univ Munich, Inst Stat, D-80539 Munich, Germany

[2] Tech Univ Munich, Inst Med Stat & Epidemiol, D-81675 Munich, Germany

[3] Vienna Univ Econ & Business Adm, Dept Math & Stat, A-1090 Vienna, Austria

[4] Univ Erlangen Nurnberg, Inst Med Informat Biometrie & Epidemiol, D-91054 Erlangen, Germany

Source:

BMC BIOINFORMATICS | 2007 / Volume 8 / Issue 1

DOI:

10.1186/1471-2105-8-25

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.

Conclusion: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
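
The first mechanism named in the abstract, biased variable selection in individual classification trees, can be illustrated with a small simulation. The sketch below is not the authors' simulation design or their R code; it is a minimal standard-library Python illustration under stated assumptions: the response and all three predictors are pure noise, the split criterion is the Gini gain maximized over threshold splits, and the four-level predictor is treated as ordered. The function names (`best_gini_gain`, `simulate_null_selection`) and the sample sizes are hypothetical choices made for this example.

```python
import random


def gini(pos, n):
    """Gini impurity of a binary node with `pos` positives out of `n` cases."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest Gini gain over all threshold splits `x <= c` of one predictor."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    pairs = sorted(zip(x, y))
    best = 0.0
    left_n = 0
    left_pos = 0
    for i in range(n - 1):
        left_n += 1
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no valid cut point between tied values
        right_n = n - left_n
        right_pos = total_pos - left_pos
        children = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - children)
    return best


def simulate_null_selection(n_reps=200, n_obs=100, seed=1):
    """Count how often each uninformative predictor offers the best root split."""
    rng = random.Random(seed)
    wins = {"binary": 0, "four_level": 0, "continuous": 0}
    for _ in range(n_reps):
        # The response is independent of every predictor (null scenario).
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        predictors = {
            "binary": [rng.randint(0, 1) for _ in range(n_obs)],
            "four_level": [rng.randint(0, 3) for _ in range(n_obs)],
            "continuous": [rng.random() for _ in range(n_obs)],
        }
        gains = {name: best_gini_gain(x, y) for name, x in predictors.items()}
        wins[max(gains, key=gains.get)] += 1
    return wins


wins = simulate_null_selection()
print(wins)
```

Although no predictor carries any information, the predictor with the most candidate cut points should be selected far more often than the binary one, because the maximum over many splits tends to exceed the maximum over few. An unbiased split selection procedure would be expected to pick each predictor roughly equally often under this null scenario.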


Pages: 21

50 items in total

[21] NPP estimation using random forest and impact feature variable importance analysis
Yu, Bo
Chen, Fang
Chen, Hanyue
JOURNAL OF SPATIAL SCIENCE, 2019, 64 (01) : 173 - 192
[22] Using a Random Forest proximity measure for variable importance stratification in genotypic data
Seoane, Jose A.
Day, Ian N. M.
Campbell, Colin
Casas, Juan P.
Gaunt, Tom R.
PROCEEDINGS IWBBIO 2014: INTERNATIONAL WORK-CONFERENCE ON BIOINFORMATICS AND BIOMEDICAL ENGINEERING, VOLS 1 AND 2, 2014, : 1049 - 1060
[23] Collider-stratification bias when estimating variable importance using Random Forests
Long, Stephanie
Lefebvre, Genevieve
Schuster, Tibor
INTERNATIONAL JOURNAL OF EPIDEMIOLOGY, 2021, 50 : 143 - 143
[24] Bias in the intervention in prediction measure in random forests: illustrations and recommendations
Nembrini, Stefano
BIOINFORMATICS, 2019, 35 (13) : 2343 - 2345
[25] On what to permute in test-based approaches for variable importance measures in Random Forests
Nembrini, Stefano
BIOINFORMATICS, 2019, 35 (15) : 2701 - 2705
[26] Identification of influential rare variants in aggregate testing using random forest importance measures
Blumhagen, Rachel Z.
Schwartz, David A.
Langefeld, Carl D.
Fingerlin, Tasha E.
ANNALS OF HUMAN GENETICS, 2023, 87 (04) : 184 - 195
[27] VARIABLE IMPORTANCE AND RANDOM FOREST CLASSIFICATION USING RADARSAT-2 POLSAR DATA
Hariharan, Siddharth
Tirodkar, Siddhesh
De, Shaunak
Bhattacharya, Avik
2014 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS), 2014, : 1210 - 1213
[28] Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival
Ishwaran, Hemant
Lu, Min
STATISTICS IN MEDICINE, 2019, 38 (04) : 558 - 582
[29] Random forest and variable importance rankings for correlated survival data, with applications to tooth loss
Hallett, M. J.
Fan, J. J.
Su, X. G.
Levine, R. A.
Nunn, M. E.
STATISTICAL MODELLING, 2014, 14 (06) : 523 - 547
[30] Environmental variable importance for under-five mortality in Malaysia: A random forest approach
Phung, Vera Ling Hui
Oka, Kazutaka
Hijioka, Yasuaki
Ueda, Kayo
Sahani, Mazrura
Mahiyuddin, Wan Rozita Wan
SCIENCE OF THE TOTAL ENVIRONMENT, 2022, 845
