A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

Cited by: 7
Authors
Nguyen, Thanh-Tung [1]
Zhao, He [2]
Huang, Joshua Zhexue [3]
Nguyen, Thuy Thi [4]
Li, Mark Junjie [3]
Affiliations
[1] Thuyloi Univ, Fac Comp Sci & Engn, Hanoi, Vietnam
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
[3] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[4] Vietnam Natl Univ Agr, Fac Informat Technol, Hanoi, Vietnam
Keywords
Subspace feature selection; Regression; Classification; Random forests; Data mining; High-dimensional data; Selection
DOI
10.1007/978-3-319-18032-8_36
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Random Forests (RF) models have been shown to perform well in both classification and regression. However, because RF randomizes both the bagged samples and the feature selection, its performance can deteriorate on high-dimensional data. In this paper, we propose a new feature sampling approach for RF that addresses high-dimensional data. We first use p-values to assess feature importance and to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using statistical measures. When sampling the feature subspace for learning RFs, features from all three groups are taken into account. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem, for robustness to outliers. The experimental results demonstrated that the proposed approach for learning random forests significantly reduced prediction errors and outperformed most existing random forests when dealing with high-dimensional data.
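The three-group subspace sampling described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the p-values are computed, which statistical measures separate highly informative from informative features, or how many features are drawn from each group. Everything specific below is therefore an assumption made for illustration: the function names `partition_features` and `sample_subspace`, the cut-offs `alpha` and `strong`, and the rule of drawing one feature from each non-empty group before filling the remainder uniformly at random.

```python
import random


def partition_features(p_values, alpha=0.05, strong=0.001):
    """Split feature indices into three groups by p-value.

    alpha and strong are hypothetical cut-offs chosen for illustration;
    the abstract does not state the thresholds actually used.
    """
    high, mid, low = [], [], []
    for j, p in enumerate(p_values):
        if p >= alpha:
            low.append(j)    # less informative
        elif p < strong:
            high.append(j)   # highly informative
        else:
            mid.append(j)    # informative
    return high, mid, low


def sample_subspace(groups, mtry, rng):
    """Draw mtry distinct features, representing each non-empty group.

    Assumed rule: take one feature from each non-empty group first
    (as far as mtry allows), then fill the rest uniformly at random.
    """
    non_empty = [g for g in groups if g]
    total = sum(len(g) for g in non_empty)
    mtry = min(mtry, total)  # cannot draw more features than exist
    chosen = [rng.choice(g) for g in non_empty[:mtry]]
    remaining = [j for g in non_empty for j in g if j not in chosen]
    chosen += rng.sample(remaining, mtry - len(chosen))
    return sorted(chosen)


# Toy p-values for eight features (made-up numbers, not from the paper).
p_values = [0.0001, 0.2, 0.03, 0.5, 0.0005, 0.04, 0.9, 0.01]
high, mid, low = partition_features(p_values)
subspace = sample_subspace((high, mid, low), mtry=4, rng=random.Random(0))
print(high, mid, low, subspace)
```

In a forest, a subspace like this would be redrawn at each tree or node, so the weaker features still contribute to diversity while every subspace contains at least one highly informative feature.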
Pages: 459-470
Number of pages: 12