Optimal Trees Selection for Classification via Out-of-Bag Assessment and Sub-Bagging

Cited by: 21
Authors
Khan, Zardad [1 ,2 ]
Gul, Naz [1 ]
Faiz, Nosheen [1 ,2 ]
Gul, Asma [3 ]
Adler, Werner [4 ]
Lausen, Berthold [2 ,4 ]
Affiliations
[1] Abdul Wali Khan Univ Mardan, Dept Stat, Mardan 23200, Pakistan
[2] Univ Essex, Dept Math Sci, Colchester CO4 3SQ, Essex, England
[3] Shaheed Benazir Bhutto Women Univ Peshawar, Dept Stat, Peshawar 25000, Pakistan
[4] Univ Erlangen Nurnberg, Dept Biometry & Epidemiol, D-91054 Erlangen, Germany
Funding
UK Economic and Social Research Council (ESRC)
Keywords
Vegetation; Training data; Random forests; Training; Forestry; Benchmark testing; Regression tree analysis; Tree selection; classification; ensemble learning; out-of-bag sample; random forest; sub-bagging; RANDOM FOREST ALGORITHM; STABILITY ASSESSMENT; ENSEMBLE; CLASSIFIERS; REDUCTION
DOI
10.1109/ACCESS.2021.3055992
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The effect of training data size on machine learning methods has been well investigated over the past two decades. The predictive performance of tree-based machine learning methods generally improves at a decreasing rate as the size of the training data increases. We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations due to internal validation. Modified tree selection methods are therefore proposed for OTE to compensate for the loss of training observations to internal validation. In the first method, the corresponding out-of-bag (OOB) observations are used in both the individual and collective performance assessment of each tree. Trees are ranked by their individual performance on the OOB observations. A certain number of top-ranked trees is selected and, starting from the most accurate tree, subsequent trees are added one by one; the impact of each is recorded using the OOB observations left out of the bootstrap sample drawn for the tree being added. A tree is selected if it improves the predictive accuracy of the ensemble. In the second approach, trees are grown on random subsets of the training data taken without replacement (known as sub-bagging) instead of bootstrap samples (taken with replacement). The observations left out of each sample are used in both the individual and collective assessment of each corresponding tree, as in the first method. Analyses on 21 benchmark datasets and simulation studies show improved performance of the modified methods in comparison to OTE and other state-of-the-art methods.
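The first selection method described in the abstract (rank trees by individual OOB accuracy, then greedily keep a top-ranked tree only if it improves ensemble accuracy on that tree's own OOB sample) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dataset, the number of trees `B`, the shortlist size `top_k`, and the helper `ensemble_acc` are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n, B, top_k = len(y), 50, 20  # illustrative sizes, not the paper's settings

trees, oob_masks, scores = [], [], []
for b in range(B):
    idx = rng.integers(0, n, n)               # bootstrap sample (with replacement)
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                          # OOB = observations left out of the sample
    t = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    trees.append(t)
    oob_masks.append(oob)
    scores.append(t.score(X[oob], y[oob]))    # individual OOB accuracy

order = np.argsort(scores)[::-1][:top_k]      # shortlist of top-ranked trees

def ensemble_acc(members, mask):
    """Majority-vote accuracy of the trees in `members` on rows selected by `mask`."""
    votes = np.mean([trees[m].predict(X[mask]) for m in members], axis=0)
    return np.mean((votes >= 0.5) == y[mask])

selected = [order[0]]                         # start from the most accurate tree
for m in order[1:]:
    mask = oob_masks[m]                       # OOB sample of the candidate tree
    if ensemble_acc(selected + [m], mask) > ensemble_acc(selected, mask):
        selected.append(m)                    # keep only trees that help the ensemble

print(len(selected), "of", top_k, "shortlisted trees selected")
```

For the paper's second method, the bootstrap line would instead draw a subset without replacement (sub-bagging), e.g. `idx = rng.choice(n, size=n // 2, replace=False)`, with the remaining observations playing the role of the OOB sample.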
Pages: 28591-28607
Page count: 17