Optimal Trees Selection for Classification via Out-of-Bag Assessment and Sub-Bagging

Cited by: 21
Authors
Khan, Zardad [1 ,2 ]
Gul, Naz [1 ]
Faiz, Nosheen [1 ,2 ]
Gul, Asma [3 ]
Adler, Werner [4 ]
Lausen, Berthold [2 ,4 ]
Affiliations
[1] Abdul Wali Khan Univ Mardan, Dept Stat, Mardan 23200, Pakistan
[2] Univ Essex, Dept Math Sci, Colchester CO4 3SQ, Essex, England
[3] Shaheed Benazir Bhutto Women Univ Peshawar, Dept Stat, Peshawar 25000, Pakistan
[4] Univ Erlangen Nurnberg, Dept Biometry & Epidemiol, D-91054 Erlangen, Germany
Funding
UK Economic and Social Research Council (ESRC)
Keywords
Vegetation; Training data; Random forests; Training; Forestry; Benchmark testing; Regression tree analysis; Tree selection; classification; ensemble learning; out-of-bag sample; random forest; sub-bagging; RANDOM FOREST ALGORITHM; STABILITY ASSESSMENT; ENSEMBLE; CLASSIFIERS; REDUCTION
DOI
10.1109/ACCESS.2021.3055992
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The effect of training data size on machine learning methods has been well investigated over the past two decades. The predictive performance of tree-based machine learning methods generally improves at a decreasing rate as the size of the training data increases. We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations due to internal validation. Modified tree selection methods are therefore proposed for OTE to compensate for the loss of training observations to internal validation. In the first method, the corresponding out-of-bag (OOB) observations are used in both the individual and collective performance assessment of each tree. Trees are ranked by their individual performance on the OOB observations. A certain number of top-ranked trees is selected and, starting from the most accurate tree, subsequent trees are added one by one; the impact of each is recorded using the OOB observations left out of the bootstrap sample drawn for the tree being added. A tree is selected if it improves the predictive accuracy of the ensemble. In the second approach, trees are grown on random subsets of the training data taken without replacement (known as sub-bagging) instead of bootstrap samples (taken with replacement). The observations left out of each sample are used in both the individual and collective assessment of each corresponding tree, as in the first method. Analyses on 21 benchmark datasets and simulation studies show improved performance of the modified methods in comparison to OTE and other state-of-the-art methods.
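The first selection method described in the abstract (rank trees by individual OOB accuracy, then greedily keep a top-ranked tree only if it improves ensemble accuracy on that tree's own OOB sample) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dataset, the number of trees `B`, the shortlist size `top_k`, and the helper `ensemble_acc` are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n, B, top_k = len(y), 50, 20  # illustrative sizes, not the paper's settings

trees, oob_masks, scores = [], [], []
for b in range(B):
    idx = rng.integers(0, n, n)               # bootstrap sample (with replacement)
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                          # OOB = observations left out of the sample
    t = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    trees.append(t)
    oob_masks.append(oob)
    scores.append(t.score(X[oob], y[oob]))    # individual OOB accuracy

order = np.argsort(scores)[::-1][:top_k]      # shortlist of top-ranked trees

def ensemble_acc(members, mask):
    """Majority-vote accuracy of the trees in `members` on rows selected by `mask`."""
    votes = np.mean([trees[m].predict(X[mask]) for m in members], axis=0)
    return np.mean((votes >= 0.5) == y[mask])

selected = [order[0]]                         # start from the most accurate tree
for m in order[1:]:
    mask = oob_masks[m]                       # OOB sample of the candidate tree
    if ensemble_acc(selected + [m], mask) > ensemble_acc(selected, mask):
        selected.append(m)                    # keep only trees that help the ensemble

print(len(selected), "of", top_k, "shortlisted trees selected")
```

For the paper's second method, the bootstrap line would instead draw a subset without replacement (sub-bagging), e.g. `idx = rng.choice(n, size=n // 2, replace=False)`, with the remaining observations playing the role of the OOB sample.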
Pages: 28591-28607
Page count: 17