Ensemble with Divisive Bagging for Feature Selection in Big Data

Cited by: 1
Authors
Park, Yousung [1]
Kwon, Tae Yeon [2]
Affiliations
[1] Korea Univ, Dept Stat, 145 Anam Ro, Seoul 02841, South Korea
[2] Hankuk Univ Foreign Studies, Dept Int Finance, 81 Oedae Ro, Yongin 17035, Gyeonggi Do, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Feature selection; Bagging; Voting system; Ensemble; Big data; Feature importance; C55; C52; C63; C80; C15; C51; P-VALUES; BIASED-ESTIMATION; REGRESSION; LASSO;
DOI
10.1007/s10614-024-10741-y
Chinese Library Classification (CLC) number
F [Economics];
Discipline classification code
02;
Abstract
We introduce Ensemble with Divisive Bagging (EDB), a new feature selection method for linear models, to address the excessive selection of features in big data caused by deflated p-values. Extensive simulations show that EDB derives parsimonious models without loss of predictive performance compared to lasso, ridge, elastic-net, LARS, and FS. We also show that EDB estimates feature importance in linear models more accurately than Random Forest, XGBoost, and CatBoost. Additionally, we apply EDB to feature selection in models for house prices and loan defaults. Our findings highlight the advantages of EDB: (1) effectively addressing deflated p-values and preventing the inclusion of extraneous features; (2) ensuring unbiased coefficient estimation; (3) adaptability to various models relying on p-value-based inferences; (4) construction of statistically explainable models with feature attribution and importance by preserving inferences based on a linear model and p-values; and (5) allowing application to linear economic models without altering the previous functional form of the model.
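The "deflated p-values" that the abstract refers to can be illustrated with a small sketch. It is not the EDB procedure from the paper. It only shows the general effect that a fixed, practically negligible association becomes statistically significant once the sample is large enough. The function name `slope_p_value`, the correlation `r = 0.005`, and the sample sizes are hypothetical choices made for illustration, and the p-value uses a normal approximation to the t distribution.

```python
import math


def slope_p_value(r: float, n: int) -> float:
    """Two-sided p-value for the slope of a simple linear regression.

    The test statistic is t = r * sqrt((n - 2) / (1 - r**2)), where r is the
    sample correlation and n the sample size. The p-value is computed with a
    normal approximation to the t distribution (an assumption that is
    reasonable for the large n used here).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical, practically negligible correlation held fixed across sizes.
r = 0.005

for n in (1_000, 100_000, 10_000_000):
    # Same effect size, larger sample: the p-value keeps shrinking.
    print(f"n={n:>12,d}  p={slope_p_value(r, n):.3g}")
```

With the effect size held constant, the p-value falls only because n grows, which is why p-value-based selection tends to admit extraneous features in very large samples.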
Pages: 34
Related papers
50 in total
  • [31] Feature Selection for Big Visual Data: Overview and Challenges
    Bolon-Canedo, Veronica
    Remeseiro, Beatriz
    Cancela, Brais
    IMAGE ANALYSIS AND RECOGNITION (ICIAR 2018), 2018, 10882 : 136 - 143
  • [32] On Feature Selection, Bias-Variance, and Bagging
    Munson, M. Arthur
    Caruana, Rich
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PT II, 2009, 5782 : 144 - +
  • [33] Bagging-based spectral clustering ensemble selection
    Jia, Jianhua
    Xiao, Xuan
    Liu, Bingxiang
    Jiao, Licheng
    PATTERN RECOGNITION LETTERS, 2011, 32 (10) : 1456 - 1467
  • [34] Intrusion Detection System using Bagging Ensemble Selection
    Sreenath, M.
    Udhayan, J.
    2015 IEEE INTERNATIONAL CONFERENCE ON ENGINEERING AND TECHNOLOGY (ICETECH), 2015, : 4 - 7
  • [35] Enhancing Big Data Feature Selection Using a Hybrid Correlation-Based Feature Selection
    Mohamad, Masurah
    Selamat, Ali
    Krejcar, Ondrej
    Crespo, Ruben Gonzalez
    Herrera-Viedma, Enrique
    Fujita, Hamido
    ELECTRONICS, 2021, 10 (23)
  • [36] Data Feature Selection Methods on Distributed Big Data Processing Platforms
    Catalkaya, Mehmet Burak
    Kalipsiz, Oya
    Aktas, Mehmet S.
    Turgut, Umut Orcun
    2018 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2018, : 133 - 138
  • [37] Feature selection for imbalanced data with deep sparse autoencoders ensemble
    Massi, Michela Carlotta
    Gasperoni, Francesca
    Ieva, Francesca
    Paganoni, Anna Maria
    STATISTICAL ANALYSIS AND DATA MINING, 2022, 15 (03) : 376 - 395
  • [38] Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data
    Lango, Mateusz
    Stefanowski, Jerzy
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2018, 50 (01) : 97 - 127
  • [39] Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data
    Mateusz Lango
    Jerzy Stefanowski
    Journal of Intelligent Information Systems, 2018, 50 : 97 - 127
  • [40] Prediction of functional outcomes of schizophrenia with genetic biomarkers using a bagging ensemble machine learning method with feature selection
    Eugene Lin
    Chieh-Hsin Lin
    Hsien-Yuan Lane
    Scientific Reports, 11