Ensemble with Divisive Bagging for Feature Selection in Big Data

Times cited: 1
Authors
Park, Yousung [1]
Kwon, Tae Yeon [2]
Affiliations
[1] Korea Univ, Dept Stat, 145 Anam Ro, Seoul 02841, South Korea
[2] Hankuk Univ Foreign Studies, Dept Int Finance, 81 Oedae Ro, Yongin 17035, Gyeonggi Do, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Feature selection; Bagging; Voting system; Ensemble; Big data; Feature importance; C55; C52; C63; C80; C15; C51; P-VALUES; BIASED-ESTIMATION; REGRESSION; LASSO;
DOI
10.1007/s10614-024-10741-y
Chinese Library Classification (CLC) number
F [Economics];
Discipline classification code
02;
Abstract
We introduce Ensemble with Divisive Bagging (EDB), a new feature selection method for linear models, to address the excessive selection of features in big data due to deflated p-values. Extensive simulations show that EDB derives parsimonious models without loss of predictive performance compared to lasso, ridge, elastic-net, LARS, and FS. We also show that EDB estimates feature importance in linear models more accurately than Random Forest, XGBoost, and CatBoost. Additionally, we apply EDB to feature selection in models for house prices and loan defaults. Our findings highlight the advantages of EDB: (1) effectively addressing deflated p-values and preventing the inclusion of extraneous features; (2) ensuring unbiased coefficient estimation; (3) adaptability to various models relying on p-value-based inferences; (4) construction of statistically explainable models with feature attribution and importance by preserving inferences based on a linear model and p-values; and (5) allowing application to linear economic models without altering the previous functional form of the model.
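The "deflated p-values" that the abstract refers to can be illustrated with a small calculation. The sketch below is not the EDB procedure, which the abstract does not specify. It only shows, under idealized assumptions (a single-feature linear model, unit variances, and a slope estimate equal to a hypothetical true effect of 0.02), how the p-value of a practically negligible coefficient shrinks as the sample size grows. The function names and the value of `beta` are illustrative assumptions.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def expected_z(beta: float, n: int, sd_x: float = 1.0, sd_noise: float = 1.0) -> float:
    """Approximate z-statistic of an OLS slope whose estimate equals beta.

    Assumes the usual large-sample standard error of a simple-regression
    slope, roughly sd_noise / (sd_x * sqrt(n)).
    """
    standard_error = sd_noise / (sd_x * math.sqrt(n))
    return beta / standard_error


# Hypothetical tiny effect: the same coefficient at three sample sizes.
beta = 0.02
p_values = {n: two_sided_p(expected_z(beta, n)) for n in (100, 10_000, 1_000_000)}

for n, p in p_values.items():
    print(f"n={n:>9,d}  p={p:.3g}")
```

With the effect held fixed, the coefficient is far from significant at n = 100, crosses the conventional 0.05 threshold around n = 10,000, and is overwhelmingly significant at n = 1,000,000, which is the kind of over-selection pressure the abstract describes.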
Pages: 34
Related papers
50 in total
  • [21] Towards ultrahigh dimensional feature selection for big data
    Tan, Mingkui
    Tsang, Ivor W.
    Wang, Li
    Journal of Machine Learning Research, 2014, 15 : 1371 - 1429
  • [22] Feature Selection Using Genetic Algorithm for Big Data
    Saidi, Rania
    Ncir, Waad Bouaguel
    Essoussi, Nadia
    INTERNATIONAL CONFERENCE ON ADVANCED MACHINE LEARNING TECHNOLOGIES AND APPLICATIONS (AMLTA2018), 2018, 723 : 352 - 361
  • [23] Feature selection for bagging of support vector machines
    Li, Guo-Zheng
    Liu, Tian-Yu
    PRICAI 2006: TRENDS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4099 : 271 - 277
  • [24] An online approach for feature selection for classification in big data
    Nazar, Nasrin Banu
    Senthilkumar, Radha
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2017, 25 (01) : 163 - 171
  • [26] Streaming feature selection algorithms for big data: A survey
    AlNuaimi, Noura
    Masud, Mohammad Mehedy
    Serhani, Mohamed Adel
    Zaki, Nazar
    APPLIED COMPUTING AND INFORMATICS, 2022, 18 (1/2) : 113 - 135
  • [27] Scalable and Accurate Online Feature Selection for Big Data
    Yu, Kui
    Wu, Xindong
    Ding, Wei
    Pei, Jian
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2016, 11 (02)
  • [28] Distributed Evolutionary Feature Selection for Big Data Processing
    Bouaguel, Waad
    Ben NCir, Chiheb Eddine
    VIETNAM JOURNAL OF COMPUTER SCIENCE, 2022, 09 (03) : 313 - 332
  • [29] Improved Feature Selection Model for Big Data Analytics
    El-Hasnony, Ibrahim M.
    Barakat, Sherif I.
    Elhoseny, Mohamed
    Mostafa, Reham R.
    IEEE ACCESS, 2020, 8 : 66989 - 67004
  • [30] Reducing Data Complexity in Feature Extraction and Feature Selection for Big Data Security Analytics
    Sisiaridis, Dimitrios
    Markowitch, Olivier
    2018 1ST INTERNATIONAL CONFERENCE ON DATA INTELLIGENCE AND SECURITY (ICDIS 2018), 2018 : 43 - 48