Ensemble with Divisive Bagging for Feature Selection in Big Data

Cited: 1
Authors
Park, Yousung [1 ]
Kwon, Tae Yeon [2 ]
Affiliations
[1] Korea Univ, Dept Stat, 145 Anam Ro, Seoul 02841, South Korea
[2] Hankuk Univ Foreign Studies, Dept Int Finance, 81 Oedae Ro, Yongin 17035, Gyeonggi Do, South Korea
Funding
National Research Foundation of Singapore
Keywords
Feature selection; Bagging; Voting system; Ensemble; Big data; Feature importance; C55; C52; C63; C80; C15; C51; P-VALUES; BIASED-ESTIMATION; REGRESSION; LASSO;
DOI
10.1007/s10614-024-10741-y
Chinese Library Classification
F [Economics]
Subject Classification
02
Abstract
We introduce Ensemble with Divisive Bagging (EDB), a new feature selection method for linear models, to address the excessive selection of features in big data caused by deflated p-values. Extensive simulations show that EDB yields parsimonious models without loss of predictive performance compared to lasso, ridge, elastic-net, LARS, and FS. We also show that EDB estimates feature importance in linear models more accurately than Random Forest, XGBoost, and CatBoost. Additionally, we apply EDB to feature selection in models for house prices and loan defaults. Our findings highlight the advantages of EDB: (1) effectively addressing deflated p-values and preventing the inclusion of extraneous features; (2) ensuring unbiased coefficient estimation; (3) adapting to various models that rely on p-value-based inference; (4) constructing statistically explainable models with feature attribution and importance by preserving inference based on a linear model and p-values; and (5) applying to linear economic models without altering the previous functional form of the model.
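The core idea described in the abstract can be illustrated with a short sketch. This is not the authors' algorithm: the function `divisive_bagging_select`, the disjoint-block partitioning, and the majority-vote threshold are assumptions chosen to convey how dividing a large sample into smaller blocks restores meaningful p-values, which the blocks then combine by voting.

```python
import numpy as np
from scipy import stats

def divisive_bagging_select(X, y, n_blocks=10, alpha=0.05, vote_frac=0.5):
    """Hedged sketch: partition the rows into disjoint blocks, fit OLS on
    each block, and keep a feature only if its p-value falls below `alpha`
    in at least `vote_frac` of the blocks."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    idx = rng.permutation(n)
    votes = np.zeros(p)
    for block in np.array_split(idx, n_blocks):
        # OLS with an intercept column on this block only
        Xb = np.column_stack([np.ones(len(block)), X[block]])
        yb = y[block]
        beta, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
        resid = yb - Xb @ beta
        dof = len(block) - Xb.shape[1]
        sigma2 = resid @ resid / dof
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xb.T @ Xb)))
        pvals = 2 * stats.t.sf(np.abs(beta / se), dof)
        votes += pvals[1:] < alpha  # skip the intercept
    return votes / n_blocks >= vote_frac

# Demo: 3 informative features among 10, with a sample large enough that
# full-sample p-values would be deflated toward zero.
rng = np.random.default_rng(1)
n, p = 5000, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + X[:, 2] + rng.standard_normal(n)
selected = divisive_bagging_select(X, y)
print(selected)
```

Because each spurious feature is "significant" in any one block only at rate `alpha`, requiring a majority vote across blocks makes its inclusion very unlikely, while truly informative features pass in every block.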
Pages: 34