Smart Robust Feature Selection (SoFt) for imbalanced and heterogeneous data

Cited by: 7
Authors
Kasim, Henry [1 ]
King, Stephen [3 ]
Lee, Gary Kee Khoon [1 ]
Sirigina, Rajendra Prasad [2 ]
How, Shannon Shi Qi [1 ]
Hung, Terence Gih Guang [1 ]
Affiliations
[1] Cent Technol & Strategy Grp, Future Intelligence Technol, Rolls Royce, Singapore, Singapore
[2] Nanyang Technol Univ, Rolls-Royce@NTU Corp Lab, Singapore, Singapore
[3] Cranfield Univ, Transport & Mfg IVHM Ctr, Sch Aerosp, Bedford MK43 0AL, England
Keywords
Class-imbalanced data; Heterogeneous features; Boosting algorithms; Feature selection; CatBoost; H2O GBM
DOI
10.1016/j.knosys.2021.107197
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Designing a smart and robust predictive model that can deal with imbalanced data and a heterogeneous set of features is paramount to its widespread adoption by practitioners. By smart, we mean the model is either parameter-free or works well with default parameters, avoiding the challenge of parameter tuning. Furthermore, a robust model should consistently achieve high accuracy regardless of the dataset (imbalanced, with a heterogeneous set of features) or domain (such as medical or financial). To this end, a computationally inexpensive yet robust predictive model named smart robust feature selection (SoFt) is proposed. SoFt involves selecting a learning algorithm and designing a filtering-based feature selection algorithm named multi evaluation criteria and Pareto (MECP). Two state-of-the-art gradient boosting methods (GBMs), CatBoost and H2O GBM, are considered as candidate learning algorithms. CatBoost is selected over H2O GBM due to its robustness with both default and tuned parameters. MECP uses multiple parameter-free feature scores to rank the features. SoFt is validated against CatBoost with a full feature set and against wrapper-based CatBoost. SoFt is robust and consistent on imbalanced datasets, i.e., both the mean and the standard deviation of the log loss are low across the folds of K-fold cross-validation. The features selected by MECP are also consistent, i.e., the feature subsets chosen by SoFt and by wrapper-based CatBoost agree across folds, demonstrating the effectiveness of MECP. For balanced datasets, however, MECP selects too few features, and hence the log loss of SoFt is significantly higher than that of CatBoost with a full feature set. (C) 2021 Elsevier B.V. All rights reserved.
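The abstract describes MECP only at a high level: score every feature under several parameter-free filter criteria, keep the Pareto-optimal features, and train CatBoost with default parameters. The Python sketch below illustrates that idea under stated assumptions; the particular scores (mutual information and the ANOVA F-statistic), the strict-dominance Pareto rule, and the toy imbalanced dataset are all illustrative choices, not the paper's actual MECP criteria.

    # Minimal sketch of MECP-style filtering, assuming mutual information
    # and the ANOVA F-statistic as the parameter-free criteria (the paper's
    # exact criteria are not given in the abstract).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif, f_classif
    from catboost import CatBoostClassifier

    def pareto_front(scores: np.ndarray) -> np.ndarray:
        """Indices of features not strictly dominated on all criteria.

        scores: (n_features, n_criteria), higher is better in every column.
        """
        n = scores.shape[0]
        keep = np.ones(n, dtype=bool)
        for i in range(n):
            others = np.delete(scores, i, axis=0)
            # Feature i is dominated if some other feature scores >= on
            # every criterion and strictly > on at least one.
            dominated = np.any(
                np.all(others >= scores[i], axis=1)
                & np.any(others > scores[i], axis=1)
            )
            keep[i] = not dominated
        return np.flatnonzero(keep)

    # Toy class-imbalanced dataset standing in for the paper's benchmarks.
    X, y = make_classification(n_samples=2000, n_features=30,
                               n_informative=5, weights=[0.9, 0.1],
                               random_state=0)

    # Two parameter-free filter scores, stacked column-wise per feature.
    mi = mutual_info_classif(X, y, random_state=0)
    f_stat, _ = f_classif(X, y)
    selected = pareto_front(np.column_stack([mi, f_stat]))
    print(f"kept {len(selected)} of {X.shape[1]} features")

    # "Smart": CatBoost is trained with default parameters, no tuning.
    model = CatBoostClassifier(verbose=0, random_seed=0)
    model.fit(X[:, selected], y)

Per the abstract's evaluation protocol, one would then compare the log loss of this model against CatBoost trained on the full feature set across the folds of K-fold cross-validation, checking both the mean and the standard deviation.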
Pages: 10