Genetic Programming for Feature Selection Based on Feature Removal Impact in High-Dimensional Symbolic Regression

Cited by: 4
|
Authors
Al-Helali, Baligh [1 ,2 ]
Chen, Qi [1 ,2 ]
Xue, Bing [1 ,2 ]
Zhang, Mengjie [1 ,2 ]
Affiliations
[1] Victoria Univ Wellington, Ctr Data Sci & Artificial Intelligence, Wellington 6140, New Zealand
[2] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington 6140, New Zealand
Keywords
Feature selection; genetic programming; high dimensionality; symbolic regression; FEATURE RANKING; CLASSIFICATION; EVOLUTIONARY;
DOI
10.1109/TETCI.2024.3369407
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Symbolic regression (SR) is increasingly important for discovering mathematical models for various prediction tasks. It works by searching for the arithmetic expressions that best represent a target variable using a set of input features. However, as the number of features increases, the search process becomes more complex. To address high-dimensional symbolic regression, this work proposes a genetic programming-based feature selection method built on the impact that removing a feature has on the performance of SR models. Unlike existing Shapley value methods that simulate feature absence at the data level, the proposed approach removes features at the model level. This circumvents the production of unrealistic data instances, which is a major limitation of Shapley value and permutation-based methods. Moreover, after the feature importances are calculated, a cut-off strategy is proposed for selecting important features: a number of random features are injected, and their importances are used to automatically set a threshold. The experimental results on artificial and real-world high-dimensional data sets show that, compared with state-of-the-art feature selection methods using permutation importance and Shapley values, the proposed method not only improves the SR accuracy but also selects smaller sets of features.
Pages: 2269-2282
Number of pages: 14
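The abstract's two ideas can be illustrated with a minimal sketch. Note the simplification: in the paper, features are removed at the model level (pruned out of the evolved GP expression), whereas this sketch simulates removal by refitting without the feature; all function names (`removal_impact_importance`, `random_feature_cutoff`, `ols_fit`, `r2_score`) and the OLS learner are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def removal_impact_importance(fit, score, X, y):
    """Importance of feature j = performance drop when j is removed.

    Simplified stand-in for the paper's model-level removal: here the
    model is refit on the data with column j deleted.
    """
    base = score(fit(X, y), X, y)
    importances = []
    for j in range(X.shape[1]):
        X_reduced = np.delete(X, j, axis=1)
        importances.append(base - score(fit(X_reduced, y), X_reduced, y))
    return np.array(importances)

def random_feature_cutoff(fit, score, X, y, n_random=5, seed=0):
    """Inject random (noise) features and keep only the original
    features whose importance exceeds the best random importance."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_random))
    X_aug = np.hstack([X, noise])
    imp = removal_impact_importance(fit, score, X_aug, y)
    threshold = imp[X.shape[1]:].max()   # cut-off from the random features
    return np.flatnonzero(imp[:X.shape[1]] > threshold)

# Example learner (assumption, in place of a GP model): ordinary least
# squares with an intercept, scored by R^2.
def ols_fit(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2_score(coef, X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```

On data where only one feature drives the target, removing that feature collapses the score, so its importance dominates, while the injected noise features give the threshold that filters out the irrelevant ones.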