A Comprehensive Study of Eleven Feature Selection Algorithms and their Impact on Text Classification

被引:0
|
作者
Vora, Suchi [1 ]
Yang, Hui [1 ]
机构
[1] San Francisco State Univ, Dept Comp Sci, San Francisco, CA 94132 USA
来源
关键词
feature selection/ranking algorithms; classification algorithms; comparison and evaluation;
D O I
暂无
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Feature selection has been routinely used as a preprocessing step to remove irrelevant features and conquer the "curse of dimensionality". In contrast to dimensionality reduction techniques such as PCA, the resulting features from feature selection are selected from the original feature space; hence, easy to interpret. A large host of feature selection algorithms has been proposed in the literature. This has created a critical issue: which algorithm should one use? Moreover, how does a feature selection method affect the performance of a given classification algorithm? This paper addresses these issues by (1) presenting an open source software system that integrates eleven feature selection algorithms and five common classifiers; and (2) systematically comparing and evaluating the selected features and their impact over these five classifiers using five datasets. Specifically, this system includes ten commonly adopted filter-based feature selection algorithms: ChiSquare, Information Gain, Fisher Score, Gini Index, Kruskal-Wallis, Laplacian Score, ReliefF, FCBF, CFS, and mRmR. It also includes one state-of-the-art embedded approach built upon Random Forests. The five classifiers are SVM, Random Forests, Naive Bayes, kNN and C4.5 Decision Tree. Comprehensive evaluations consisting of around 1000 experiments were conducted over five text datasets. Several approximately equivalent groups (AEG), where algorithms in the same group select highly similar features, have been identified. Suitable feature-selection-classifier combinations have also been identified. For example, Chi-square and Information Gain form an AEG. Furthermore, Gini Index or Kruskal-Wallis together with SVM often produces classification performance that is comparable with or better than using all the original features. Such results will provide empirical guidelines for the data analytic community.
引用
收藏
页码:440 / 449
页数:10
相关论文
共 50 条
  • [41] Effective feature selection technique for text classification
    Seetha, Hari
    Murty, M. Narasimha
    Saravanan, R.
    [J]. INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT, 2015, 7 (03) : 165 - 184
  • [42] A new approach to feature selection in text classification
    Wang, Y
    Wang, XJ
    [J]. PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3814 - 3819
  • [43] Feature selection improves text classification accuracy
    不详
    [J]. IEEE INTELLIGENT SYSTEMS, 2005, 20 (06) : 75 - 75
  • [44] Feature selection for text classification with Naive Bayes
    Chen, Jingnian
    Huang, Houkuan
    Tian, Shengfeng
    Qu, Youli
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (03) : 5432 - 5435
  • [45] Higher order feature selection for text classification
    Bakus, J
    Kamel, MS
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2006, 9 (04) : 468 - 491
  • [46] Optimal Feature Selection for Imbalanced Text Classification
    Khurana A.
    Verma O.P.
    [J]. IEEE Transactions on Artificial Intelligence, 2023, 4 (01): : 135 - 147
  • [47] Composite Feature Extraction and Selection for Text Classification
    Wan, Chuan
    Wang, Yuling
    Liu, Yaoze
    Ji, Jinchao
    Feng, Guozhong
    [J]. IEEE ACCESS, 2019, 7 : 35208 - 35219
  • [48] Higher order feature selection for text classification
    Jan Bakus
    Mohamed S. Kamel
    [J]. Knowledge and Information Systems, 2006, 9 : 468 - 491
  • [49] Comparative Study of Feature Selection Methods for Medical Full Text Classification
    Adriano Goncalves, Carlos
    Lorenzo Iglesias, Eva
    Borrajo, Lourdes
    Camacho, Rui
    Seara Vieira, Adrian
    Goncalves, Celia Talma
    [J]. BIOINFORMATICS AND BIOMEDICAL ENGINEERING (IWBBIO 2019), PT II, 2019, 11466 : 550 - 560
  • [50] A comparative study of feature selection methods for binary text streams classification
    de Moraes, Matheus Bernardelli
    Sampaio Gradvohl, Andre Leon
    [J]. EVOLVING SYSTEMS, 2021, 12 (04) : 997 - 1013