A Comprehensive Study of Eleven Feature Selection Algorithms and their Impact on Text Classification

Cited by: 0
Authors
Vora, Suchi [1]
Yang, Hui [1]
Affiliations
[1] San Francisco State Univ, Dept Comp Sci, San Francisco, CA 94132 USA
Source
Keywords
feature selection/ranking algorithms; classification algorithms; comparison and evaluation;
DOI
Not available
Chinese Library Classification
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
Feature selection is routinely used as a preprocessing step to remove irrelevant features and combat the "curse of dimensionality". In contrast to dimensionality reduction techniques such as PCA, feature selection picks its features from the original feature space; hence, they are easy to interpret. A large number of feature selection algorithms have been proposed in the literature. This raises a critical question: which algorithm should one use? Moreover, how does a feature selection method affect the performance of a given classification algorithm? This paper addresses these questions by (1) presenting an open source software system that integrates eleven feature selection algorithms and five common classifiers; and (2) systematically comparing and evaluating the selected features and their impact on these five classifiers using five datasets. Specifically, the system includes ten commonly adopted filter-based feature selection algorithms: ChiSquare, Information Gain, Fisher Score, Gini Index, Kruskal-Wallis, Laplacian Score, ReliefF, FCBF, CFS, and mRmR. It also includes one state-of-the-art embedded approach built upon Random Forests. The five classifiers are SVM, Random Forests, Naive Bayes, kNN and C4.5 Decision Tree. Comprehensive evaluations consisting of around 1000 experiments were conducted over five text datasets. Several approximately equivalent groups (AEG), in which the algorithms of a group select highly similar features, have been identified; for example, Chi-square and Information Gain form an AEG. Suitable combinations of feature selection algorithm and classifier have also been identified: Gini Index or Kruskal-Wallis together with SVM often produces classification performance that is comparable to or better than that obtained with all the original features. Such results will provide empirical guidelines for the data analytics community.
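The idea of an approximately equivalent group can be illustrated with a minimal sketch. It is not the system described in the abstract: it scores binary term features with the standard 2x2 chi-square statistic and with information gain, keeps the top k from each, and compares the two selections. The toy data, the function names, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration; the abstract does not state how feature similarity was measured.

```python
import math

def contingency(X, y, j):
    """2x2 counts for binary feature j against a binary label:
    a = feature on, label 1; b = feature on, label 0;
    c = feature off, label 1; d = feature off, label 0."""
    a = b = c = d = 0
    for row, label in zip(X, y):
        if row[j] and label:
            a += 1
        elif row[j]:
            b += 1
        elif label:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table (0.0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def entropy(counts):
    """Shannon entropy in bits of a list of counts."""
    total = sum(counts)
    return -sum(k / total * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """Label entropy minus label entropy conditioned on the feature."""
    n = a + b + c + d
    h_cond = 0.0
    for pos, neg in ((a, b), (c, d)):
        if pos + neg:
            h_cond += (pos + neg) / n * entropy([pos, neg])
    return entropy([a + c, b + d]) - h_cond

def top_k(X, y, score, k):
    """Indices of the k highest-scoring features (ties keep index order)."""
    n_features = len(X[0])
    scores = [score(*contingency(X, y, j)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: -scores[j])
    return set(ranked[:k])

def jaccard(s, t):
    """Overlap of two non-empty feature sets, between 0 and 1."""
    return len(s & t) / len(s | t)

# Hypothetical toy corpus: 8 documents, 4 binary term features.
y = [1, 1, 1, 1, 0, 0, 0, 0]
X = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]

chi_top = top_k(X, y, chi_square, k=2)
ig_top = top_k(X, y, information_gain, k=2)
print(sorted(chi_top), sorted(ig_top), jaccard(chi_top, ig_top))
```

On this toy data both criteria select features 0 and 1, so the overlap is 1.0. On a real corpus the overlap would depend on k and on the data, and a threshold for calling two algorithms "approximately equivalent" would have to be chosen separately.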
Pages: 440-449
Page count: 10