A new feature selection method for text classification

被引:8
|
作者
Uchyigit, Gulden [1 ]
Clark, Keith [1 ]
机构
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London SW7 2AZ, England
关键词
feature selection; text classification; statistical inference;
D O I
10.1142/S0218001407005466
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem with text classification problems is the high dimensionality of the feature space. Only a small subset of these words are feature words which can be used in determining a document's class, while the rest adds noise and can make the results unreliable and significantly increase computational time. A common approach in dealing with this problem is feature selection where the number of words in the feature space are significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study including the new feature selection method, called the GU metric. The other feature selection methods evaluated in this study are: Chi-Squared (x(2)) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F-1 and F-2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
引用
收藏
页码:423 / 438
页数:16
相关论文
共 50 条
  • [1] Efficient Method for Feature Selection in Text Classification
    Sun, Jian
    Zhang, Xiang
    Liao, Dan
    Chang, Victor
    [J]. 2017 INTERNATIONAL CONFERENCE ON ENGINEERING AND TECHNOLOGY (ICET), 2017,
  • [2] Feature Selection Method of Text Tendency Classification
    Li, Yanling
    Dai, Guanzhong
    Li, Gang
    [J]. FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 34 - +
  • [3] An enhanced feature selection method for text classification
    Kang, Jinbeom
    Lee, Eunshil
    Hong, Kwanghee
    Park, Jeahyun
    Kim, Taehwan
    Park, Juyoung
    Choi, Joongmin
    Yang, Jaeyoung
    [J]. PROCEEDINGS OF THE SECOND IASTED INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE, 2006, : 36 - 41
  • [4] A new feature selection method for handling redundant information in text classification
    You-wei Wang
    Li-zhou Feng
    [J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19 : 221 - 234
  • [5] A new feature selection method for handling redundant information in text classification
    Wang, You-wei
    Feng, Li-zhou
    [J]. FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2018, 19 (02) : 221 - 234
  • [6] A New Feature Selection Method for Text Classification Based on Independent Feature Space Search
    Liu, Yong
    Ju, Shenggen
    Wang, Junfeng
    Su, Chong
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2020, 2020
  • [7] A new approach to feature selection in text classification
    Wang, Y
    Wang, XJ
    [J]. PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3814 - 3819
  • [8] A new feature selection method based on frequent and associated itemsets for text classification
    Farghaly, Heba Mamdouh
    Abd El-Hafeez, Tarek
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (25):
  • [9] A parallel feature selection method study for text classification
    Li, Zhao
    Lu, Wei
    Sun, Zhanquan
    Xing, Weiwei
    [J]. NEURAL COMPUTING & APPLICATIONS, 2017, 28 : S513 - S524
  • [10] A Hybrid Feature Selection Method For Vietnamese Text Classification
    Nguyen Tri Hai
    Tuan Dinh Le
    Nguyen Hoang Nghia
    Vu Thanh Nguyen
    [J]. 2015 SEVENTH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2015, : 91 - 96