A new feature selection method for text classification

Cited by: 8
Authors
Uchyigit, Gulden [1]
Clark, Keith [1]
Affiliations
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London SW7 2AZ, England
Keywords
feature selection; text classification; statistical inference
DOI
10.1142/S0218001407005466
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that help determine a document's class; the rest add noise, can make the results unreliable, and significantly increase computation time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods for text classification. Ten feature selection methods were evaluated, including a new feature selection method called the GU metric. The other methods evaluated are: the Chi-Squared (χ²) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data set with the Naive Bayesian probabilistic classifier.
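As an illustration of filter-style feature selection, the following is a minimal sketch of scoring terms with the Chi-Squared statistic, one of the baseline methods named in the abstract, followed by a generic F-beta score. It is not the GU metric, which the abstract does not define, and it is not the authors' code. The function names, the 2x2 contingency-table layout, and the reading of the F2 score as F-beta with beta = 2 are assumptions made for this example.

```python
from collections import Counter


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, k):
    """Return the k terms with the highest chi-squared score for `target`.

    docs is a list of token lists; labels holds one class label per document.
    Ties are broken alphabetically so the result is deterministic.
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()  # document frequencies per side
    for tokens, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]
        scores[term] = chi_squared(a, b, n_pos - a, n_neg - b)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [["goal", "team"], ["goal", "match"], ["cpu", "team"], ["cpu", "disk"]]
    labels = ["sport", "sport", "tech", "tech"]
    print(select_features(docs, labels, "sport", 2))
```

Note that the Chi-Squared statistic is two-sided: a term that appears only outside the target class scores as highly as one that appears only inside it, so the selected set can include negative indicators as well as positive ones.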
Pages: 423-438
Page count: 16