Comparison of term frequency and document frequency based feature selection metrics in text categorization

Cited by: 91
Authors
Azam, Nouman [1]
Yao, JingTao [1]
Affiliations
[1] Univ Regina, Dept Comp Sci, Regina, SK S4S 0A2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Text categorization; Feature selection metrics; Term frequency; Document frequency; ALGORITHM;
DOI
10.1016/j.eswa.2011.09.160
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on the term frequency or the document frequency of a word. We focus on the relative importance of these frequencies for feature selection metrics. The document frequency based metrics of discriminative power measure and GINI index were examined with term frequency for this purpose. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets. (C) 2011 Elsevier Ltd. All rights reserved.
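As an illustration of the two quantities the abstract contrasts, the following minimal sketch counts term frequency (every occurrence of a word) and document frequency (the number of documents containing the word) on a toy corpus. The function name `term_and_document_frequency` and the sample documents are hypothetical and are not taken from the paper; the sketch does not implement the discriminative power measure or the GINI index.

```python
from collections import Counter


def term_and_document_frequency(docs):
    """Return (tf, df) Counters for a list of tokenized documents.

    Hypothetical helper for illustration only, not the paper's method.
    """
    tf = Counter()  # term frequency: total occurrences across the corpus
    df = Counter()  # document frequency: number of documents containing the word
    for tokens in docs:
        tf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts at most once per word
    return tf, df


# Toy corpus (an assumption made for this example).
docs = [
    "oil prices rise as oil supply falls".split(),
    "grain exports rise".split(),
    "oil exports fall".split(),
]
tf, df = term_and_document_frequency(docs)
print(tf["oil"], df["oil"])  # the two counts differ when a word repeats within a document
```

Here "oil" appears twice in the first document, so its term frequency exceeds its document frequency; a metric built on one count can therefore rank features differently from a metric built on the other.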
Pages: 4760-4768
Number of pages: 9
Related papers
  • [11] Weighted Document Frequency for Feature Selection in Text Classification
    Li, Baoli
    Yan, Qiuling
    Xu, Zhenqiang
    Wang, Guicai
    [J]. PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 132 - 135
  • [12] Five new feature selection metrics in text categorization
    Song, Fengxi
    Zhang, David
    Xu, Yong
    Wang, Jizhong
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21 (06) : 1085 - 1101
  • [14] Feature selection based on term frequency deviation rate for text classification
    Zhou, Hongfang
    Ma, Yiming
    Li, Xiang
    [J]. APPLIED INTELLIGENCE, 2021, 51 (06) : 3255 - 3274
  • [15] Text categorization based on frequent patterns with term frequency
    Chen, XY
    Chen, Y
    Wang, L
    Hu, YF
    [J]. PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2004, : 1610 - 1615
  • [16] Memetic feature selection for multilabel text categorization using label frequency difference
    Lee, Jaesung
    Yu, Injun
    Park, Jaegyun
    Kim, Dae-Won
    [J]. INFORMATION SCIENCES, 2019, 485 : 263 - 280
  • [17] An Evaluation of Existing and New Feature Selection Metrics in Text Categorization
    Tasci, Serafettin
    Gungor, Tunga
    [J]. 23RD INTERNATIONAL SYMPOSIUM ON COMPUTER AND INFORMATION SCIENCES, 2008, : 238 - 243
  • [18] A non-redundant feature selection method for text categorization based on term co-occurrence frequency and mutual information
    Farek, Lazhar
    Benaidja, Amira
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (07) : 20193 - 20214
  • [20] Fusing Gini Index and Term Frequency for Text Feature Selection
    Wu, Lin
    Wang, Yongbin
    Zhang, Shengyan
    Zhang, Yannan
    [J]. 2017 IEEE THIRD INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2017), 2017, : 280 - 283