Comparison of term frequency and document frequency based feature selection metrics in text categorization

Cited by: 91
Authors
Azam, Nouman [1]
Yao, JingTao [1]
Affiliations
[1] Univ Regina, Dept Comp Sci, Regina, SK S4S 0A2, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Text categorization; Feature selection metrics; Term frequency; Document frequency; Algorithm
DOI
10.1016/j.eswa.2011.09.160
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized, or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on either the term frequency or the document frequency of a word. We focus on the relative importance of these frequencies for feature selection metrics. For this purpose, the document frequency based metrics of discriminative power measure and GINI index were examined with term frequency. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets. (C) 2011 Elsevier Ltd. All rights reserved.
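The distinction between the two frequencies can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy corpus, the whitespace tokenizer, and the helper names `class_counts` and `gini_score` are assumptions made for illustration. The score uses one common form of the Gini index for text features, sum over classes of P(t|c)^2 * P(c|t)^2, with P(c|t) estimated as a simple share of counts; the exact formulation used in the paper may differ.

```python
from collections import Counter, defaultdict

def class_counts(docs):
    """Return per-class term frequencies, document frequencies, and class sizes.

    docs is a list of (label, text) pairs; tokenization is a plain
    lowercase whitespace split (an assumption for this sketch).
    """
    tf = defaultdict(Counter)  # tf[label][term]: total occurrences of term in the class
    df = defaultdict(Counter)  # df[label][term]: documents in the class containing term
    n_tokens = Counter()       # total tokens per class
    n_docs = Counter()         # total documents per class
    for label, text in docs:
        tokens = text.lower().split()
        tf[label].update(tokens)
        df[label].update(set(tokens))  # each document counts a term at most once
        n_tokens[label] += len(tokens)
        n_docs[label] += 1
    return tf, df, n_tokens, n_docs

def gini_score(term, counts, totals):
    """Gini-index-style score: sum over classes of P(t|c)^2 * P(c|t)^2.

    counts[label][term] is a per-class count (term or document frequency),
    totals[label] is the matching class total (tokens or documents).
    """
    term_total = sum(counts[label][term] for label in counts)
    if term_total == 0:
        return 0.0
    score = 0.0
    for label in counts:
        p_t_given_c = counts[label][term] / totals[label]
        p_c_given_t = counts[label][term] / term_total
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score

# Hypothetical toy corpus with two classes.
docs = [
    ("grain", "wheat wheat wheat harvest"),
    ("grain", "corn harvest report"),
    ("trade", "wheat export tariff"),
    ("trade", "tariff tariff trade deal"),
]

tf, df, n_tokens, n_docs = class_counts(docs)
tf_score = gini_score("wheat", tf, n_tokens)  # term frequency based estimate
df_score = gini_score("wheat", df, n_docs)    # document frequency based estimate
print(tf["grain"]["wheat"], df["grain"]["wheat"])
print(tf_score, df_score)
```

In this toy corpus, "wheat" appears in one document of each class, so document frequency cannot separate the classes for that word, while term frequency records that it occurs three times in the "grain" class and once in the "trade" class.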
Pages: 4760-4768
Number of pages: 9