Relative term-frequency based feature selection for text categorization

Cited by: 0
Authors
Yang, SM [1]
Wu, XB [1]
Deng, ZH [1]
Zhang, M [1]
Yang, DQ [1]
Affiliation
[1] Peking Univ, Dept Comp Sci & Technol, Beijing 100871, Peoples R China
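Keywords
text categorization; feature selection; relative term frequency
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract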
Automatic feature selection methods such as document frequency (DF), information gain (IG), and mutual information (MI) are commonly applied in the preprocessing stage of text categorization to reduce the originally high feature dimension to a manageable level and to reduce noise, thereby improving precision. These methods generally assess a term by counting its occurrences within individual categories or across the entire corpus, where "occurring in a document" simply means occurring at least once. A major drawback of this measure is that, within a single document, a recurrent term may be counted the same as a rare term, although the former is more informative and should be less likely to be removed. In this paper we propose an approach to overcome this problem: the occurrence count is adjusted according to the relative term frequency, which emphasizes the words that recur within each document. Although the adjustment can be applied to any feature selection method, we implemented it on several of them and observed notable improvements in performance.
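The abstract does not state the exact adjustment formula, so the following is only a minimal sketch of the idea under an assumption: each document contributes to a term's count not 1 but the term's frequency divided by the frequency of the most frequent term in that document. The function names `binary_df` and `relative_tf_df` and this normalization are illustrative choices, not the authors' definitions.

```python
from collections import Counter


def binary_df(docs, term):
    """Classic document frequency: a document counts as 1 if the term occurs at least once."""
    return sum(1 for doc in docs if term in doc)


def relative_tf_df(docs, term):
    """Adjusted count (assumed form): each document contributes
    tf(term, doc) / max tf in doc, so recurrent terms weigh more than rare ones."""
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        if counts[term]:  # skip documents that do not contain the term
            total += counts[term] / max(counts.values())
    return total


# Toy corpus of tokenized documents (hypothetical data for illustration only).
docs = [
    "price price price market report".split(),
    "market price update".split(),
    "weather report".split(),
]

for term in ("price", "report"):
    print(term, binary_df(docs, term), round(relative_tf_df(docs, term), 3))
```

In this toy corpus both "price" and "report" occur in two documents, so the binary count cannot separate them. Under the assumed adjustment, "price" scores 2.0 while "report" scores about 1.33, because "report" is a rare term in the first document. The adjusted count could then replace the binary count wherever a method such as DF, IG, or MI needs per-term document counts.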
Pages: 1432-1436
Page count: 5