An extended document frequency metric for feature selection in text categorization

Authors
Xu, Yan [1]
Wang, Bin [1]
Li, JinTao [1]
Jing, Hongfang [1]
Affiliation
[1] Chinese Acad Sci, Inst Comp Technol, 6 Kexueyuan South Rd, Beijing, Peoples R China
Keywords
rough set; text categorization; feature selection; document frequency
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Feature selection plays an important role in text categorization. Many sophisticated feature selection methods, such as Information Gain (IG), Mutual Information (MI), and the χ² statistic (CHI), have been proposed. Compared with these methods, however, a very simple technique called Document Frequency thresholding (DF) has been shown to be one of the best on both Chinese and English text data. One problem is that DF is usually regarded as an empirical approach and does not take the Term Frequency (TF) factor into account. In this paper, we put forward an extended DF method called TFDF, which incorporates the TF factor. Experimental results on the Reuters-21578 and OHSUMED corpora show that TFDF performs much better than the original DF method.
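For orientation, DF thresholding can be sketched in a few lines of Python. The abstract does not give the TFDF formula, so the `min_tf` parameter below is only an assumed illustration of how a term-frequency condition might be combined with document frequency; it should not be read as the authors' definition. The function names and the toy documents are hypothetical.

```python
from collections import Counter


def document_frequency(docs, min_tf=1):
    """Count, per term, the documents in which it occurs at least min_tf times.

    With min_tf=1 this is the ordinary document frequency (DF).
    min_tf > 1 is an assumed TF-aware variant, not the paper's TFDF formula.
    """
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        df.update(term for term, count in tf.items() if count >= min_tf)
    return df


def select_features(docs, df_threshold, min_tf=1):
    """Keep the terms whose (possibly TF-aware) DF reaches df_threshold."""
    df = document_frequency(docs, min_tf)
    return sorted(term for term, n in df.items() if n >= df_threshold)


# Toy tokenized documents (hypothetical data).
docs = [
    "oil price rise oil market".split(),
    "oil export market".split(),
    "wheat price market".split(),
]

plain_df = select_features(docs, df_threshold=2)            # ordinary DF
tf_aware = select_features(docs, df_threshold=1, min_tf=2)  # assumed variant
print(plain_df)
print(tf_aware)
```

In this sketch, plain DF treats a term that appears once in a document the same as one that appears many times, which is the limitation the abstract points to; the `min_tf` condition is one simple way to make the count sensitive to repeated occurrences.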
Pages: 71 / +
Page count: 3