Uncertainty-based noise reduction and term selection in text categorization

Cited by: 0
Authors
Peters, C [1]
Koster, CHA [1]
Affiliation
[1] Univ Nijmegen, Dept Comp Sci, Nijmegen, Netherlands
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper introduces a new criterion for term selection, based on the notion of Uncertainty. Term selection according to this criterion eliminates noisy terms on a class-by-class basis rather than selecting the most significant ones. Uncertainty-based term selection (UC) is compared with a number of other criteria, such as Information Gain (IG), simplified chi-square (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general "install-and-forget" term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best terms.
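As a rough illustration of class-by-class term scoring with two of the baseline criteria named in the abstract, the sketch below computes Document Frequency and a simplified chi-square score on a toy corpus. The abstract gives no formulas, so the simplified chi-square form used here, P(t,c)P(~t,~c) - P(t,~c)P(~t,c), is an assumption taken from common usage in the text categorization literature. The Uncertainty criterion itself is not defined in the abstract and is deliberately not implemented. The function names, the toy documents and the class labels are all hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def simplified_chi2(docs, labels, target):
    """Score each term for one class with a simplified chi-square.

    Assumed form (not given in the abstract):
        P(t, c) * P(~t, ~c) - P(t, ~c) * P(~t, c)
    with probabilities estimated as document fractions.
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    in_c, out_c = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (in_c if y == target else out_c).update(set(tokens))

    scores = {}
    for t in set(in_c) | set(out_c):
        p_t_c = in_c[t] / n                   # term present, class c
        p_t_nc = out_c[t] / n                 # term present, other class
        p_nt_c = (n_c - in_c[t]) / n          # term absent, class c
        p_nt_nc = (n - n_c - out_c[t]) / n    # term absent, other class
        scores[t] = p_t_c * p_nt_nc - p_t_nc * p_nt_c
    return scores


def top_k(scores, k):
    """Return the k best-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical toy corpus: four tokenized documents, two classes.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "price"],
    ["wheat", "harvest", "price"],
    ["wheat", "export", "rise"],
]
labels = ["crude", "crude", "grain", "grain"]

crude_scores = simplified_chi2(docs, labels, "crude")
crude_terms = top_k(crude_scores, 2)
df = document_frequency(docs)
```

On this toy data the class-specific score ranks "oil" and "price" highest for the class "crude", whereas Document Frequency, which ignores class labels, favours "price" simply because it occurs in the most documents.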
Pages: 248-267
Page count: 20