Uncertainty-based noise reduction and term selection in text categorization

Authors
Peters, C [1]
Koster, CHA [1]
Affiliation
[1] Univ Nijmegen, Dept Comp Sci, Nijmegen, Netherlands
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper introduces a new criterion for term selection, based on the notion of Uncertainty. Term selection according to this criterion works by eliminating noisy terms on a class-by-class basis, rather than by selecting the most significant terms. Uncertainty-based term selection (UC) is compared with a number of other criteria, namely Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general "install-and-forget" term selection mechanism. We also describe and evaluate a hybrid term selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best of the remaining terms.
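The two-stage shape of the hybrid technique can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not define the Uncertainty criterion, so the noise filter here is a hypothetical stand-in (a Document Frequency threshold), and the second stage uses a common textbook form of Information Gain for a binary class. All function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def information_gain(docs, labels, term, target):
    """Textbook IG between presence of `term` and membership in class `target`.

    Assumption: this is the common mutual-information form over the 2x2
    (term present?, in class?) table; the paper's exact variant may differ.
    """
    n = len(docs)
    counts = Counter((term in set(tokens), label == target)
                     for tokens, label in zip(docs, labels))
    ig = 0.0
    for t in (True, False):
        for c in (True, False):
            joint = counts[(t, c)] / n
            if joint == 0.0:
                continue  # an empty cell contributes nothing
            p_t = (counts[(t, True)] + counts[(t, False)]) / n
            p_c = (counts[(True, c)] + counts[(False, c)]) / n
            ig += joint * math.log2(joint / (p_t * p_c))
    return ig

def hybrid_select(docs, labels, target, is_noisy, k):
    """Stage 1: drop terms flagged by `is_noisy`. Stage 2: keep the top k by IG."""
    vocab = sorted({t for tokens in docs for t in tokens})
    kept = [t for t in vocab if not is_noisy(t)]
    kept.sort(key=lambda t: information_gain(docs, labels, t, target),
              reverse=True)
    return kept[:k]

# Toy corpus (invented): four tokenized documents in two classes.
docs = [["wheat", "price", "the"], ["wheat", "export", "the"],
        ["oil", "price", "the"], ["oil", "barrel", "the"]]
labels = ["grain", "grain", "crude", "crude"]

df = document_frequency(docs)
# Hypothetical noise filter standing in for UC: terms seen in fewer than 2 documents.
selected = hybrid_select(docs, labels, "grain", lambda t: df[t] < 2, k=2)
```

Passing the noise filter as a predicate keeps the two stages independent, so a real Uncertainty-based filter could replace the DF threshold without touching the ranking stage.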
Pages: 248-267
Page count: 20