Exploiting extremely rare features in text categorization

Cited by: 0
Authors
Schonhofen, Peter [1]
Benczur, Andras A. [1]
Affiliations
[1] Hungarian Acad Sci, Informat Lab, Comp & Automat Res Inst, H-1111 Budapest, Hungary
Source
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
One of the first steps of document classification, clustering and many other information retrieval tasks is to discard words occurring only a few times in the corpus, on the assumption that they contribute little to the bag-of-words representation. However, as we will show, rare n-grams and other similar features indicate surprisingly well whether two documents belong to the same category, and can thus aid classification. In our experiments over four corpora, we found that, with the size of the training set held constant, 5-25% of the test set can be classified essentially for free based on rare features without any loss of accuracy, and even with an improvement of 0.6-1.6%.
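The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. The function names, the use of word bigrams, whitespace tokenization, the `max_df` threshold and the majority vote are all assumptions made for illustration. A test document that shares a rare n-gram with labeled documents inherits their label; the rest are returned as `None` and would go to a conventional classifier.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def classify_by_rare_features(train_docs, train_labels, test_docs, n=2, max_df=3):
    """Label test documents that share a rare n-gram with labeled documents.

    Assumption (not from the paper): an n-gram is "rare" when it occurs in
    at most `max_df` documents of the combined corpus. Test documents with
    no rare n-gram in common with the training set get None.
    """
    train_sets = [word_ngrams(d, n) for d in train_docs]
    test_sets = [word_ngrams(d, n) for d in test_docs]

    # Document frequency of every n-gram over training and test documents.
    df = Counter()
    for features in train_sets + test_sets:
        df.update(features)

    # Map each rare n-gram to the labels of the training documents containing it.
    labels_by_feature = defaultdict(list)
    for features, label in zip(train_sets, train_labels):
        for feature in features:
            if df[feature] <= max_df:
                labels_by_feature[feature].append(label)

    predictions = []
    for features in test_sets:
        votes = Counter()
        for feature in features:
            votes.update(labels_by_feature.get(feature, []))
        # Majority vote over the labels reached through shared rare n-grams.
        predictions.append(votes.most_common(1)[0][0] if votes else None)
    return predictions


# Toy data, invented for this example.
train_docs = [
    "the striker scored a stunning bicycle kick in the final",
    "the central bank raised the benchmark rate again",
]
train_labels = ["sports", "finance"]
test_docs = [
    "fans replayed the bicycle kick for weeks",
    "the weather was mild",
]
predictions = classify_by_rare_features(train_docs, train_labels, test_docs)
```

In this toy run the first test document shares the bigram "bicycle kick" with the sports document, while the second shares nothing and stays unlabeled. On a real corpus the rarity threshold would need tuning, and frequent n-grams would be excluded by it.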
Pages: 759-766
Page count: 8