Exploiting extremely rare features in text categorization

Authors
Schonhofen, Peter [1]
Benczur, Andras A. [1]
Affiliation
[1] Hungarian Acad Sci, Informat Lab, Comp & Automat Res Inst, H-1111 Budapest, Hungary
Source
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
One of the first steps of document classification, clustering and many other information retrieval tasks is to discard words occurring only a few times in the corpus, based on the assumption that they have little contribution to the bag of words representation. However, as we will show, rare n-grams and other similar features are able to indicate surprisingly well if two documents belong to the same category, and thus can aid classification. In our experiments over four corpora, we found that while keeping the size of the training set constant, 5-25% of the test set can be classified essentially for free based on rare features without any loss of accuracy, even experiencing an improvement of 0.6-1.6%.
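The idea in the abstract, that a rare n-gram shared by two documents hints that they belong to the same category, can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not describe in detail. The function names, the use of word bigrams, the `max_df` threshold and the majority-vote rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def classify_by_rare_features(train, test, n=2, max_df=2):
    """Label test documents that share a rare n-gram with training documents.

    train: list of (text, label) pairs; test: list of texts.
    An n-gram counts as "rare" here if it occurs in at most max_df documents
    of the combined corpus (a hypothetical threshold, not from the paper).
    Returns one label per test document, or None when no rare n-gram
    matches or the vote is tied.
    """
    train_grams = [word_ngrams(text, n) for text, _ in train]
    test_grams = [word_ngrams(text, n) for text in test]

    # Document frequency of each n-gram over train and test together.
    df = Counter()
    for grams in train_grams + test_grams:
        df.update(grams)

    # Map each rare n-gram to the labels of the training documents containing it.
    labels_by_gram = defaultdict(list)
    for grams, (_, label) in zip(train_grams, train):
        for gram in grams:
            if df[gram] <= max_df:
                labels_by_gram[gram].append(label)

    predictions = []
    for grams in test_grams:
        votes = Counter()
        for gram in grams:
            votes.update(labels_by_gram.get(gram, []))
        top = votes.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            predictions.append(None)  # leave this document to a regular classifier
        else:
            predictions.append(top[0][0])
    return predictions
```

Documents that receive `None` would be passed on to a conventional bag-of-words classifier. Only the documents labeled here correspond to the portion of the test set that the abstract describes as classified "essentially for free".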
Pages: 759-766
Page count: 8