A comparative study on text representation schemes in text categorization

Cited by: 35
Authors
Song, FX
Liu, SH
Yang, JY
Affiliation
[1] Department of Computer Science, Nanjing University of Science and Technology
Keywords
text categorization; text representation; support vector machines; multi-way analysis of variance; pattern recognition;
DOI
10.1007/s10044-005-0256-3
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is well known that the classification effectiveness of the text categorization system is not simply a matter of learning algorithms. Text representation factors are also at work. This paper will consider the ways in which the effectiveness of text classifiers is linked to the five text representation factors: "stop words removal", "word stemming", "indexing", "weighting", and "normalization". Statistical analyses of experimental results show that performing "normalization" can always promote effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contradictory to common sense, a simple binary indexing method can sometimes be helpful for text categorization.
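The representation factors named in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental setup: the function names `tokenize` and `represent`, the tiny stop-word list, the smoothed IDF formula, and the choice of L2 normalization are all assumptions made for illustration, and word stemming is omitted to keep the example dependency-free.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "of", "is", "and"}


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def represent(docs, indexing="tf", use_idf=True, normalize=True):
    """Turn documents into vectors under the chosen representation options.

    indexing:  "tf" keeps raw term counts, "binary" records presence only.
    use_idf:   multiply each entry by log(N / df) + 1 (one common IDF variant).
    normalize: scale each document vector to unit Euclidean length.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for term in vocab:
            value = float(counts[term])
            if indexing == "binary":
                value = 1.0 if value > 0 else 0.0
            if use_idf:
                value *= math.log(n_docs / df[term]) + 1.0
            vec.append(value)
        if normalize:
            norm = math.sqrt(sum(v * v for v in vec))
            if norm > 0:
                vec = [v / norm for v in vec]
        vectors.append(vec)
    return vocab, vectors
```

Calling `represent(docs, indexing="binary", use_idf=False)` corresponds to the simple binary indexing the abstract mentions, while toggling `normalize` shows the effect of the normalization factor on vector length.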
Pages: 199-209
Number of pages: 11
Journal
Pattern Analysis and Applications, 2005, 8: 199-209