Power Law for Text Categorization

Authors
Liu, Wuying [1]
Wang, Lin [2]
Yi, Mianzhu [1]
Affiliations
[1] PLA Univ Foreign Languages, Luoyang 471003, Henan, Peoples R China
[2] Natl Univ Def Technol, Changsha 410073, Hunan, Peoples R China
Keywords
Text Categorization; Power Law; Online Binary TC; Batch Multi-Category TC; TREC
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text categorization (TC) is a challenging issue, and the corresponding algorithms can be used in many applications. This paper addresses the online multi-category TC problem, abstracted from the applications of online binary TC and batch multi-category TC. Most applications are concerned with the space-time performance of TC algorithms. Through an investigation of the token frequency distribution in an email collection and a Chinese web document collection, this paper re-examines the power law and proposes a random sampling ensemble Bayesian (RSEB) TC algorithm. Supported by a token-level memory that stores labeled documents, the RSEB algorithm uses a text retrieval approach to solve text categorization problems. The experimental results show that the RSEB algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements in both the TREC email spam filtering task and the Chinese web document classification task.
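The token frequency investigation mentioned in the abstract can be illustrated with a small sketch. It is not the authors' RSEB algorithm or their measurement procedure, which the abstract does not detail. It only shows one common way to check for a power law: rank tokens by frequency and fit a straight line to log(frequency) against log(rank). The tokenizer, the function names `token_frequencies` and `fit_rank_frequency_slope`, and the toy corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def token_frequencies(documents):
    """Count lowercase word tokens across documents (a deliberately simple tokenizer)."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def fit_rank_frequency_slope(counts):
    """Least-squares slope and intercept of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf pattern. The values fitted in the
    paper are not given in the abstract, so none are assumed here.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens to fit a line")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical toy corpus: token "t{r}" appears about 1000 / r times.
    toy_docs = [" ".join(f"t{r}" for _ in range(1000 // r)) for r in range(1, 51)]
    slope, intercept = fit_rank_frequency_slope(token_frequencies(toy_docs))
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On a real collection, the lowest-frequency tail usually departs from the fitted line, so the slope is best read as a rough descriptive statistic rather than a formal test of the power law.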
Pages: 131-143
Page count: 13