Power Law for Text Categorization

Authors
Liu, Wuying [1]
Wang, Lin [2]
Yi, Mianzhu [1]
Affiliations
[1] PLA Univ Foreign Languages, Luoyang 471003, Henan, Peoples R China
[2] Natl Univ Def Technol, Changsha 410073, Hunan, Peoples R China
Keywords
Text Categorization; Power Law; Online Binary TC; Batch Multi-Category TC; TREC;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text categorization (TC) is a challenging problem, and TC algorithms are used in many applications. This paper addresses the online multi-category TC problem, abstracted from the applications of online binary TC and batch multi-category TC. Most applications are concerned with the space-time performance of TC algorithms. By investigating the token frequency distribution in an email collection and a Chinese web document collection, this paper re-examines the power law and proposes a random sampling ensemble Bayesian (RSEB) TC algorithm. Supported by a token-level memory that stores labeled documents, the RSEB algorithm uses a text retrieval approach to solve text categorization problems. The experimental results show that the RSEB algorithm achieves state-of-the-art performance at greatly reduced space-time requirements in both the TREC email spam filtering task and the Chinese web document classification task.
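The power-law observation about token frequencies can be illustrated with a minimal sketch. This is not the RSEB algorithm and not the authors' code. It only assumes that token frequency f at rank r roughly follows f ≈ C · r^(-s) and estimates s by a least-squares fit in log-log space. The function names `rank_frequency` and `fit_power_law_exponent` are hypothetical, and whitespace tokenization is an assumption made for the example.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_power_law_exponent(frequencies):
    """Estimate s in f ~ C * r**(-s) by least squares on log(f) vs log(r).

    `frequencies` must be positive and ordered by rank (rank 1 first).
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; return s > 0.
    return -sxy / sxx


if __name__ == "__main__":
    # Hypothetical toy corpus; a real study would use a full collection.
    text = "the cat sat on the mat and the dog sat on the log"
    freqs = rank_frequency(text.split())
    print("rank-ordered frequencies:", freqs)
    print("estimated exponent:", fit_power_law_exponent(freqs))
```

A log-log least-squares fit is the simplest way to check for power-law behaviour, but it is sensitive to the low-frequency tail, so the estimate should be read as a rough indicator rather than a rigorous test.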
Pages: 131-143
Page count: 13