A new Centroid-Based Classification model for text categorization

Cited by: 21
Authors
Liu, Chuan [1]
Wang, Wenyong [1, 2]
Tu, Guanghui [1]
Xiang, Yu [1]
Wang, Siyang [3]
Lv, Fengmao [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Sichuan, Peoples R China
[2] Shanghai Hefu Artificial Intelligence Technol Grp, Hefu Inst UESTC, Chengdu 611731, Sichuan, Peoples R China
[3] Univ Calif San Diego, Dept Math, La Jolla, CA 92093 USA
Keywords
Text categorization; Centroid-Based Classifier; Machine learning; Gravitation Model; ALGORITHM; SMOTE; CLASSIFIERS; FRAMEWORK;
DOI
10.1016/j.knosys.2017.08.020
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic text categorization has gained significant attention among researchers because of the increasing availability of online text information, and many different learning approaches have been designed for it. Among them, the Centroid-Based Classifier (CBC) is widely used because of its theoretical simplicity and computational efficiency. However, the classification accuracy of CBC depends greatly on the data distribution: when the distribution is highly skewed, CBC produces a poorly fitted model with poor classification performance. In this paper, a new classification model named the Gravitation Model (GM) is proposed to solve the class-imbalanced classification problem. In the training phase, each class is weighted by a mass factor, which can be learned from the training data, to indicate the data distribution of the corresponding class. In the testing phase, a new document is assigned to the class that exerts the maximum gravitational force on it. Comparisons with CBC and its variants, based on experiments conducted on twelve real datasets, show that the proposed gravitation model consistently outperforms both CBC and the Class-Feature-Centroid Classifier (CFC). It also achieves classification accuracy competitive with the DragPushing (DP) method while maintaining more stable performance. Thus, the proposed gravitation model is shown to be less prone to over-fitting and to have higher learning ability than the CBC model. © 2017 The Authors. Published by Elsevier B.V.
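The training and testing phases described in the abstract can be illustrated with a minimal sketch. The abstract gives neither the force formula nor the mass-learning rule, so both are assumptions here: the force is taken to be the class mass times the cosine similarity to the class centroid, and `learn_masses` uses a hypothetical error-driven update. All function names and the `epochs` and `step` parameters are illustrative and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def train_centroids(docs, labels, n_classes):
    """Plain CBC training: one unit-length mean vector per class."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for d, c in zip(docs, labels):
        counts[c] += 1
        for j, x in enumerate(d):
            sums[c][j] += x
    return [normalize([s / max(counts[c], 1) for s in sums[c]])
            for c in range(n_classes)]

def force(doc, centroid, mass):
    """ASSUMED force: class mass times cosine similarity to the centroid."""
    d = normalize(doc)
    return mass * sum(a * b for a, b in zip(d, centroid))

def predict(doc, centroids, masses):
    """Assign the document to the class exerting the largest force."""
    scores = [force(doc, c, m) for c, m in zip(centroids, masses)]
    return max(range(len(scores)), key=scores.__getitem__)

def learn_masses(docs, labels, centroids, epochs=20, step=0.05):
    """HYPOTHETICAL update rule: on each training error, raise the mass of
    the true class and lower the mass of the wrongly predicted class."""
    masses = [1.0] * len(centroids)
    for _ in range(epochs):
        errors = 0
        for d, t in zip(docs, labels):
            p = predict(d, centroids, masses)
            if p != t:
                masses[t] *= 1 + step
                masses[p] *= 1 - step
                errors += 1
        if errors == 0:  # every training document is classified correctly
            break
    return masses

# Toy imbalanced data: three documents in class 0, one in class 1.
docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
labels = [0, 0, 0, 1]
centroids = train_centroids(docs, labels, n_classes=2)
masses = learn_masses(docs, labels, centroids)
print(predict([0.1, 0.9], centroids, masses))
```

With all masses fixed at 1 the sketch reduces to an ordinary cosine-similarity centroid classifier. The masses only move away from 1 when the centroids alone misclassify training documents, which is one simple way to read "a mass factor learned from the training data".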
Pages: 15-26
Number of pages: 12
Related papers
50 in total
  • [1] A New Centroid-Based Classifier for Text Categorization
    Chen, Lifei
    Ye, Yanfang
    Jiang, Qingshan
    [J]. 2008 22ND INTERNATIONAL WORKSHOPS ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS, VOLS 1-3, 2008, : 1217 - +
  • [2] Class normalization in centroid-based text categorization
    Lertnattee, Verayuth
    Theeramunkong, Thanaruk
    [J]. INFORMATION SCIENCES, 2006, 176 (12) : 1712 - 1738
  • [3] A Framework of Centroid-Based Methods for Text Categorization
    Wang, Dandan
    Chen, Qingcai
    Wang, Xiaolong
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (02): : 245 - 254
  • [4] Effect of term distributions on centroid-based text categorization
    Lertnattee, V
    Theeramunkong, T
    [J]. INFORMATION SCIENCES, 2004, 158 : 89 - 115
  • [5] An improvement of centroid-based classification algorithm for text classification
    Cataltepe, Zehra
    Aygun, Eser
    [J]. 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1-2, 2007, : 952 - 956
  • [6] Supervised term weighting centroid-based classifiers for text categorization
    Nguyen, Tam T.
    Chang, Kuiyu
    Hui, Siu Cheung
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 35 (01) : 61 - 85
  • [7] Term-length normalization for centroid-based text categorization
    Lertnattee, V
    Theeramunkong, T
    [J]. KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 1, PROCEEDINGS, 2003, 2773 : 850 - 856
  • [8] Supervised term weighting centroid-based classifiers for text categorization
    Tam T. Nguyen
    Kuiyu Chang
    Siu Cheung Hui
    [J]. Knowledge and Information Systems, 2013, 35 : 61 - 85
  • [9] Combining homogeneous classifiers for centroid-based text classification
    Lertnattee, V
    Theeramunkong, T
    [J]. ISCC 2002: SEVENTH INTERNATIONAL SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS, PROCEEDINGS, 2002, : 1034 - 1039
  • [10] Analysis of inverse class frequency in centroid-based text classification
    Lertnattee, V
    Theeramunkong, T
    [J]. IEEE INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES 2004 (ISCIT 2004), PROCEEDINGS, VOLS 1 AND 2: SMART INFO-MEDIA SYSTEMS, 2004, : 1171 - 1176