Automatic Chinese Text Categorization System Based on Mutual Information

Authors
Lu, Zhimao [1]
Shi, Hong [1]
Zhang, Qi [1]
Yuan, Chaoyue [1]
Affiliation
[1] Harbin Engineering University, College of Information and Communication Engineering, Harbin, Heilongjiang Province, People's Republic of China
Keywords
Automatic Text Categorization; Feature Selection; Mutual Information; KNN; SVM
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Feature selection is a key step in an automatic text categorization system and has a significant impact on the classification result. In this paper we study mutual information (MI), a basic feature selection method. First, by analyzing the MI formula theoretically and systematically, we identify three main problems: the MI loss, the information difference among categories, and the excessive emphasis on low-frequency terms. Then, to address these three problems, we propose an improved feature selection method that calculates the absolute values of MI and the difference between the maximum and the average of MI. Finally, we test our method with a K-Nearest Neighbor (KNN) classifier and a Support Vector Machine (SVM) classifier, and compare it with the original method on a Chinese corpus. The results demonstrate the effectiveness and feasibility of the proposed method.
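The abstract names the two ingredients of the improved score (absolute MI values, and the gap between the maximum and the average MI across categories) but not the exact formula. The Python sketch below is one possible reading, not the authors' implementation. The function names, the document-frequency estimate of pointwise MI, the zero value assigned when a term never occurs in a category, and the way the two ingredients are combined are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def mi_scores(docs, labels):
    """Return {term: {category: pointwise MI}} from tokenized documents.

    MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from document counts.
    """
    n = len(docs)
    cat_count = Counter(labels)      # documents per category
    term_count = Counter()           # documents containing each term
    joint = defaultdict(Counter)     # joint[term][category] = document count
    for tokens, cat in zip(docs, labels):
        for t in set(tokens):
            term_count[t] += 1
            joint[t][cat] += 1
    scores = {}
    for t in term_count:
        scores[t] = {}
        for c in cat_count:
            a = joint[t][c]
            # Assumption of this sketch: MI is set to 0 when the term never
            # occurs in the category, to avoid log(0).
            scores[t][c] = (
                math.log(a * n / (term_count[t] * cat_count[c])) if a else 0.0
            )
    return scores


def improved_score(per_category_mi):
    """Hypothetical combination: max |MI| minus mean |MI| over categories."""
    abs_vals = [abs(v) for v in per_category_mi.values()]
    return max(abs_vals) - sum(abs_vals) / len(abs_vals)


def select_features(docs, labels, k):
    """Return the k terms with the highest improved score (ties by term)."""
    scores = mi_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-improved_score(scores[t]), t))
    return ranked[:k]
```

Taking absolute values keeps terms whose MI with a category is strongly negative, and subtracting the average favors terms whose association is concentrated in one category rather than spread evenly. The selected terms would then feed a KNN or SVM classifier, as in the experiments the abstract describes.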
Pages: 4986-4990
Number of pages: 5