Automatic Chinese Text Categorization System Based on Mutual Information

Cited by: 0
Authors
Lu, Zhimao [1]
Shi, Hong [1]
Zhang, Qi [1]
Yuan, Chaoyue [1]
Affiliations
[1] Harbin Engineering University, College of Information and Communication Engineering, Harbin, Heilongjiang Province, People's Republic of China
Keywords
Automatic Text Categorization; Feature Selection; Mutual Information; KNN; SVM
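DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812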
Abstract
Feature selection is a key step in an automatic text categorization system and has a significant impact on classification results. This paper studies mutual information (MI), a basic feature selection method. First, by analyzing the MI formula theoretically and systematically, we identify three main problems: MI loss, the information difference among categories, and excessive emphasis on low-frequency terms. Then, to address these three problems, we propose an improved feature selection method that computes the absolute values of MI and the difference between the maximum and the average MI. Finally, we test our method with K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers and compare it with the original method on a Chinese corpus. The results demonstrate the effectiveness and feasibility of the proposed method.
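The abstract does not give the formulas, so the following is only a minimal sketch of one possible reading of it, not the authors' implementation. It assumes the commonly used document-frequency estimate of pointwise MI between a term and a category, assumes the text has already been segmented into tokens, and uses a hypothetical smoothing constant `smooth` to avoid taking the logarithm of zero. The function names (`mi_table`, `original_score`, `improved_score`, `select_features`) are illustrative. The sketch does not attempt to reproduce how the paper treats low-frequency terms.

```python
import math
from collections import Counter, defaultdict


def mi_table(docs, labels, smooth=0.5):
    """Estimate MI(term, category) from document frequencies.

    docs   : list of token lists (text assumed to be segmented already)
    labels : list of category names, one per document
    smooth : hypothetical additive constant on the joint count (assumption)
    Returns {term: {category: mi_value}}.
    """
    n_docs = len(docs)
    cat_df = Counter(labels)            # documents per category
    term_df = Counter()                 # documents containing each term
    joint_df = defaultdict(Counter)     # documents per (term, category)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            term_df[term] += 1
            joint_df[term][label] += 1

    table = {}
    for term, df in term_df.items():
        # MI(t, c) ~ log( P(t, c) / (P(t) P(c)) ), estimated with counts
        table[term] = {
            cat: math.log((joint_df[term][cat] + smooth) * n_docs / (df * n_cat))
            for cat, n_cat in cat_df.items()
        }
    return table


def original_score(mi_by_cat):
    """Common baseline: the maximum MI of the term over all categories."""
    return max(mi_by_cat.values())


def improved_score(mi_by_cat):
    """Assumed reading of the abstract: take absolute MI values, then use
    the difference between their maximum and their average."""
    abs_mi = [abs(v) for v in mi_by_cat.values()]
    return max(abs_mi) - sum(abs_mi) / len(abs_mi)


def select_features(docs, labels, k, score=improved_score):
    """Return the k terms with the highest score."""
    table = mi_table(docs, labels)
    ranked = sorted(table, key=lambda t: score(table[t]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    docs = [["football", "match"], ["football", "goal"],
            ["stock", "market"], ["stock", "match"]]
    labels = ["sports", "sports", "finance", "finance"]
    print(select_features(docs, labels, k=3))
```

Under this reading, a term spread evenly across categories (here "match") receives a score of zero, because its maximum and average absolute MI coincide, whereas a term concentrated in one category scores higher. Passing `score=original_score` gives the baseline ranking for comparison.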
Pages: 4986-4990
Page count: 5
Related papers
50 in total
  • [1] Research on Chinese Text Automatic Categorization Based on VSM
    Tong Xiao-Jun
    Cui Ming-Gen
    Song Guo-Long
    [J]. 2007 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-15, 2007, : 3863 - +
  • [2] The Improvement Research of Mutual Information Algorithm for Text Categorization
    Kai, Lu
    Li, Chen
    [J]. KNOWLEDGE ENGINEERING AND MANAGEMENT, ISKE 2013, 2014, 278 : 225 - 232
  • [3] An enhanced text categorization method based on improved text frequency approach and mutual information algorithm
    Pei Zhili
    Shi Xiaohu
    Marchese, Maurizio
    Liang Yanchun
    [J]. PROGRESS IN NATURAL SCIENCE-MATERIALS INTERNATIONAL, 2007, 17 (12) : 1494 - 1500
  • [4] An enhanced text categorization method based on improved text frequency approach and mutual information algorithm
    Maurizio Marchese
    [J]. Progress in Natural Science:Materials International, 2007, (12) : 1494 - 1500
  • [5] Text Categorization Method Based on Improved Mutual Information and Characteristic Weights Evaluation Algorithms
    Pei, Zhili
    Shi, Xiaohu
    Marchese, Maurizio
    Liang, Yanchun
    [J]. FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 4, PROCEEDINGS, 2007, : 87 - +
  • [6] Automatic text categorization based on angle distribution
    Liu, T
    Guo, J
    [J]. Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Vols 1-9, 2005, : 3797 - 3801
  • [7] Automatic Category Structure Generation and Categorization of Chinese Text Documents
    Yang, Hsin-Chang
    Lee, Chung-Hong
    [J]. LECTURE NOTES IN COMPUTER SCIENCE, 2000, 1910 : 673 - 678
  • [8] Research on Enhancing the Effectiveness of the Chinese Text Automatic Categorization Based on ICTCLAS Segmentation Method
    Li, Xiangdong
    Zhang, Cheng
    [J]. PROCEEDINGS OF 2013 IEEE 4TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 2012, : 267 - 270
  • [9] Weighted average pointwise mutual information for feature selection in text categorization
    Schneider, KM
    [J]. KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 252 - 263
  • [10] An automatic indexing of compound words based on mutual information for Korean text retrieval
    Kim, PK
    [J]. LIBRARY AND INFORMATION SCIENCE, 1995, (34) : 29 - 38