Eliminating high-degree biased character bigrams for dimensionality reduction in Chinese text categorization

Authors
Xue, DJ [1]
Sun, MS [1]
Affiliation
[1] Tsinghua Univ, Dept Comp Sci & Technol, Natl Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
High dimensionality of the feature space is a major obstacle for Text Categorization (TC). In a candidate feature set consisting of Chinese character bigrams, there are a number of bigrams that are high-degree biased according to character frequencies. These bigrams usually survive feature selection because they discriminate strongly between documents. However, most of them are useless for document categorization because they represent document content poorly. The paper first defines a criterion to identify the high-degree biased Chinese bigrams. Then, two schemes called sigma-BR1 and sigma-BR2 are proposed to deal with these bigrams: the former directly eliminates them from the feature set, whereas the latter replaces them with the corresponding significant characters involved. Experimental results show that the high-degree biased bigrams should be eliminated from the feature set, and that the sigma-BR1 scheme is quite effective for further dimensionality reduction in Chinese text categorization after a feature selection process with a Chi-CIG score function.
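The two schemes can be illustrated with a minimal Python sketch. The abstract does not state the paper's criterion, so everything specific here is an assumption made for illustration: a bigram is treated as high-degree biased when the corpus frequency of one of its characters is at least `sigma` times that of the other, and the rarer character is taken to be the "significant" one. The function names `is_biased`, `significant_char`, `br1`, and `br2` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def is_biased(bigram, char_freq, sigma):
    """ASSUMED criterion (not from the paper): the two characters'
    corpus frequencies differ by a factor of at least sigma."""
    a, b = char_freq[bigram[0]], char_freq[bigram[1]]
    lo, hi = min(a, b), max(a, b)
    return lo > 0 and hi / lo >= sigma


def significant_char(bigram, char_freq):
    """ASSUMED: the rarer character is the 'significant' one."""
    return min(bigram, key=lambda c: char_freq[c])


def br1(features, char_freq, sigma):
    """BR1-style step: drop every biased bigram from the feature list."""
    return [f for f in features if not is_biased(f, char_freq, sigma)]


def br2(features, char_freq, sigma):
    """BR2-style step: replace each biased bigram with its significant
    character, keeping first occurrences only."""
    out, seen = [], set()
    for f in features:
        g = significant_char(f, char_freq) if is_biased(f, char_freq, sigma) else f
        if g not in seen:
            seen.add(g)
            out.append(g)
    return out


# Toy character frequencies, invented purely for this example.
char_freq = Counter({"的": 100, "猫": 2, "狗": 3})
features = ["的猫", "猫狗", "的狗"]
print(br1(features, char_freq, sigma=10))
print(br2(features, char_freq, sigma=10))
```

In this toy setting the elimination variant shrinks the feature list, while the replacement variant swaps each biased bigram for a single character, so it reduces dimensionality only when several biased bigrams share a significant character.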
Pages: 197-208
Page count: 12