Chinese Text Categorization Using the Character N-gram

Authors
Suzuki, Makoto [1]
Yamagishi, Naohide [1]
Tsai, Yi-Ching [2]
Affiliations
[1] Shonan Inst Technol, 1-1-25 Tsujido Nishikaigan, Fujisawa, Kanagawa 2518511, Japan
[2] Univ King Ning, Tainan 709, Taiwan
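Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract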
We previously proposed the accumulation method, which is a language-independent text classification method that is based on the character N-gram, and classified English, Japanese, and Korean text documents. The accumulation method does not depend on the language structure, because this method uses the character N-gram to form index terms. If text documents are expressed in Unicode, then the accumulation method can classify documents using the same algorithm. In the present paper, we classify Chinese text documents, which are newspaper articles from the People's Daily 2009-2010 data set. The highest macro-averaged F-measure of the proposed method was 92.6% for the People's Daily 2009-2010 data set. Thus, we obtain good results for the Chinese language. Moreover, we can construct a framework whereby the computer can automatically distinguish the difficulty of each document classification.
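The abstract names two building blocks that can be illustrated without guessing at the accumulation method itself, whose scoring rule is not described here: extracting character N-grams from Unicode text, and computing a macro-averaged F-measure. The following is a minimal sketch under stated assumptions. The function names `char_ngrams` and `macro_f_measure` are hypothetical, and the macro average is assumed to be the unweighted mean of per-class F1 scores, which may differ from the definition used in the paper.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character N-grams of a Unicode string.

    Python strings are sequences of Unicode code points, so the same
    slicing works for Chinese, Japanese, Korean, or English text
    without any word segmentation step.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def macro_f_measure(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = sorted(set(gold) | set(predicted))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative inputs only; not data from the paper.
bigrams = char_ngrams("人民日报", 2)
score = macro_f_measure(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Here `bigrams` holds the three overlapping two-character terms of the example string, which is the kind of index term a character N-gram classifier would count per category.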
Pages: 722-726
Number of pages: 5