Multilingual Text Categorization Using Character N-gram

Cited by: 0
Authors
Suzuki, Makoto [1]
Yamagishi, Naohide [1]
Tsai, Yi-Ching [2]
Hirasawa, Shigeichi [3]
Affiliations
[1] Shonan Inst Technol, Kanagawa 2518511, Japan
[2] Leader Univ, Tainan 70901, Taiwan
[3] Waseda Univ, Shinjyu Ku, Tokyo 1698555, Japan
Keywords
text mining; classification; N-gram; newspaper;
DOI
Not available
Chinese Library Classification (CLC) Number
TP31 [Computer Software];
Discipline Classification Code
081202 ; 0835 ;
Abstract
In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). It is a simple technique that adds up the ratios of term frequency among categories. FRAM places no restriction on which feature terms are used. In the present paper, we build on this characteristic of FRAM by adopting character N-grams as feature terms. Because character N-grams do not depend on the rules of grammar, the proposed method is language-independent, and texts in multiple languages can be classified into categories using a single program. We evaluate the proposed method in several experiments, classifying newspaper articles from the English Reuters-21578, Japanese CD-Mainichi 2002, and Chinese China Times 2005 collections using FRAM. The results show good classification accuracy: the recall of the proposed method is 87.8% for English, 86.0% for Japanese, and 72.8% for Chinese. Although Chinese classification accuracy in these experiments was markedly lower than for English and Japanese, the proposed method is language-independent, provides a new perspective, and has excellent potential.
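The abstract describes FRAM only in outline, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes that the ratio of a term for a category is its frequency in that category divided by its total frequency over all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names (`char_ngrams`, `train`, `classify`) and the default `n=3` are illustrative choices, not taken from the paper.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (no tokenization needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-category frequency ratios from (text, category) pairs.

    Assumed definition: ratio[c][g] = count of n-gram g in category c
    divided by the count of g over all categories.
    """
    freq = defaultdict(Counter)  # category -> n-gram -> count
    for text, category in labeled_docs:
        freq[category].update(char_ngrams(text, n))

    totals = Counter()  # n-gram -> count over all categories
    for counts in freq.values():
        totals.update(counts)

    return {
        category: {gram: count / totals[gram] for gram, count in counts.items()}
        for category, counts in freq.items()
    }


def classify(text, ratios, n=3):
    """Accumulate the ratios of the text's n-grams and pick the best category."""
    scores = {
        category: sum(table.get(gram, 0.0) for gram in char_ngrams(text, n))
        for category, table in ratios.items()
    }
    return max(scores, key=scores.get)
```

Because the features are raw character sequences, the same functions can be applied to English, Japanese, or Chinese text without a language-specific tokenizer, which is the property the abstract highlights.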
Pages: 49 / +
Page count: 2