The textcat Package for n-Gram Based Text Categorization in R

Authors
Hornik, Kurt [1]
Mair, Patrick
Rauch, Johannes
Geiger, Wilhelm
Buchta, Christian
Feinerer, Ingo [2]
Affiliations
[1] WU Wirtschaftsuniv Wien, Inst Stat & Math, Dept Finance Accounting & Stat, A-1090 Vienna, Austria
[2] Vienna Univ Technol, Vienna, Austria
Source
JOURNAL OF STATISTICAL SOFTWARE | 2013 / Volume 52 / Issue 06
Keywords
text mining; text categorization; language identification; n-grams; textcat; R
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
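
The character n-gram idea behind the Cavnar and Trenkle (1994) approach can be illustrated with a minimal Python sketch. This is not the textcat package's R code or API, and it does not cover the reduced n-gram approach. The function names (`ngram_profile`, `out_of_place`, `categorize`), the underscore padding, the n-gram range of 1 to 5, the profile size `top_k=400`, and the two tiny training sentences are all assumptions made for illustration. Each category is summarized by its most frequent character n-grams in rank order, and a document is assigned to the category whose ranking differs least from its own.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"  # underscores mark word boundaries (assumed convention)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    counts[gram] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category get a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def categorize(text, category_profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))


# Toy training data (hypothetical; a real profile needs far more text per language)
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "german": "der schnelle braune fuchs jagt den faulen hund und die katze "
              "sitzt auf der matte mit den anderen tieren",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

if __name__ == "__main__":
    print(categorize("the dog and the cat", profiles))
    print(categorize("der hund und die katze", profiles))
```

Ranking n-grams rather than comparing raw frequencies keeps the distance insensitive to document length, which is one reason a rank-based measure suits short inputs in this sketch.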
Pages: 17
Related papers
  • [1] A variant of n-gram based language-independent text categorization
    Graovac, Jelena
    [J]. INTELLIGENT DATA ANALYSIS, 2014, 18 (04) : 677 - 695
  • [2] Chinese Text Categorization Using the Character N-gram
    Suzuki, Makoto
    Yamagishi, Naohide
    Tsai, Yi-Ching
    [J]. 2012 INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY AND ITS APPLICATIONS (ISITA 2012), 2012, : 722 - 726
  • [3] Multilingual Text Categorization Using Character N-gram
    Suzuki, Makoto
    Yamagishi, Naohide
    Tsai, Yi-Ching
    Hirasawa, Shigeichi
    [J]. 2008 IEEE CONFERENCE ON SOFT COMPUTING IN INDUSTRIAL APPLICATIONS SMCIA/08, 2009, : 49 - +
  • [4] An Evaluation of Character Level N-gram Termsets in Text Categorization
    Coban, Onder
    Ozel, Selma Ayse
    [J]. 2018 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND DATA PROCESSING (IDAP), 2018,
  • [5] A new type of feature - Loose N-gram feature in text categorization
    Zhang, Xian
    Zhu, Xiaoyan
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS, PT 1, PROCEEDINGS, 2007, 4477 : 378 - +
  • [6] A comparison of text-categorization methods applied to N-gram frequency statistics
    Berger, H
    Merkl, D
    [J]. AI 2004: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, 3339 : 998 - 1003
  • [7] Efficient n-gram construction for text categorization using feature selection techniques
    Garcia, Maximiliano
    Maldonado, Sebastian
    Vairetti, Carla
    [J]. INTELLIGENT DATA ANALYSIS, 2021, 25 (03) : 509 - 525
  • [8] Effects of Various Preprocessing Techniques to Turkish Text Categorization Using N-Gram Features
    Deniz, Ayca
    Kiziloz, Hakan Ezgi
    [J]. 2017 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2017, : 655 - 660
  • [9] N-gram Analysis of a Mongolian Text
    Altangerel, Khuder
    Tsend, Ganbat
    Jalsan, Khash-Erdene
    [J]. IFOST 2008: PROCEEDING OF THE THIRD INTERNATIONAL FORUM ON STRATEGIC TECHNOLOGIES, 2008, : 258 - 259
  • [10] SEARCHING FOR TEXT - SEND AN N-GRAM
    KIMBRELL, RE
    [J]. BYTE, 1988, 13 (05) : 297 - &