The textcat Package for n-Gram Based Text Categorization in R

Cited by: 0

Authors:

Hornik, Kurt [1]

Mair, Patrick

Rauch, Johannes

Geiger, Wilhelm

Buchta, Christian

Feinerer, Ingo [2]

Affiliations:

[1] WU Wirtschaftsuniv Wien, Inst Stat & Math, Dept Finance Accounting & Stat, A-1090 Vienna, Austria

[2] Vienna Univ Technol, Vienna, Austria

Source:

JOURNAL OF STATISTICAL SOFTWARE | 2013 / Vol. 52 / Issue 06

Keywords:

text mining; text categorization; language identification; n-grams; textcat; R;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
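The character n-gram idea mentioned in the abstract can be illustrated with a minimal sketch. It follows the general Cavnar and Trenkle scheme (ranked n-gram frequency profiles compared with an "out-of-place" rank distance) and is not the textcat package's code or API. It is written in Python rather than R, and every name in it (`ngram_profile`, `out_of_place`, `categorize`, the two toy training sentences, the profile size of 400) is an assumption made for illustration. The reduced n-gram approach from the paper is not shown.

```python
import re
from collections import Counter


def ngram_profile(text, n_max=5, size=400):
    """Return the `size` most frequent character n-grams, most frequent first.

    Words are lower-cased and padded with '_' so that word boundaries
    contribute to the n-grams (an illustrative choice).
    """
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    total = 0
    for r, gram in enumerate(doc_profile):
        total += abs(r - rank[gram]) if gram in rank else max_penalty
    return total


def categorize(text, profiles):
    """Return the category whose profile is closest to the profile of `text`."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda cat: out_of_place(doc, profiles[cat]))


# Toy training data (hypothetical); real profiles need far more text per language.
ENGLISH = "the quick brown fox jumps over the lazy dog and the cat sleeps in the house"
GERMAN = "der schnelle braune fuchs springt über den faulen hund und die katze schläft in dem haus"
profiles = {"english": ngram_profile(ENGLISH), "german": ngram_profile(GERMAN)}
```

Calling `categorize(some_text, profiles)` returns the key of the nearest profile. With training samples this small the result on unseen text is unreliable; the sketch only shows the mechanics of building and comparing profiles.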

Pages: 17

