Adaptive Compression-based Models of Chinese Text

Authors:

Teahan, William J. [1]

Wu, Peiliang [1]

Liu, Wei [1]

Affiliation:

[1] Bangor Univ, Sch Comp Sci, Bangor, Gwynedd, Wales

Source:

2014 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING (ICALIP), VOLS 1-2 | 2014

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Large-alphabet languages such as Chinese pose different language-modelling problems from small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
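
As a rough illustration of what "adaptive" means in the abstract, the following Python sketch estimates a compression rate in bits per character with a simplified character-based, PPM-style model whose counts are updated as each character is processed. It is not the authors' implementation: the function name `adaptive_bits_per_char`, the default `order`, the `alphabet_size` of 65536, and the PPMC-style escape estimate are all assumptions made for illustration, and refinements such as symbol exclusion and update exclusion are left out.

```python
import math
from collections import defaultdict


def adaptive_bits_per_char(text, order=2, alphabet_size=65536):
    """Estimate bits per character under a simplified adaptive PPM-style model.

    Hypothetical sketch: escape method C estimates, no exclusions,
    and every context order is updated after each character.
    """
    # counts[context][symbol] = how often `symbol` has followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0

    for i, ch in enumerate(text):
        max_k = min(order, i)  # longest context available at position i
        # Try the longest context first and back off to shorter ones.
        for k in range(max_k, -1, -1):
            seen = counts[text[i - k:i]]
            n = sum(seen.values())  # total symbol count in this context
            d = len(seen)           # number of distinct symbols seen
            if ch in seen:
                # Symbol predicted in this context: p = count / (n + d)
                total_bits -= math.log2(seen[ch] / (n + d))
                break
            if n > 0:
                # Escape to a shorter context: p = d / (n + d)
                total_bits -= math.log2(d / (n + d))
            # A context that has never been seen escapes at no cost.
        else:
            # Novel character: fall back to a uniform code over the alphabet.
            total_bits += math.log2(alphabet_size)

        # Adapt: update the counts only after the character has been coded.
        for k in range(max_k + 1):
            counts[text[i - k:i]][ch] += 1

    return total_bits / len(text) if text else 0.0
```

Because the model starts empty and learns only from text it has already coded, highly repetitive input quickly becomes cheap to encode, while each previously unseen character costs the full `log2(alphabet_size)` bits. That fallback cost is one reason large-alphabet text behaves differently from small-alphabet text in a scheme of this kind.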

Pages: 874-881

Page count: 8
