Adaptive Compression-based Models of Chinese Text

Authors
Teahan, William J. [1]
Wu, Peiliang [1]
Liu, Wei [1]
Affiliations
[1] Bangor Univ, Sch Comp Sci, Bangor, Gwynedd, Wales
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
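To illustrate the adaptive idea mentioned in the abstract, the following is a minimal sketch of a character-based, PPM-style model that estimates the code length of a text while learning from it symbol by symbol. It is not the authors' implementation: the function name `ppm_code_length`, the default `max_order=2`, the uniform fallback over an assumed alphabet of 65536 symbols, and the use of method C escape estimates without exclusions are all assumptions made for this example.

```python
import math
from collections import defaultdict


def ppm_code_length(text, max_order=2, alphabet_size=65536):
    """Estimate the bits needed to code `text` with a simplified adaptive
    PPM-style model (method C escape estimates, no exclusions).

    Hypothetical helper for illustration only; alphabet_size is an assumed
    bound on the number of distinct characters.
    """
    # counts[context][symbol] = number of times symbol followed context so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0

    for i, sym in enumerate(text):
        prob = 1.0
        coded = False
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            ctx = text[i - order:i]
            seen = counts.get(ctx)
            if not seen:
                continue  # context never seen: back off at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if sym in seen:
                prob *= seen[sym] / (total + distinct)
                coded = True
                break
            # Symbol is novel in this context: pay for an escape.
            prob *= distinct / (total + distinct)
        if not coded:
            # Final fallback: uniform distribution over the assumed alphabet.
            prob *= 1.0 / alphabet_size
        total_bits += -math.log2(prob)

        # Adaptive step: update the counts only after the symbol is coded.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1

    return total_bits


sample = "中文" * 50
bits_per_char = ppm_code_length(sample) / len(sample)
print(f"{bits_per_char:.2f} bits per character")
```

Because the counts are updated only after each symbol is coded, a decoder could rebuild the same model in step, which is what makes the scheme adaptive. Dividing the total by the text length gives a bits-per-character figure, the usual way compression rates are compared.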
Pages: 874-881
Number of pages: 8
Related papers
50 in total
  • [1] Statistical Compression-Based Models for Text Classification
    Saikrishna, Vidya
    Dowe, David L.
    Ray, Sid
    [J]. 2016 FIFTH INTERNATIONAL CONFERENCE ON ECO-FRIENDLY COMPUTING AND COMMUNICATION SYSTEMS (ICECCS), 2016, : 1 - 6
  • [2] On compression-based text classification
    Marton, Y
    Wu, N
    Hellerstein, L
    [J]. ADVANCES IN INFORMATION RETRIEVAL, 2005, 3408 : 300 - 314
  • [3] A compression-based text steganography method
    Satir, Esra
    Isik, Hakan
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2012, 85 (10) : 2385 - 2394
  • [4] Compression-Based Arabic Text Classification
    Ta'amneh, Haneen
    Abu Keshek, Ehsan
    Issa, Manar Bani
    Al-Ayyoub, Mahmoud
    Jararweh, Yaser
    [J]. 2014 IEEE/ACS 11TH INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2014, : 594 - 600
  • [5] A compression-based algorithm for Chinese word segmentation
    Teahan, WJ
    Wen, YY
    McNab, R
    Witten, IH
    [J]. COMPUTATIONAL LINGUISTICS, 2000, 26 (03) : 375 - 393
  • [6] Relevance of Contextual Information in Compression-Based Text Clustering
    Granados, Ana
    Martinez, Rafael
    Camacho, David
    de Borja Rodriguez, Francisco
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2010, 2010, 6283 : 259 - 266
  • [7] Text Classification Using Compression-Based Dissimilarity Measures
    Coutinho, David Pereira
    Figueiredo, Mario A. T.
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2015, 29 (05)
  • [8] Adult Content Filtering through Compression-Based Text Classification
    Santos, Igor
    Galan-Garcia, Patxi
    Santamaria-Ibirika, Aitor
    Alonso-Isla, Borja
    Alabau-Sarasola, Iker
    Garcia Bringas, Pablo
    [J]. INTERNATIONAL JOINT CONFERENCE CISIS'12 - ICEUTE'12 - SOCO'12 SPECIAL SESSIONS, 2013, 189 : 281 - 288
  • [9] A Compression-Based Toolkit for Modelling and Processing Natural Language Text
    Teahan, William John
    [J]. INFORMATION, 2018, 9 (12)
  • [10] Compression-Based Document Length Prior for Language Models
    Parapar, Javier
    Losada, David E.
    Barreiro, Alvaro
    [J]. PROCEEDINGS 32ND ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2009, : 652 - 653