n-Gram-Based Text Compression

Cited by: 9
Authors
Nguyen, Vu H. [1]
Nguyen, Hien T. [1]
Duong, Hieu N. [2]
Snasel, Vaclav [3]
Affiliations
[1] Ton Duc Thang Univ, Fac Informat Technol, Ho Chi Minh City, Vietnam
[2] Ho Chi Minh City Univ Technol, Fac Comp Sci & Engn, Ho Chi Minh City, Vietnam
[3] VSB Tech Univ Ostrava, Fac Elect Engn & Comp Sci, Ostrava, Czech Republic
DOI
10.1155/2016/9483646
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. Its compression ratio compares favorably with those of state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies and built n-gram dictionaries, from unigrams to five-grams, with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental
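The dictionary-based encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy longest-match strategy, the fallback to unigrams, the `(n, index)` code representation, and the names `build_dictionaries`, `encode`, and `decode` are all assumptions made for illustration. Packing each code into two to four bytes, as the abstract mentions, is omitted because the record does not give the byte layout.

```python
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]
Dictionaries = Dict[int, Dict[NGram, int]]


def build_dictionaries(corpus: List[str], max_n: int = 5) -> Dictionaries:
    """Assign an integer index to every n-gram (n = 1..max_n) seen in the corpus."""
    dicts: Dictionaries = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # setdefault keeps the first index assigned to a repeated n-gram
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dictionaries, max_n: int = 5) -> List[Tuple[int, int]]:
    """Greedy longest-match encoding (an assumed strategy, not the paper's).

    At each position, try the largest window first (five-gram) and shrink
    down to a unigram until a dictionary entry is found.
    """
    words = text.split()
    codes: List[Tuple[int, int]] = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            gram = tuple(words[i:i + n])
            if gram in dicts[n]:
                codes.append((n, dicts[n][gram]))
                i += n
                break
        else:
            # No dictionary covers this word; a real codec would need an escape code
            raise KeyError(f"word not in any dictionary: {words[i]!r}")
    return codes


def decode(codes: List[Tuple[int, int]], dicts: Dictionaries) -> str:
    """Invert encode() by looking each (n, index) pair up in the reversed dictionary."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for n, idx in codes:
        words.extend(inverse[n][idx])
    return " ".join(words)


# Toy corpus with unaccented placeholder words (hypothetical data)
dicts = build_dictionaries(["toi di hoc o truong dai hoc"])
codes = encode("dai hoc toi di", dicts)
restored = decode(codes, dicts)
```

In this toy run the input is covered by two bigram entries, so four words become two codes. The round trip only holds for whitespace-normalized text, since `decode` joins words with single spaces.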
Pages: 11