ADtrees for sequential data and n-gram counting

Cited by: 0
Authors: Van Dam, Rob [1]; Ventura, Dan [1]
Affiliations: [1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
Keywords: none listed
DOI: not available
Chinese Library Classification: TP [Automation technology, computer technology]
Subject classification code: 0812
Abstract
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
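For orientation, the sketch below illustrates the two baselines that the abstract compares against: the naive approach, which stores every n-gram as an explicit key, and a traditional prefix tree, which shares storage across common prefixes. It does not implement the paper's ADtree adaptation. All names (count_ngrams_naive, NGramTrie, count_ngrams_trie) are illustrative and are not taken from the paper.

```python
from collections import Counter


def count_ngrams_naive(tokens, n):
    """Naive baseline: store every n-gram as an explicit dictionary key."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class NGramTrie:
    """Traditional prefix-tree (trie) baseline for n-gram counting.

    Shared prefixes are stored once; each node keeps the count of all
    n-grams that pass through it, so prefix counts come for free.
    """

    def __init__(self):
        self.count = 0
        self.children = {}

    def add(self, ngram):
        """Insert one n-gram (a sequence of tokens), updating counts."""
        node = self
        for token in ngram:
            node = node.children.setdefault(token, NGramTrie())
            node.count += 1

    def lookup(self, ngram):
        """Return the stored count for an n-gram (0 if unseen)."""
        node = self
        for token in ngram:
            node = node.children.get(token)
            if node is None:
                return 0
        return node.count


def count_ngrams_trie(tokens, n):
    """Slide a window of length n over the token stream and fill the trie."""
    trie = NGramTrie()
    for i in range(len(tokens) - n + 1):
        trie.add(tokens[i:i + n])
    return trie


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat sat".split()
    print(count_ngrams_naive(corpus, 3)[("the", "cat", "sat")])        # 2
    print(count_ngrams_trie(corpus, 3).lookup(("the", "cat", "sat")))  # 2
```

The prefix tree already avoids repeating shared prefixes; per the abstract, the ADtree adaptation for sequential data is reported to be significantly more efficient still.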
Pages: 766-771
Page count: 6
Related papers
50 records in total
  • [1] DERIN: A data extraction information and n-gram
    Lopes Figueiredo, Leandro Neiva
    de Assis, Guilherme Tavares
    Ferreira, Anderson A.
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (05) : 1120 - 1138
  • [2] N-gram Insight
    Prans, George
    [J]. AMERICAN SCIENTIST, 2011, 99 (05) : 356 - 357
  • [3] Efficient Data Structures for Massive N-Gram Datasets
    Pibiri, Giulio Ermanno
    Venturini, Rossano
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 615 - 624
  • [4] Arithmetic N-gram: an efficient data compression technique
    Hassan, Ali
    Javed, Sadaf
    Hussain, Sajjad
    Ahmad, Rizwan
    Qazi, Shams
    [J]. DISCOVER COMPUTING, 2024, 27 (01)
  • [5] Pseudo-Conventional N-Gram Representation of the Discriminative N-Gram Model for LVCSR
    Zhou, Zhengyu
    Meng, Helen
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (06) : 943 - 952
  • [6] N-gram MalGAN: Evading machine learning detection via feature n-gram
    Zhu, Enmin
    Zhang, Jianjie
    Yan, Jijie
    Chen, Kongyang
    Gao, Chongzhi
    [J]. DIGITAL COMMUNICATIONS AND NETWORKS, 2022, 8 (04) : 485 - 491
  • [7] Pipilika N-gram Viewer: An Efficient Large Scale N-gram Model for Bengali
    Ahmad, Adnan
    Talha, Mahbubur Rub
    Amin, Md. Ruhul
    Chowdhury, Farida
    [J]. 2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018
  • [8] N-gram similarity and distance
    Kondrak, Grzegorz
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2005, 3772 : 115 - 126
  • [9] Learning N-gram Language Models from Uncertain Data
    Kuznetsov, Vitaly
    Liao, Hank
    Mohri, Mehryar
    Riley, Michael
    Roark, Brian
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2323 - 2327