ADtrees for sequential data and n-gram counting

被引:0
|
作者
Van Dam, Rob [1 ]
Ventura, Dan [1 ]
机构
[1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
关键词
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
引用
收藏
页码:766 / 771
页数:6
相关论文
共 50 条
  • [21] N-gram approach for gender prediction
    Reddy, T. Raghunadha
    Vardhan, B. Vishnu
    Reddy, P. Vijayapal
    [J]. 2017 7TH IEEE INTERNATIONAL ADVANCE COMPUTING CONFERENCE (IACC), 2017, : 860 - 865
  • [22] Distributing N-Gram Graphs for Classification
    Kontopoulos, Ioannis
    Giannakopoulos, George
    Varlamis, Iraklis
    [J]. NEW TRENDS IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2017, 2017, 767 : 3 - 11
  • [23] Classification of facemarks using N-gram
    Yamada, Thichi
    Tsuchiya, Seiji
    Kuroiwa, Shiongo
    Ren, Fuji
    [J]. PROCEEDINGS OF THE 2007 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (NLP-KE'07), 2007, : 322 - +
  • [24] On compressing n-gram language models
    Hirsimaki, Teemu
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 949 - 952
  • [25] Semantic N-Gram Topic Modeling
    Kherwa, Pooja
    Bansal, Poonam
    [J]. EAI ENDORSED TRANSACTIONS ON SCALABLE INFORMATION SYSTEMS, 2020, 7 (26) : 1 - 12
  • [26] N-gram Analysis of a Mongolian Text
    Altangerel, Khuder
    Tsend, Ganbat
    Jalsan, Khash-Erdene
    [J]. IFOST 2008: PROCEEDING OF THE THIRD INTERNATIONAL FORUM ON STRATEGIC TECHNOLOGIES, 2008, : 258 - 259
  • [27] Differentially Private n-gram Extraction
    Kim, Kunho
    Gopi, Sivakanth
    Kulkarni, Janardhan
    Yekhanin, Sergey
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [28] SEARCHING FOR TEXT - SEND AN N-GRAM
    KIMBRELL, RE
    [J]. BYTE, 1988, 13 (05): : 297 - &
  • [29] Uniquely decodable n-gram embeddings
    Kontorovich, L
    [J]. THEORETICAL COMPUTER SCIENCE, 2004, 329 (1-3) : 271 - 284
  • [30] Text mining with n-gram variables
    Schonlau, Matthias
    Guenther, Nick
    Sucholutsky, Ilia
    [J]. STATA JOURNAL, 2017, 17 (04): : 866 - 881