ADtrees for sequential data and n-gram counting

Cited by: 0
Authors: Van Dam, Rob [1]; Ventura, Dan [1]
Affiliations: [1] Brigham Young Univ, Dept Comp Sci, Provo, UT 84602 USA
Keywords: none listed
DOI: not available
Chinese Library Classification: TP [Automation technology, computer technology]
Subject classification code: 0812
Abstract
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
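For orientation, the sketch below illustrates the two baselines that the abstract compares against: the naive approach, which stores every n-gram as an explicit key, and a traditional prefix tree, which shares storage across common prefixes. It does not implement the paper's ADtree adaptation. All names (count_ngrams_naive, NGramTrie, count_ngrams_trie) are illustrative and are not taken from the paper.

```python
from collections import Counter


def count_ngrams_naive(tokens, n):
    """Naive baseline: store every n-gram as an explicit dictionary key."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class NGramTrie:
    """Traditional prefix-tree (trie) baseline for n-gram counting.

    Shared prefixes are stored once; each node keeps the count of all
    n-grams that pass through it, so prefix counts come for free.
    """

    def __init__(self):
        self.count = 0
        self.children = {}

    def add(self, ngram):
        """Insert one n-gram (a sequence of tokens), updating counts."""
        node = self
        for token in ngram:
            node = node.children.setdefault(token, NGramTrie())
            node.count += 1

    def lookup(self, ngram):
        """Return the stored count for an n-gram (0 if unseen)."""
        node = self
        for token in ngram:
            node = node.children.get(token)
            if node is None:
                return 0
        return node.count


def count_ngrams_trie(tokens, n):
    """Slide a window of length n over the token stream and fill the trie."""
    trie = NGramTrie()
    for i in range(len(tokens) - n + 1):
        trie.add(tokens[i:i + n])
    return trie


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat sat".split()
    print(count_ngrams_naive(corpus, 3)[("the", "cat", "sat")])        # 2
    print(count_ngrams_trie(corpus, 3).lookup(("the", "cat", "sat")))  # 2
```

The prefix tree already avoids repeating shared prefixes; per the abstract, the ADtree adaptation for sequential data is reported to be significantly more efficient still.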
Pages: 766-771
Page count: 6
Related papers
50 records in total
  • [1] DERIN: A data extraction information and n-gram
    Lopes Figueiredo, Leandro Neiva
    de Assis, Guilherme Tavares
    Ferreira, Anderson A.
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (05) : 1120 - 1138
  • [2] N-gram Insight
    Prans, George
    [J]. AMERICAN SCIENTIST, 2011, 99 (05) : 356 - 357
  • [3] Efficient Data Structures for Massive N-Gram Datasets
    Pibiri, Giulio Ermanno
    Venturini, Rossano
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 615 - 624
  • [4] Arithmetic N-gram: an efficient data compression technique
    Hassan, Ali
    Javed, Sadaf
    Hussain, Sajjad
    Ahmad, Rizwan
    Qazi, Shams
    [J]. DISCOVER COMPUTING, 2024, 27 (01)
  • [5] Pseudo-Conventional N-Gram Representation of the Discriminative N-Gram Model for LVCSR
    Zhou, Zhengyu
    Meng, Helen
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (06) : 943 - 952
  • [6] N-gram MalGAN: Evading machine learning detection via feature n-gram
    Zhu, Enmin
    Zhang, Jianjie
    Yan, Jijie
    Chen, Kongyang
    Gao, Chongzhi
    [J]. DIGITAL COMMUNICATIONS AND NETWORKS, 2022, 8 (04) : 485 - 491
  • [7] Pipilika N-gram Viewer: An Efficient Large Scale N-gram Model for Bengali
    Ahmad, Adnan
    Talha, Mahbubur Rub
    Amin, Md. Ruhul
    Chowdhury, Farida
    [J]. 2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018
  • [8] N-gram similarity and distance
    Kondrak, Grzegorz
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2005, 3772 : 115 - 126
  • [9] Learning N-gram Language Models from Uncertain Data
    Kuznetsov, Vitaly
    Liao, Hank
    Mohri, Mehryar
    Riley, Michael
    Roark, Brian
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2323 - 2327