A Theoretical Model for n-gram Distribution in Big Data Corpora

Cited by: 0
Authors
Silva, Joaquim F. [1 ]
Goncalves, Carlos [1 ,2 ]
Cunha, Jose C. [1 ]
Affiliations
[1] Univ Nova Lisboa, Fac Ciencias & Tecnol, NOVA Lab Comp Sci & Informat, P-2829516 Caparica, Portugal
[2] IPL, ISEL, P-1959007 Lisbon, Portugal
Keywords
n-gram Models; Big Data; Zipf-Mandelbrot Law; Poisson Distribution; Extraction of Relevant Expressions; HEAPS LAW; LANGUAGE;
DOI
not available
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline code
0812 ;
Abstract
There is a wide diversity of applications relying on the identification of sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach to determine the statistical distribution of n-grams, but they are usually constrained by corpora sizes, which for practical reasons fall far short of Big Data scales. However, Big Data sizes reveal behaviors that remain hidden at smaller scales and that matter to applications such as the extraction of relevant information from Web-scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in a corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, ..., 6-grams for any corpus size. The proposed model was validated on English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
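The general construction the abstract alludes to can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes a Zipf-Mandelbrot rank-frequency law p(r) ∝ 1/(r + β)^α and models each n-gram type's occurrence count in a corpus of N tokens as Poisson(N·p(r)), so the expected number of distinct types is the sum of the probabilities that each type occurs at least once. The vocabulary size and the parameter values used here are purely illustrative assumptions.

```python
import math

def zipf_mandelbrot_probs(vocab_size, alpha, beta):
    """Normalized Zipf-Mandelbrot probabilities p(r) ∝ 1 / (r + beta)^alpha
    for ranks r = 1..vocab_size."""
    weights = [1.0 / (r + beta) ** alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_distinct(corpus_size, probs):
    """Expected number of distinct types under a Poisson occurrence model:
    a type with probability p appears at least once in a corpus of
    corpus_size tokens with probability 1 - exp(-corpus_size * p)."""
    return sum(1.0 - math.exp(-corpus_size * p) for p in probs)

# Illustrative parameters (assumed, not taken from the paper):
probs = zipf_mandelbrot_probs(vocab_size=100_000, alpha=1.1, beta=2.7)
for n_tokens in (10_000, 1_000_000, 100_000_000):
    print(n_tokens, round(expected_distinct(n_tokens, probs)))
```

As the corpus grows, the expected distinct count rises toward the vocabulary size but with ever-diminishing increments, which is the kind of asymptotic saturation behavior the abstract refers to.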
Pages: 134-141
Page count: 8