A Theoretical Model for n-gram Distribution in Big Data Corpora

Cited by: 0
Authors
Silva, Joaquim F. [1 ]
Goncalves, Carlos [1 ,2 ]
Cunha, Jose C. [1 ]
Affiliations
[1] Univ Nova Lisboa, Fac Ciencias & Tecnol, NOVA Lab Comp Sci & Informat, P-2829516 Caparica, Portugal
[2] IPL, ISEL, P-1959007 Lisbon, Portugal
Keywords
n-gram Models; Big Data; Zipf-Mandelbrot Law; Poisson Distribution; Extraction of Relevant Expressions; HEAPS LAW; LANGUAGE;
DOI
not available
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline code
0812 ;
Abstract
There is a wide diversity of applications relying on the identification of sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach to determine the statistical distribution of n-grams, but they are usually constrained by corpora sizes, which for practical reasons fall far short of Big Data scales. However, Big Data sizes reveal behaviors that remain hidden at smaller scales and that matter to applications such as the extraction of relevant information from Web-scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in a corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, ..., 6-grams for any corpus size. The proposed model was validated on English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
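The general construction the abstract alludes to can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes a Zipf-Mandelbrot rank-frequency law p(r) ∝ 1/(r + β)^α and models each n-gram type's occurrence count in a corpus of N tokens as Poisson(N·p(r)), so the expected number of distinct types is the sum of the probabilities that each type occurs at least once. The vocabulary size and the parameter values used here are purely illustrative assumptions.

```python
import math

def zipf_mandelbrot_probs(vocab_size, alpha, beta):
    """Normalized Zipf-Mandelbrot probabilities p(r) ∝ 1 / (r + beta)^alpha
    for ranks r = 1..vocab_size."""
    weights = [1.0 / (r + beta) ** alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_distinct(corpus_size, probs):
    """Expected number of distinct types under a Poisson occurrence model:
    a type with probability p appears at least once in a corpus of
    corpus_size tokens with probability 1 - exp(-corpus_size * p)."""
    return sum(1.0 - math.exp(-corpus_size * p) for p in probs)

# Illustrative parameters (assumed, not taken from the paper):
probs = zipf_mandelbrot_probs(vocab_size=100_000, alpha=1.1, beta=2.7)
for n_tokens in (10_000, 1_000_000, 100_000_000):
    print(n_tokens, round(expected_distinct(n_tokens, probs)))
```

As the corpus grows, the expected distinct count rises toward the vocabulary size but with ever-diminishing increments, which is the kind of asymptotic saturation behavior the abstract refers to.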
Pages: 134-141
Page count: 8