Constructing n-gram rules for natural language models through exploring the limitation of the Zipf–Mandelbrot law

Citations: 0
Author
Harry M. Chang
Affiliation
[1] AT&T Labs-Research
Source
Computing | 2011 / Vol. 91
Keywords
Zipf–Mandelbrot law; Natural language processing; n-gram statistical language models; Quantitative linguistics; 60; 62; 94
DOI
Not available
Abstract
The Zipf–Mandelbrot law is widely used to model power-law distributions over ranked data. One of its best-known applications is the linguistic analysis of the distribution of words ranked by their frequency in a text corpus. By exploring known limitations of the Zipf–Mandelbrot law in modeling actual linguistic data from different domains, in both printed media and online content, a new algorithm is developed to effectively construct n-gram rules for building the natural language (NL) models required for a human-to-computer interface. The construction of statistically oriented n-gram rules is based on a new computing algorithm that identifies the area of divergence between the Zipf–Mandelbrot curve and the actual frequency distribution of the ranked n-gram text tokens extracted from a large text corpus derived from the online electronic programming guide (EPG) for television shows and movies. Two empirical experiments were carried out to evaluate the EPG-specific language models created using the new algorithm in the context of NL-based information retrieval systems. The experimental results show the effectiveness of the algorithm for developing low-complexity concept models with high coverage of users' language for both typed and spoken queries when interacting with an NL-based EPG search interface.
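The comparison the abstract describes — a Zipf–Mandelbrot curve set against the observed frequency distribution of ranked tokens, with the divergence between the two measured rank by rank — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the parameter values `q` and `s`, and the choice to calibrate the constant `C` so that the curve matches the top-ranked frequency exactly, are assumptions made for the example.

```python
from collections import Counter

def zipf_mandelbrot(rank, C, q, s):
    """Frequency predicted by the Zipf-Mandelbrot law: f(r) = C / (r + q)^s."""
    return C / (rank + q) ** s

def ranked_frequencies(tokens):
    """Token frequencies sorted in descending order (rank 1 = most frequent)."""
    return sorted(Counter(tokens).values(), reverse=True)

def divergence_by_rank(freqs, q=2.7, s=1.0):
    """Relative divergence between each observed ranked frequency and the
    Zipf-Mandelbrot curve; C is calibrated so rank 1 matches exactly
    (an illustrative choice, not the paper's fitting procedure)."""
    C = freqs[0] * (1 + q) ** s
    return [abs(f - zipf_mandelbrot(r, C, q, s)) / f
            for r, f in enumerate(freqs, start=1)]

# Toy corpus standing in for the EPG-derived text corpus used in the paper.
corpus = "the cat sat on the mat the cat ran".split()
freqs = ranked_frequencies(corpus)      # descending frequency counts
div = divergence_by_rank(freqs)         # per-rank relative divergence
```

Ranks where `div` is large mark the region where the empirical distribution departs from the Zipf–Mandelbrot curve; in the paper, it is this area of divergence that drives the selection of n-gram rules.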
Pages: 241–264
Number of pages: 23