Development of the N-gram Model for Azerbaijani Language

Cited by: 0
Authors
Bannayeva, Aliya [1 ]
Aslanov, Mustafa [1 ]
Affiliations
[1] ADA Univ, Sch Informat & Technol, Baku, Azerbaijan
Keywords
N-grams; Markov Model; word prediction; Azerbaijani language;
DOI
10.1109/AICT50176.2020.9368645
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
This research presents a text prediction model for the Azerbaijani language. A parsed and cleaned dump of the Azerbaijani Wikipedia serves as the corpus for the language model; in total it contains more than a million distinct words and sentences, and over seven hundred million characters. The language model itself is a statistical n-gram model. N-grams are contiguous sequences of n words or characters drawn from a given sample of text or speech. A Markov chain is used to predict the next word: it conditions only on the probabilities of the word sequences within the n-grams rather than on the entire corpus, which simplifies the task and reduces computational overhead while still producing sensible results. In general, the larger the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams, and fivegrams are implemented. The model is evaluated intrinsically by computing its perplexity.
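The pipeline the abstract describes — counting n-grams over a corpus, predicting the next word from the most probable continuation of a context (the Markov assumption), and evaluating the model intrinsically via perplexity — can be sketched as follows. This is a minimal illustrative bigram sketch, not the paper's implementation; the function names and the add-one (Laplace) smoothing used in the perplexity computation are assumptions.

```python
from collections import Counter, defaultdict
import math

def train_ngrams(tokens, n):
    """Count, for each (n-1)-word context, how often each next word follows."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    """Markov-chain prediction: most probable next word given only the context."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def perplexity(counts, tokens, n, vocab_size):
    """Intrinsic evaluation: perplexity with add-one smoothing for unseen events."""
    log_prob, m = 0.0, 0
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        word = tokens[i + n - 1]
        followers = counts.get(context, Counter())
        p = (followers[word] + 1) / (sum(followers.values()) + vocab_size)
        log_prob += math.log(p)
        m += 1
    return math.exp(-log_prob / m)  # lower is better
```

Changing `n` from 2 to 3, 4, or 5 yields the trigram, quadgram, and fivegram variants mentioned in the abstract; larger contexts give more sensible predictions at the cost of sparser counts.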
Pages: 5