Development of the N-gram Model for Azerbaijani Language

Cited by: 0
Authors
Bannayeva, Aliya [1 ]
Aslanov, Mustafa [1 ]
Affiliations
[1] ADA Univ, Sch Informat & Technol, Baku, Azerbaijan
Keywords
N-grams; Markov Model; word prediction; Azerbaijani language
DOI
10.1109/AICT50176.2020.9368645
CLC Number
TP301 [Theory, Methods];
Subject Classification Code
081202;
Abstract
This research focuses on a text prediction model for the Azerbaijani language. A parsed and cleaned dump of the Azerbaijani Wikipedia is used as the corpus for the language model; in total it contains more than a million distinct words and sentences, and over seven hundred million characters. The language model itself is a statistical n-gram model. N-grams are contiguous sequences of n words or characters drawn from a given sample of text or speech. A Markov chain is used to predict the next word: it conditions the probability of the next word only on the preceding words within the n-gram, rather than on the entire corpus. This simplifies the task and reduces computational overhead while still producing sensible results. Logically, the higher the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams, and fivegrams are implemented. The model is evaluated intrinsically, by computing its perplexity.
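
To make the approach concrete, here is a minimal Python sketch of the pipeline the abstract describes: count n-grams, predict the next word with a Markov chain, and score the model intrinsically by perplexity. The function names, the toy stand-in corpus, and the add-k smoothing for unseen events are illustrative assumptions, not the authors' published code.

import math
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    # Map each (n-1)-word context to a Counter of observed next words.
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, context):
    # Markov-chain prediction: the next word depends only on the
    # immediately preceding (n-1)-word context, not the whole corpus.
    counts = model.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

def perplexity(model, tokens, n, vocab_size, k=1.0):
    # Intrinsic evaluation via perplexity; add-k smoothing for unseen
    # events is an assumption here, as the abstract names no scheme.
    log_prob, words = 0.0, 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts = model.get(context, Counter())
        p = (counts[tokens[i]] + k) / (sum(counts.values()) + k * vocab_size)
        log_prob += math.log(p)
        words += 1
    return math.exp(-log_prob / words)

# Toy usage with stand-in tokens; a real run would tokenize the cleaned
# Azerbaijani Wikipedia dump described in the abstract instead.
tokens = "bu bir test cumlesi bu bir numune metni".split()
model = build_ngram_model(tokens, n=2)
print(predict_next(model, ["bu"]))                          # -> 'bir'
print(perplexity(model, tokens, n=2, vocab_size=len(set(tokens))))

Raising n from 2 to 5 in build_ngram_model gives the bigram-through-fivegram variants the abstract mentions; larger n generally needs more data or heavier smoothing, since more contexts go unseen.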
Pages: 5
Related Papers
50 records in total
  • [41] Efficient MDI Adaptation for n-gram Language Models
    Huang, Ruizhe
    Li, Ke
    Arora, Ashish
    Povey, Daniel
    Khudanpur, Sanjeev
    [J]. INTERSPEECH 2020, 2020, : 4916 - 4920
  • [42] Improved N-gram Phonotactic Models For Language Recognition
    BenZeghiba, Mohamed Faouzi
    Gauvain, Jean-Luc
    Lamel, Lori
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2718 - 2721
  • [43] Content Development Using N-gram Model in Custom Writing Style
    Dhar, Joydip
    Gandhi, Vipul
    [J]. 2016 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (INCITE) - NEXT GENERATION IT SUMMIT ON THE THEME - INTERNET OF THINGS: CONNECT YOUR WORLDS, 2016,
  • [44] Bugram: Bug Detection with N-gram Language Models
    Wang, Song
    Chollak, Devin
    Movshovitz-Attias, Dana
    Tan, Lin
    [J]. 2016 31ST IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE), 2016, : 708 - 719
  • [45] N-gram language models for document image decoding
    Kopec, GE
    Said, MR
    Popat, K
    [J]. DOCUMENT RECOGNITION AND RETRIEVAL IX, 2002, 4670 : 191 - 202
  • [46] Multilingual stochastic n-gram class language models
    Jardino, M
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 161 - 163
  • [47] Constrained Discriminative Training of N-gram Language Models
    Rastrow, Ariya
    Sethy, Abhinav
    Ramabhadran, Bhuvana
[J]. 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), 2009, : 311+
  • [48] POWER LAW DISCOUNTING FOR N-GRAM LANGUAGE MODELS
    Huang, Songfang
    Renals, Steve
    [J]. 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 5178 - 5181
  • [49] Active Learning for Language Identification with N-gram Technique
Feng, Yuxin
    [J]. 2021 2ND INTERNATIONAL CONFERENCE ON BIG DATA & ARTIFICIAL INTELLIGENCE & SOFTWARE ENGINEERING (ICBASE 2021), 2021, : 560 - 564
  • [50] Fast language model look-ahead algorithm using extended N-gram model
    Shan, Yu-Xiang
    Chen, Xie
    Shi, Yong-Zhe
    Liu, Jia
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2012, 38 (10): : 1618 - 1626