Bangla Word Clustering Based on N-gram Language Model

Cited: 0
Authors:
Ismail, Sabir [1]
Rahman, M. Shahidur [1]
Affiliations:
[1] Shahjalal Univ Sci & Technol, Dept Comp Sci & Engn, Sylhet, Bangladesh
Keywords:
word cluster; information retrieval; natural language processing; machine learning; n-gram model;
DOI:
None available
Chinese Library Classification (CLC):
TP301 [Theory, Methods];
Discipline code:
081202 ;
Abstract
In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checking, grammar checking, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerization of Bangla language processing began long ago, but it is still at a nascent stage and suffers from resource scarcity. We propose an unsupervised machine learning technique that develops Bangla word clusters based on their semantic and contextual similarity using an N-gram language model. Under the N-gram model, a word can be predicted from the sequence of words preceding and following it. The N-gram model has been applied successfully to word clustering in English and several other languages. Since word clustering is a new dimension in Bangla language processing research, we regard this approach as a sound starting point, and our results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
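The paper does not give its implementation, but the idea sketched in the abstract, characterizing each word by the words that occur immediately before and after it, then grouping words with similar contexts, can be illustrated as follows. This is a minimal, hypothetical sketch: the toy English corpus, the cosine-similarity measure, the greedy clustering pass, and the `THRESHOLD` value are all illustrative assumptions, not the authors' actual method or data.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus standing in for the large Bangla corpus used in the paper
# (English tokens here purely for illustration).
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran on a mat . a dog ran on a rug .").split()

# Build a context vector for each word from its previous and next
# neighbours, i.e. a bigram-style context as the abstract describes.
contexts = defaultdict(Counter)
for i, w in enumerate(corpus):
    if i > 0:
        contexts[w][("prev", corpus[i - 1])] += 1
    if i < len(corpus) - 1:
        contexts[w][("next", corpus[i + 1])] += 1

def cosine(a, b):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Greedy single-pass clustering (an illustrative choice, not the
# paper's algorithm): put each word into the first cluster whose seed
# it resembles closely enough, otherwise start a new cluster.
THRESHOLD = 0.6  # assumed cutoff for "similar context"
clusters = []    # list of (seed_word, member_list) pairs
for w in contexts:
    for seed, members in clusters:
        if cosine(contexts[w], contexts[seed]) >= THRESHOLD:
            members.append(w)
            break
    else:
        clusters.append((w, [w]))

for seed, members in clusters:
    print(members)
```

On this toy corpus, words that play the same contextual role (e.g. "cat"/"dog", "mat"/"rug") end up in the same cluster, which mirrors the kind of semantic grouping the abstract reports.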
Pages: 5
Related Papers
50 records in total
  • [21] Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji
    Hamarashid, Hozan K.
    Saeed, Soran A.
    Rashid, Tarik A.
    [J]. NEURAL COMPUTING & APPLICATIONS, 2021, 33 (09): : 4547 - 4566
  • [23] Combination of Random Indexing based Language Model and N-gram Language Model for Speech Recognition
    Fohr, Dominique
    Mella, Odile
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2231 - 2235
  • [24] Word clustering based on similarity and vari-gram language model
    Yuan, LC
    Zhong, YX
    [J]. ICCC2004: Proceedings of the 16th International Conference on Computer Communication, Vols 1 and 2, 2004, : 1222 - 1226
  • [25] Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model
    Tian, Jinchuan
    Yu, Jianwei
    Weng, Chao
    Zou, Yuexian
    Yu, Dong
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 812 - 816
  • [26] Modified Chinese N-gram statistical language model
    Tian, Bin
    Tian, Hongxin
    Yi, Kechu
    [J]. Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2000, 27 (01): : 62 - 64
  • [27] Profile based compression of n-gram language models
    Olsen, Jesper
    Oria, Daniela
    [J]. 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 1041 - 1044
  • [28] Language Identification based on n-gram Frequency Ranking
    Cordoba, R.
    D'Haro, L. F.
    Fernandez-Martinez, F.
    Macias-Guarasa, J.
    Ferreiros, J.
    [J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 1921 - 1924
  • [29] Comparison of web-based unsupervised translation disambiguation word model and N-gram model
    Institute of Computational Linguistics, Peking University, Beijing 100871, China
    [J]. Dianzi Yu Xinxi Xuebao, 2009, 12 (2969-2974):
  • [30] A variable-length category-based n-gram language model
    Niesler, TR
    Woodland, PC
    [J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 164 - 167