Bangla Word Clustering Based on N-gram Language Model

Authors
Ismail, Sabir [1]
Rahman, M. Shahidur [1]
Affiliations
[1] Shahjalal Univ Sci & Technol, Dept Comp Sci & Engn, Sylhet, Bangladesh
Keywords
word cluster; information retrieval; natural language processing; machine learning; n-gram model
DOI
Not available
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computational processing of the Bangla language began long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique that builds Bangla word clusters from their semantic and contextual similarity using an N-gram language model. According to the N-gram model, a word can be predicted from the sequence of words that precede and follow it. The N-gram model has been applied successfully to word clustering in English and some other languages. Because word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
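The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea it mentions: words that occur between the same previous and next words are treated as contextually similar. Everything here is an assumption made for illustration, not the authors' method: the function names `context_sets` and `cluster_words`, the `min_shared` threshold, the single-link grouping, and the toy English corpus standing in for a Bangla corpus.

```python
from collections import defaultdict
from itertools import combinations


def context_sets(sentences):
    """Map each word to the set of (previous, next) word pairs it occurs between."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so the first and last words also get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            contexts[word].add((prev, nxt))
    return contexts


def cluster_words(sentences, min_shared=1):
    """Group words sharing at least `min_shared` trigram contexts (single-link).

    Hypothetical illustration only; the threshold and linkage are assumptions.
    """
    contexts = context_sets(sentences)
    parent = {w: w for w in contexts}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(contexts), 2):
        if len(contexts[a] & contexts[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for w in contexts:
        clusters[find(w)].add(w)
    # Keep only groups with more than one member.
    return [c for c in clusters.values() if len(c) > 1]


# Toy corpus (an assumption; a real run would use a large Bangla corpus).
corpus = [
    "i eat rice daily",
    "i eat bread daily",
    "we drink water daily",
    "we drink milk daily",
]
clusters = cluster_words(corpus)
```

On this toy input, "rice" and "bread" share the context (eat, daily), and "water" and "milk" share (drink, daily), so each pair ends up in its own cluster. A realistic setup would likely weight contexts by frequency and use a stricter threshold than a single shared context.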
Pages: 5