Bangla Word Clustering Based on N-gram Language Model

Cited by: 0
Authors
Ismail, Sabir [1 ]
Rahman, M. Shahidur [1 ]
Affiliations
[1] Shahjalal Univ Sci & Technol, Dept Comp Sci & Engn, Sylhet, Bangladesh
Keywords
word cluster; information retrieval; natural language processing; machine learning; n-gram model
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerization of Bangla language processing started long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique to develop Bangla word clusters based on their semantic and contextual similarity using an N-gram language model. According to the N-gram model, a word can be predicted based on the sequence of its previous and next words. The N-gram model has been applied successfully to word clustering in English and some other languages. Because word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our reasonably good results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
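The abstract does not specify the similarity measure or the clustering procedure, so the following is only a minimal sketch of the general idea that words sharing the same previous and next words can be grouped together. The function names, the Jaccard similarity, the 0.5 threshold, the greedy seed-based grouping, and the English toy corpus are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def context_sets(sentences):
    """Map each word to the set of (previous, next) word pairs it occurs between."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first and last token: they lack a previous or next word.
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_words(sentences, threshold=0.5):
    """Greedy grouping: a word joins the first cluster whose seed word
    has sufficiently similar contexts, otherwise it starts a new cluster."""
    contexts = context_sets(sentences)
    clusters = []  # each cluster is a list; its first element is the seed
    for word in sorted(contexts):
        for cluster in clusters:
            if jaccard(contexts[word], contexts[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Hypothetical toy corpus; a real run would use a large tokenized Bangla corpus.
toy_corpus = [
    "i eat rice daily",
    "i eat bread daily",
    "i drink water daily",
    "i drink milk daily",
]
clusters = cluster_words(toy_corpus)
print(clusters)
```

In this toy run, "rice" and "bread" fall into one cluster because both occur between "eat" and "daily", and "water" and "milk" are grouped for the analogous reason. A realistic system would need frequency weighting and a much larger corpus.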
Pages: 5