Bangla Word Clustering Based on N-gram Language Model

Cited by: 0

Authors:

Ismail, Sabir [1]

Rahman, M. Shahidur [1]

Affiliations:

[1] Shahjalal Univ Sci & Technol, Dept Comp Sci & Engn, Sylhet, Bangladesh

Source:

2014 1ST INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATION & COMMUNICATION TECHNOLOGY (ICEEICT 2014) | 2014

Keywords:

word cluster; information retrieval; natural language processing; machine learning; n-gram model;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

In this paper, we describe a method for producing Bangla word clusters based on semantic and contextual similarity. Word clustering is important for part-of-speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery, and many other Natural Language Processing (NLP) applications. Computerization of Bangla language processing started long ago, but it is still at an early stage and suffers from resource scarcity. We propose an unsupervised machine learning technique to develop Bangla word clusters based on their semantic and contextual similarity using an N-gram language model. According to the N-gram model, a word can be predicted from the sequence of its previous and next words. The N-gram model has been applied successfully to word clustering in English and some other languages. As word clustering is a new direction in Bangla language processing research, we consider this approach a good starting point, and our results support that assumption. We produced 456 clusters using a locally available large Bangla corpus. Subjective scores derived from the clusters reveal strong similarity among the words in the same cluster.
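
The idea in the abstract, that words sharing the same previous and next words tend to belong together, can be illustrated with a minimal sketch. This is not the authors' algorithm: the record gives no implementation details, so the Jaccard similarity measure, the 0.5 threshold, the greedy grouping, the function names, and the toy English corpus (standing in for a Bangla corpus) are all assumptions made for illustration.

```python
from collections import defaultdict


def context_sets(sentences):
    """Map each word to the set of (previous word, next word) pairs seen around it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so first and last words also get a context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            contexts[word].add((prev, nxt))
    return contexts


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_words(sentences, threshold=0.5):
    """Greedily group words whose context sets overlap with a cluster's first word."""
    contexts = context_sets(sentences)
    clusters = []
    for word in sorted(contexts):
        for cluster in clusters:
            if jaccard(contexts[word], contexts[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([word])
    return clusters


# Hypothetical toy corpus; a real run would use tokenized Bangla text.
toy_corpus = [
    "i eat rice", "i eat fish", "we eat rice", "we eat fish",
    "i cook rice", "we cook fish",
]
clusters = cluster_words(toy_corpus)
print(clusters)
```

On this toy corpus the verbs, the nouns, and the pronouns end up in separate groups because each group shares its neighbouring words. A real system would work from N-gram counts over a large corpus and would need a more robust similarity measure than raw set overlap.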


Pages: 5
