A Corpus Based Unsupervised Bangla Word Stemming Using N-Gram Language Model

Cited by: 0
Authors
Urmi, Tapashee Tabassum [1 ]
Jammy, Jasmine Jahan [1 ]
Ismail, Sabir [1 ]
Affiliations
[1] Shahjalal University of Science and Technology, Department of Computer Science and Engineering, Sylhet 3114, Bangladesh
Keywords
unsupervised learning; natural language processing; n-gram model; root word; stemming;
DOI
Not available
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
In this paper, we propose a contextual-similarity-based approach for identifying the stems, or root forms, of Bangla words using an N-gram language model. The core purpose of our work is to build a large corpus of Bangla stems together with their corresponding inflectional forms. Identifying the stem form of a word is generally called stemming, and a tool that identifies stems is called a stemmer. Stemmers are important mainly in information retrieval systems, recommender systems, spell checkers, search engines, and other Natural Language Processing applications. We selected the N-gram model for stem detection based on the assumption that if two words exhibit a certain degree of spelling similarity and show contextual similarity across many sentences, they have a higher probability of originating from the same root. We implemented a 6-gram model for the stem identification procedure and achieved 40.18% accuracy on our corpus.
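The idea described in the abstract can be sketched as follows: collect each word's n-gram context, compare candidate word pairs by spelling overlap and context overlap, and treat pairs that pass both tests as inflections of a shared stem. The snippet below is a minimal illustrative sketch of that idea, not the paper's implementation; the thresholds, the longest-common-prefix spelling measure, the Jaccard context measure, and all function names are assumptions made here for illustration.

```python
# Illustrative sketch of contextual-similarity stemming with an n-gram window.
# Thresholds and similarity measures are assumed, not taken from the paper.
from collections import defaultdict
from itertools import combinations

def ngram_contexts(sentences, n=6):
    """Collect, for each word, the set of words seen within n-1 positions of it."""
    contexts = defaultdict(set)
    for sent in sentences:
        tokens = sent.split()
        for i, w in enumerate(tokens):
            for neighbour in tokens[max(0, i - n + 1): i + n]:
                if neighbour != w:
                    contexts[w].add(neighbour)
    return contexts

def spelling_similarity(a, b):
    """Longest common prefix length, normalised by the longer word's length."""
    lcp = 0
    for x, y in zip(a, b):
        if x != y:
            break
        lcp += 1
    return lcp / max(len(a), len(b))

def contextual_similarity(ctx_a, ctx_b):
    """Jaccard overlap of two context sets (an assumed similarity measure)."""
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)

def cluster_stems(sentences, spell_thr=0.6, ctx_thr=0.2, n=6):
    """Map each word to a candidate stem: words whose spelling and contextual
    similarities both pass the thresholds share the shorter form as their stem."""
    contexts = ngram_contexts(sentences, n)
    stem_of = {w: w for w in contexts}
    for a, b in combinations(sorted(contexts), 2):
        if (spelling_similarity(a, b) >= spell_thr and
                contextual_similarity(contexts[a], contexts[b]) >= ctx_thr):
            stem = min(stem_of[a], stem_of[b], key=len)
            stem_of[a] = stem_of[b] = stem
    return stem_of
```

Running cluster_stems over a tokenized Bangla corpus would map inflected forms that share a long prefix and occur in overlapping 6-gram contexts to the shorter surface form as their candidate stem; the reported 40.18% accuracy refers to the paper's own corpus and exact formulation, not to this sketch.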
Pages: 824-828
Page count: 5
Related Papers
50 records in total
  • [1] Bangla Word Clustering Based on N-gram Language Model
    Ismail, Sabir; Rahman, M. Shahidur
    2014 1st International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT 2014), 2014
  • [2] A Framework for Word Clustering of Bangla Sentences Using Higher Order N-gram Language Model
    Husna, Asmaul; Mostofa, Maliha; Khatun, Ayesha; Islam, Jahidul; Mahin, Md.
    2018 International Conference on Innovation in Engineering and Technology (ICIET), 2018
  • [3] Unsupervised Language Model Adaptation Using N-gram Weighting
    Haidar, Md. Akmal; O'Shaughnessy, Douglas
    2011 24th Canadian Conference on Electrical and Computer Engineering (CCECE), 2011: 857-860
  • [4] A language independent n-gram model for word segmentation
    Kang, Seung-Shik; Hwang, Kyu-Baek
    AI 2006: Advances in Artificial Intelligence, Proceedings, 2006, 4304: 557-+
  • [5] An N-gram based model for predicting of word-formation in Assamese language
    Bhuyan, M. P.; Sarma, S. K.
    Journal of Information & Optimization Sciences, 2019, 40(2): 427-440
  • [6] Unsupervised word sense disambiguation with N-gram features
    Preotiuc-Pietro, Daniel; Hristea, Florentina
    Artificial Intelligence Review, 2014, 41(2): 241-260
  • [7] Oxymoron generation using an association word corpus and a large-scale N-gram corpus
    Yamane, Hiroaki; Hagiwara, Masafumi
    Soft Computing, 2015, 19(4): 919-927
  • [8] Similar N-gram Language Model
    Gillot, Christian; Cerisara, Christophe; Langlois, David; Haton, Jean-Paul
    11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Vols 3 and 4, 2010: 1824-1827