A Corpus Based Unsupervised Bangla Word Stemming Using N-Gram Language Model

Cited: 0
Authors
Urmi, Tapashee Tabassum [1 ]
Jammy, Jasmine Jahan [1 ]
Ismail, Sabir [1 ]
Institutions
[1] Shahjalal Univ Sci & Technol Sylhet, Dept Comp Sci & Engn, Sylhet 3114, Bangladesh
Keywords
unsupervised learning; natural language processing; n-gram model; root word; stemming;
DOI
Not available
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline Classification Codes
0808; 0809;
Abstract
In this paper, we propose a contextual-similarity-based approach for identifying the stems, or root forms, of Bangla words using an N-gram language model. The core purpose of our work is to build a large corpus of Bangla stems with their corresponding inflectional forms. Identifying the stem form of a word is generally called stemming, and a tool that identifies stems is called a stemmer. Stemmers are important mainly in information retrieval systems, recommender systems, spell checkers, search engines, and other Natural Language Processing applications. We selected the N-gram model for stem detection based on the assumption that if two words exhibit a certain degree of spelling similarity and show contextual similarity across many sentences, they have a higher probability of originating from the same root. We implemented a 6-gram model for the stem identification procedure and achieved 40.18% accuracy on our corpus.
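The core heuristic in the abstract, that two words likely share a root when they are both similar in spelling and appear in similar n-gram contexts, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy corpus, the prefix-based spelling measure, the Jaccard contextual measure, and both thresholds are assumptions for demonstration only.

```python
# Illustrative sketch of spelling + contextual similarity for root detection.
# All measures and thresholds here are hypothetical stand-ins, not the
# paper's actual 6-gram procedure.
from collections import defaultdict


def spelling_similarity(w1, w2):
    """Longest common prefix length, normalized by the longer word's length."""
    n = 0
    for a, b in zip(w1, w2):
        if a != b:
            break
        n += 1
    return n / max(len(w1), len(w2))


def context_profiles(sentences, n=2):
    """Map each word to the set of words appearing within n positions of it."""
    profiles = defaultdict(set)
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            profiles[w].update(words[max(0, i - n):i])   # left context
            profiles[w].update(words[i + 1:i + 1 + n])   # right context
    return profiles


def contextual_similarity(p1, p2):
    """Jaccard overlap of two context sets."""
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / len(p1 | p2)


def same_root(w1, w2, profiles, spell_thr=0.5, ctx_thr=0.1):
    """Heuristic: same root if both similarity thresholds are met."""
    return (spelling_similarity(w1, w2) >= spell_thr
            and contextual_similarity(profiles[w1], profiles[w2]) >= ctx_thr)


# Toy English stand-in for a Bangla corpus
sentences = ["the boys play in the field", "the boy plays in the field"]
profiles = context_profiles(sentences)
print(same_root("play", "plays", profiles))   # spelling 0.8, contexts overlap
print(same_root("field", "play", profiles))   # no shared prefix
```

In the sketch, "play" and "plays" pass both thresholds, while "field" and "play" fail on spelling similarity; a full system would cluster an entire vocabulary this way and pick the shortest member of each cluster as the stem.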
Pages: 824-828
Number of pages: 5
Related Papers
(50 records)
  • [21] Polish Word Recognition Based on n-Gram Methods
    Wojcicki, Piotr
    Zientarski, Tomasz
    [J]. IEEE ACCESS, 2024, 12 : 49817 - 49825
  • [22] A Novel Interpolated N-gram Language Model Based on Class Hierarchy
    Lv, Zhenyu
    Liu, Wenju
    Yang, Zhanlei
    [J]. IEEE NLP-KE 2009: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2009, : 473 - 477
  • [23] Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji
    Hamarashid, Hozan K.
    Saeed, Soran A.
    Rashid, Tarik A.
    [J]. NEURAL COMPUTING & APPLICATIONS, 2021, 33 (09): : 4547 - 4566
  • [25] IMPROVEMENTS TO N-GRAM LANGUAGE MODEL USING TEXT GENERATED FROM NEURAL LANGUAGE MODEL
    Suzuki, Masayuki
    Itoh, Nobuyasu
    Nagano, Tohru
    Kurata, Gakuto
    Thomas, Samuel
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7245 - 7249
  • [26] Combination of Random Indexing based Language Model and N-gram Language Model for Speech Recognition
    Fohr, Dominique
    Mella, Odile
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2231 - 2235
  • [27] Politics and the German language: Testing Orwell's hypothesis using the Google N-Gram corpus
    Caruana-Galizia, Paul
    [J]. DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2016, 31 (03) : 441 - 456
  • [28] Using the Google N-Gram corpus to measure cultural complexity
    Juola, Patrick
    [J]. LITERARY AND LINGUISTIC COMPUTING, 2013, 28 (04): : 668 - 675
  • [29] Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model
    Tian, Jinchuan
    Yu, Jianwei
    Weng, Chao
    Zou, Yuexian
    Yu, Dong
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 812 - 816
  • [30] Product Reviews based on Location using N-gram model
    Varma, Kajal S.
    Mahajan, Arpana
    Degadwala, Sheshang D.
    [J]. PROCEEDINGS OF THE 2018 3RD INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTATION TECHNOLOGIES (ICICT 2018), 2018, : 100 - 104