Word Embedding based Textual Semantic Similarity Measure in Bengali

Cited by: 4
Authors
Iqbal, Md Asif [1]
Sharif, Omar [1]
Hoque, Mohammed Moshiul [1]
Sarker, Iqbal H. [1]
Affiliation
[1] Chittagong University of Engineering & Technology, Department of Computer Science & Engineering, Chattogram 4349, Bangladesh
Keywords
Natural language processing; Textual semantic similarity; Word embedding; Cosine similarity; Part-of-speech weighting
DOI
10.1016/j.procs.2021.10.010
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Textual semantic similarity is a crucial constituent of many NLP tasks such as information retrieval, machine translation, and textual forgery detection. Measuring semantic similarity in low-resource languages is difficult for rule-based techniques because of complex morphological structure and the scarcity of linguistic resources. This paper investigates several word embedding techniques (Word2Vec, GloVe, FastText) to estimate the semantic similarity of Bengali sentences. Because no standard dataset is available, this work developed a Bengali corpus of 187,031 text documents containing 400,824 unique words. Moreover, this work considers three variants of cosine similarity for comparing the word vectors: no weighting, term frequency weighting, and part-of-speech weighting. The performance of the proposed approach is evaluated on a developed dataset of 50 pairs of Bengali sentences. The evaluation results show that FastText with continuous bag-of-words and a vector size of 100 achieved the highest Pearson's correlation (rho) score of 77.28%. © 2021 The Authors. Published by Elsevier B.V.
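The abstract does not say how word vectors are combined into a sentence-level score. A minimal sketch, assuming each sentence is represented by a weighted average of its word vectors and the two averages are compared with cosine similarity, might look like the following. The names `sentence_vector`, `cosine_similarity`, `toy_embeddings`, and the weight values are hypothetical and used only for illustration. In practice the vectors would come from a trained Word2Vec, GloVe, or FastText model, and the weights from term frequencies or part-of-speech tags.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def sentence_vector(
    tokens: Sequence[str],
    embeddings: Dict[str, Vector],
    weights: Optional[Dict[str, float]] = None,
) -> Vector:
    """Weighted average of the vectors of all in-vocabulary tokens.

    With weights=None every token counts equally (the "no weight" case).
    Passing per-token weights mimics term-frequency or POS weighting.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary tokens are skipped
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional embeddings (made up, not taken from the paper).
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

# Hypothetical weights, e.g. higher for nouns under POS weighting.
toy_weights = {"cat": 3.0, "car": 1.0}

s1 = sentence_vector(["cat", "car"], toy_embeddings)               # unweighted
s2 = sentence_vector(["cat", "car"], toy_embeddings, toy_weights)  # weighted
s3 = sentence_vector(["dog"], toy_embeddings)

print(cosine_similarity(s1, s3))
print(cosine_similarity(s2, s3))
```

In this toy setup, raising the weight of "cat" pulls the sentence vector toward "dog", so the weighted score is higher than the unweighted one. This illustrates how a weighting scheme can change the similarity estimate without changing the underlying embeddings.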
Pages: 92-101
Page count: 10