Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Cited by: 33
Authors
Zhu, Yongjun [1]
Yan, Erjia [2]
Wang, Fei [1]
Affiliations
[1] Cornell Univ, Weill Cornell Med, Healthcare Policy & Res, New York, NY 10021 USA
[2] Drexel Univ, Coll Comp & Informat, Philadelphia, PA 19104 USA
Keywords
Word2vec; Biomedical publications; PubMed; PubMed Central; Semantic similarity; Semantic relatedness; CLASSIFICATION; DOMAIN;
DOI
10.1186/s12911-017-0498-1
Chinese Library Classification (CLC) number
R-058
Abstract
Background: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec.

Methods: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. The performance of models trained on different subsets is compared to examine recency, size, and section effects.

Results: Training on more recent datasets did not improve performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and in the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the model trained on article bodies identified more pairs of biomedical terms than the one trained on abstracts (498 vs. 344 in the similarity task and 503 vs. 339 in the relatedness task).

Conclusions: Increasing the size of the dataset does not always enhance performance: it can result in the identification of more relations between biomedical terms, but it does not guarantee better precision. As summaries of research articles, abstracts excel in accuracy compared with article bodies but lose in coverage of identifiable relations.
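The evaluation described in the abstract reports two numbers per model: how many reference term pairs the model can score at all (coverage) and how well its cosine similarities correlate with the reference standard. A minimal sketch of that idea follows, assuming toy three-dimensional vectors in place of trained word2vec embeddings and assuming Spearman rank correlation; the abstract does not say which correlation coefficient was used. All terms, vectors, reference scores, and function names here are hypothetical and for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, reference):
    """Return (number of covered pairs, correlation on the covered pairs)."""
    model_scores, human_scores = [], []
    for (term_a, term_b), score in reference.items():
        # A pair is only scorable if both terms are in the vocabulary.
        if term_a in vectors and term_b in vectors:
            model_scores.append(cosine(vectors[term_a], vectors[term_b]))
            human_scores.append(score)
    return len(model_scores), spearman(model_scores, human_scores)


# Hypothetical embeddings standing in for a trained model's vocabulary.
vectors = {
    "heart": [0.9, 0.1, 0.0],
    "myocardium": [0.8, 0.2, 0.1],
    "kidney": [0.1, 0.9, 0.2],
    "renal": [0.2, 0.8, 0.1],
    "aspirin": [0.0, 0.2, 0.9],
}

# Hypothetical reference standard: term pair -> human-assigned score.
reference = {
    ("heart", "myocardium"): 3.5,
    ("kidney", "renal"): 3.8,
    ("heart", "kidney"): 1.5,
    ("aspirin", "heart"): 1.0,
    ("aspirin", "stroke"): 2.0,  # "stroke" is out of vocabulary
}

covered, rho = evaluate(vectors, reference)
print(f"covered pairs: {covered} of {len(reference)}, correlation: {rho:.2f}")
```

In this toy setup the pair containing "stroke" is skipped because the term has no vector, which is the sense in which a model trained on more text can identify more pairs without its correlation on the covered pairs necessarily improving.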
Pages: 8
Related papers
8 items in total
  • [1] Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec
    Yongjun Zhu
    Erjia Yan
    Fei Wang
    BMC Medical Informatics and Decision Making, 17
  • [2] Word Semantic Similarity Calculation Based on Word2vec
    Jin, Xiaolin
    Zhang, Shuwu
    Liu, Jie
    2018 INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND INFORMATION SCIENCES (ICCAIS), 2018, : 12 - 16
  • [3] Word Clustering based on Word2vec and Semantic Similarity
    Luo Jie
    Wang Qinglin
    Li Yuan
    2014 33RD CHINESE CONTROL CONFERENCE (CCC), 2014, : 517 - 521
  • [4] Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text
    McInnes, Bridget T.
    Pedersen, Ted
    JOURNAL OF BIOMEDICAL INFORMATICS, 2013, 46 (06) : 1116 - 1124
  • [5] Query Auto-Completion Based on Word2vec Semantic Similarity
    Shao, Taihua
    Chen, Honghui
    Chen, Wanyu
    2ND INTERNATIONAL CONFERENCE ON MACHINE VISION AND INFORMATION TECHNOLOGY (CMVIT 2018), 2018, 1004
  • [6] Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity
    Sugathadasa, Keet
    Ayesha, Buddhi
    de Silva, Nisansa
    Perera, Amal Shehan
    Jayawardana, Vindula
    Lakmal, Dimuthu
    Perera, Madhavi
    2017 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL AND INFORMATION SYSTEMS (ICIIS), 2017, : 58 - 63
  • [7] Using Word2vec Technique to Determine Semantic and Morphologic Similarity in Embedded Words of the Ukrainian Language
    Savytska, Larysa
    Vnukova, Nataliya
    Bezugla, Iryna
    Pyvovarov, Vasyl
    Subay, M. Turgut
    COLINS 2021: COMPUTATIONAL LINGUISTICS AND INTELLIGENT SYSTEMS, VOL I, 2021, 2870
  • [8] Experimental Comparison of Pre-Trained Word Embedding Vectors of Word2Vec, Glove, FastText for Word Level Semantic Text Similarity Measurement in Turkish
    Tulu, Cagatay Neftali
    ADVANCES IN SCIENCE AND TECHNOLOGY-RESEARCH JOURNAL, 2022, 16 (04) : 147 - 156