Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets

Cited by: 11
Authors
Vashishth S. [1]
Newman-Griffis D. [2]
Joshi R. [1]
Dutt R. [1]
Rosé C.P. [1]
Affiliations
[1] Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA
[2] University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA
Funding
National Science Foundation (US); National Institutes of Health (US)
Keywords
Distant supervision; Entity typing; Information extraction; Medical concept normalization; Medical entity linking; Natural language processing;
DOI
10.1016/j.jbi.2021.103880
Abstract
Objectives: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction, that is, extracting medical information of all types from a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

Methods: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module that alleviates the overgeneration of candidate concepts by filtering out candidates that are irrelevant given the predicted semantic type of a mention. We present MEDTYPE, a fully modular semantic type prediction model, which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WIKIMED and PUBMEDDS, two large-scale datasets for medical entity linking.

Results: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F1. Further, pretraining MEDTYPE on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research. © 2021 The Author(s)
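The filtering step described under Methods can be illustrated with a minimal Python sketch. It is not the MEDTYPE implementation: the `Candidate` class, the `filter_by_type` and `link` functions, the concept identifiers, and the type labels are all hypothetical names introduced here, and the fallback to the unfiltered list when no candidate matches is an assumption of this sketch rather than something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str     # placeholder identifier, not a real vocabulary code
    semantic_type: str  # coarse semantic type label (hypothetical)
    score: float        # ranking score assigned by the underlying toolkit


def filter_by_type(candidates, predicted_types):
    """Keep candidates whose semantic type is among the predicted types.

    Assumption of this sketch: if filtering would discard every candidate,
    return the original list unchanged so the linker can still answer.
    """
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    return kept if kept else list(candidates)


def link(candidates, predicted_types):
    """Pick the highest-scoring candidate that survives type filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Two candidate concepts for an ambiguous mention such as "cold".
example_candidates = [
    Candidate("CONCEPT_COLD_TEMPERATURE", "Phenomena", 0.9),
    Candidate("CONCEPT_COMMON_COLD", "Disorders", 0.8),
]
best = link(example_candidates, {"Disorders"})
```

Without filtering, the higher-scoring but irrelevant candidate would be chosen; with the predicted type set restricted to disorders, the lower-scoring relevant candidate is selected instead.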