Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets

Cited by: 11
Authors
Vashishth S. [1]
Newman-Griffis D. [2]
Joshi R. [1]
Dutt R. [1]
Rosé C.P. [1]
Affiliations
[1] Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA
[2] University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Distant supervision; Entity typing; Information extraction; Medical concept normalization; Medical entity linking; Natural language processing
DOI
10.1016/j.jbi.2021.103880
Abstract
Objectives: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.

Methods: We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the problem of overgeneration of candidate concepts by filtering out irrelevant candidate concepts based on the predicted semantic type of a mention. We present MEDTYPE, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WIKIMED and PUBMEDDS, two large-scale datasets for medical entity linking.

Results: Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F-1. Further, pretraining MEDTYPE on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text.

Conclusions: Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.

© 2021 The Author(s)
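The filtering step described in the Methods can be illustrated with a minimal Python sketch. It is not the MEDTYPE implementation and does not use any toolkit's real API: the `Candidate` class, the `filter_by_type` and `link` functions, the concept identifiers, the scores, and the fallback to the unfiltered list when no candidate matches are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate concept for a mention (hypothetical structure)."""
    concept_id: str            # placeholder ID, not a real vocabulary code
    name: str
    semantic_types: frozenset  # semantic types assigned to the concept
    score: float               # score from the toolkit's candidate generator


def filter_by_type(candidates, predicted_types):
    """Keep candidates sharing at least one type with the prediction.

    Assumption: if nothing survives the filter, fall back to the
    unfiltered list so a wrong type prediction cannot remove every option.
    """
    kept = [c for c in candidates if c.semantic_types & predicted_types]
    return kept or list(candidates)


def link(candidates, predicted_types):
    """Stage 2 of linking: pick the best-scoring candidate after filtering."""
    pool = filter_by_type(candidates, predicted_types)
    return max(pool, key=lambda c: c.score) if pool else None


# Made-up candidates for the ambiguous mention "cold".
candidates = [
    Candidate("X001", "common cold", frozenset({"Disease or Syndrome"}), 0.48),
    Candidate("X002", "cold temperature", frozenset({"Natural Phenomenon or Process"}), 0.55),
    Candidate("X003", "chronic obstructive lung disease", frozenset({"Disease or Syndrome"}), 0.30),
]

# The type predictor (not shown) is assumed to return this set for the mention.
predicted = frozenset({"Disease or Syndrome"})
print(link(candidates, predicted).name)
```

In this toy case the highest-scoring candidate overall is "cold temperature", but once candidates of the wrong semantic type are filtered out, the linker picks "common cold" instead.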
Related papers
50 in total
  • [21] Semantic Entity-Relationship Model for Large-Scale Multimedia News Exploration and Recommendation
    Luo, Hangzai
    Cai, Peng
    Gong, Wei
    Fan, Jianping
    ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 522 - +
  • [22] DeepText2Go: Improving Large-scale Protein Function Prediction with Deep Semantic Text Representation
    You, Ronghui
    Zhu, Shanfeng
    2017 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2017, : 42 - 49
  • [23] Synergy conformal prediction applied to large-scale bioactivity datasets and in federated learning
    Ulf Norinder
    Ola Spjuth
    Fredrik Svensson
    Journal of Cheminformatics, 13
  • [24] Synergy conformal prediction applied to large-scale bioactivity datasets and in federated learning
    Norinder, Ulf
    Spjuth, Ola
    Svensson, Fredrik
    JOURNAL OF CHEMINFORMATICS, 2021, 13 (01)
  • [25] CarbonNet: Enterprise-Level Carbon Emission Prediction with Large-Scale Datasets
    Tang, Jinghua
    Fang, Nan
    Yang, Lanqing
    Pei, Yuqiao
    Wang, Ran
    Ding, Dian
    Lu, Yu
    Xue, Guangtao
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XII, ICIC 2024, 2024, 14873 : 411 - 422
  • [26] Applying latent semantic analysis to large-scale medical image databases
    Stathopoulos, Spyridon
    Kalamboukis, Theodore
    COMPUTERIZED MEDICAL IMAGING AND GRAPHICS, 2015, 39 : 27 - 34
  • [27] Prediction of Intellectual Disability From Developmental Milestones in Large-Scale Autism Datasets
    Nadig, Ajay
    van der Merwe, Celia
    Robinson, Elise
    NEUROPSYCHOPHARMACOLOGY, 2021, 46 (SUPPL 1) : 135 - 135
  • [28] Generating a Large-Scale Entity Linking Dictionary from Wikipedia Link Structure and Article Text
    Harige, Ravindra
    Buitelaar, Paul
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 2431 - 2434
  • [29] Retrieval From and Understanding of Large-Scale Multi-modal Medical Datasets: A Review
    Mueller, Henning
    Unay, Devrim
    IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) : 2093 - 2104
  • [30] Improving Young Stroke Prediction by Learning with Active Data Augmenter in a Large-Scale Electronic Medical Claims Database
    Hung, Chen-Ying
    Lin, Ching-Heng
    Lee, Chi-Chun
    2018 40TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2018, : 5362 - 5365