Data-Driven Information Extraction from Chinese Electronic Medical Records

Cited by: 12
Authors
Xu, Dong [1]
Zhang, Meizhuo [2]
Zhao, Tianwan [1]
Ge, Chen [1]
Gao, Weiguo [2]
Wei, Jia [2]
Zhu, Kenny Q. [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] AstraZeneca, R&D Informat China, Shanghai 201203, Peoples R China
Source
PLOS ONE | 2015 / Vol. 10 / Issue 08
Keywords
NAMED ENTITY RECOGNITION;
DOI
10.1371/journal.pone.0136270
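Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code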
07 ; 0710 ; 09 ;
Abstract
Objective: This study proposes a data-driven framework that takes unstructured free-text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event.
Materials and Methods: Our framework uses a hybrid approach. It consists of constructing cross-domain core medical lexica, an unsupervised, iterative algorithm to accrue more accurate terms into the lexica, rules to address Chinese writing conventions and temporal descriptors, and a Support Vector Machine (SVM) algorithm that innovatively utilizes Normalized Google Distance (NGD) to estimate the correlation between medical events and their descriptions.
Results: The effectiveness of the framework was demonstrated with a dataset of 24,817 de-identified Chinese EMRs. The cross-domain medical lexica were capable of recognizing terms with an F1-score of 0.896. Of the recorded medical events, 98.5% were linked to temporal descriptors. The NGD SVM description-event matching achieved an F1-score of 0.874. The end-to-end time-event-description extraction of our framework achieved an F1-score of 0.846.
Discussion: In terms of named entity recognition, the proposed framework outperforms state-of-the-art supervised learning algorithms (F1-score: 0.896 vs. 0.886). In event-description association, the NGD SVM is superior to an SVM using only local context and semantic features (F1-score: 0.874 vs. 0.838).
Conclusions: The framework is data-driven, weakly supervised, and robust against the variations and noise that tend to occur in a large corpus. It addresses Chinese medical writing conventions and variations in writing styles through patterns used for discovering new terms and rules for updating the lexica.
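The abstract names Normalized Google Distance (NGD) as the signal the SVM uses to relate events and descriptions, but does not say how the counts are obtained or how the score enters the feature vector. As a minimal sketch of the standard NGD definition only, assuming occurrence counts are already available from some corpus, the function below computes the distance from raw frequencies. The function name `ngd`, its parameter names, and the example counts are hypothetical and not taken from the paper.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Standard Normalized Google Distance from raw counts.

    fx, fy : number of documents containing term x, term y
    fxy    : number of documents containing both terms
    n      : total number of documents (hypothetical corpus size)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller term count")
    if fxy <= 0:
        # Terms never co-occur: the distance is unbounded.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts: values near 0 mean the terms nearly always co-occur.
always_together = ngd(100, 100, 100, 10_000)
rarely_together = ngd(1_000, 100, 10, 100_000)
```

A smaller value indicates a stronger association; in a setup like the one the abstract describes, such a score could serve as one numeric feature among others, though the abstract does not specify the exact feature design.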
Pages: 18