Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning

Cited by: 22
Authors
Zhang, Yaoyun [1]
Xu, Jun [1]
Chen, Hui [2]
Wang, Jingqi [1]
Wu, Yonghui [1]
Prakasam, Manu [3]
Xu, Hua [1]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA
[2] Capital Med Univ, Sch Biomed Engn, Beijing 100069, Peoples R China
[3] Mira Loma High Sch, Sacramento, CA 95821 USA
Funding
US National Institutes of Health;
Keywords
HYBRID SYSTEM; INFORMATION; EXTRACTION; TEXT; DATABASE;
DOI
10.1093/database/baw049
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, few patent mining systems are publicly available, probably because large manually annotated corpora are lacking. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the CEMP and CPD subtasks using machine learning-based systems. Our systems were built on two algorithms: conditional random fields (CRFs) and structured support vector machines (SSVMs). To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts in which chemicals were recognized in the CEMP task. The effects of the proposed feature strategies on both machine learning-based systems were investigated.
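The dependency of CPD on CEMP described in the abstract can be pictured with a minimal sketch. It assumes, as the simplest reading of the abstract, that a title or abstract is flagged as chemical-bearing when the CEMP recognizer returns at least one mention. The `recognize_chemicals` function and its toy lexicon are hypothetical stand-ins, not the authors' CRF or SSVM models.

```python
import re
from typing import Callable, List

# Hypothetical toy lexicon standing in for a trained CEMP tagger (assumption).
TOY_LEXICON = {"aspirin", "ibuprofen", "benzene"}


def recognize_chemicals(text: str) -> List[str]:
    """Return chemical mentions found in text via a toy dictionary lookup."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    return [tok for tok in tokens if tok.lower() in TOY_LEXICON]


def passage_has_chemical(
    text: str, tagger: Callable[[str], List[str]] = recognize_chemicals
) -> bool:
    """Label a passage positive if the tagger finds at least one mention."""
    return len(tagger(text)) > 0
```

Any real tagger with the same call shape could be passed in as `tagger`; the dictionary lookup here only keeps the example self-contained.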
Our best system achieved the second-best performance among 21 participating teams in CEMP, with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94%, and was the top-performing system among nine participating teams in CPD, with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents.
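As a quick consistency check on the reported CEMP scores, the sketch below computes the balanced F-measure as the harmonic mean of precision and recall. It assumes the reported F-measure is the balanced (F1) variant; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the best CEMP run, in percent.
cemp_f = f_measure(87.18, 90.78)
print(round(cemp_f, 2))  # 88.94
```

The result agrees with the 88.94% F-measure stated in the abstract.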
Pages: 10
Related papers
50 records in total
  • [1] Unsupervised Ranking of Knowledge Bases for Named Entity Recognition
    Mrabet, Yassine
    Kilicoglu, Halil
    Demner-Fushman, Dina
    ECAI 2016: 22ND EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, 285 : 1248 - 1255
  • [2] Medical Named Entity Recognition with Domain Knowledge
    Pei W.
    Sun S.
    Li X.
    Lu J.
    Yang L.
    Wu Y.
    Data Analysis and Knowledge Discovery, 2023, 7 (03) : 142 - 154
  • [3] Character Feature Learning for Named Entity Recognition
    Zeng, Ping
    Tan, Qingping
    Zhang, Haoyu
    Meng, Xiankai
    Zhang, Zhuo
    Xu, Jianjun
    Lei, Yan
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2018, E101D (07) : 1811 - 1815
  • [4] Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
    Tsendsuren Munkhdalai
    Meijing Li
    Khuyagbaatar Batsuren
    Hyeon Ah Park
    Nak Hyeon Choi
    Keun Ho Ryu
    Journal of Cheminformatics, 7
  • [5] Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
    Munkhdalai, Tsendsuren
    Li, Meijing
    Batsuren, Khuyagbaatar
    Park, Hyeon Ah
    Choi, Nak Hyeon
    Ryu, Keun Ho
    JOURNAL OF CHEMINFORMATICS, 2015, 7
  • [6] Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings
    Zhai, Zenan
    Dat Quoc Nguyen
    Akhondi, Saber A.
    Thorne, Camilo
    Druckenbrodt, Christian
    Cohn, Trevor
    Gregory, Michelle
    Verspoor, Karin
    SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), 2019, : 328 - 338
  • [7] Domain Adaptation with Active Learning for Named Entity Recognition
    Sun, Huiyu
    Grishman, Ralph
    Wang, Yingchao
    CLOUD COMPUTING AND SECURITY, ICCCS 2016, PT II, 2016, 10040 : 611 - 622
  • [8] Named entity recognition in medical domain combined with knowledge graph
    Jin Z.
    He X.
    Yue S.
    Xiong Y.
    Luo J.
    Harbin Gongye Daxue Xuebao/Journal of Harbin Institute of Technology, 2023, 55 (05) : 50 - 58
  • [9] Named Entity Recognition in Biology Literature Based on Unsupervised Domain Adaptation Method
    Xu, Xingjian
    Liu, Fang
    Meng, Fanjun
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2022, PT III, 2022, 13370 : 426 - 437
  • [10] Unsupervised cross-domain named entity recognition using entity-aware adversarial training
    Peng, Qi
    Zheng, Changmeng
    Cai, Yi
    Wang, Tao
    Xie, Haoran
    Li, Qing
    NEURAL NETWORKS, 2021, 138 (138) : 68 - 77