Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning

Cited by: 22
Authors
Zhang, Yaoyun [1 ]
Xu, Jun [1 ]
Chen, Hui [2 ]
Wang, Jingqi [1 ]
Wu, Yonghui [1 ]
Prakasam, Manu [3 ]
Xu, Hua [1 ]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA
[2] Capital Med Univ, Sch Biomed Engn, Beijing 100069, Peoples R China
[3] Mira Loma High Sch, Sacramento, CA 95821 USA
Funding
U.S. National Institutes of Health (NIH);
Keywords
HYBRID SYSTEM; INFORMATION; EXTRACTION; TEXT; DATABASE;
DOI
10.1093/database/baw049
Chinese Library Classification number
Q [Biological sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract
Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, few patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the CEMP and CPD subtasks using machine learning-based systems. These systems employed conditional random fields (CRFs) and structured support vector machines (SSVMs), respectively. To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated based on the patent titles and abstracts with chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both machine learning-based systems were investigated.
Our best system achieved the second-best performance among 21 participating teams in CEMP, with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94%, and was the top-performing system among nine participating teams in CPD, with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents.
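As a quick consistency check on the figures reported above, the sketch below recomputes the CEMP F-measure from the stated precision and recall, assuming the balanced F-measure (F1, the harmonic mean of precision and recall). The function names `f_measure` and `mcc` are illustrative and not taken from the paper. The MCC helper uses the standard confusion-matrix definition; the abstract does not give the underlying counts, so it is defined here but not applied to the CPD result.

```python
import math


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    The counts are inputs the caller must supply; the abstract does not
    report them.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Precision and recall (in percent) as reported for the CEMP subtask.
cemp_f = f_measure(87.18, 90.78)
print(f"CEMP F-measure: {cemp_f:.2f}%")
```

Under this assumption the recomputed value rounds to the 88.94% stated in the abstract.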
Pages: 10
Related papers
50 in total
  • [31] Leveraging Integrated Learning for Open-Domain Chinese Named Entity Recognition
    Diao J.
    Zhou Z.
    Shi G.
    International Journal of Crowd Science, 2022, 6 (02) : 74 - 79
  • [32] A hybrid deep learning framework for bacterial named entity recognition with domain features
    Li, Xusheng
    Fu, Chengcheng
    Zhong, Ran
    Zhong, Duo
    He, Tingting
    Jiang, Xingpeng
    BMC BIOINFORMATICS, 2019, 20 (Suppl 16)
  • [33] Chinese named entity recognition in the furniture domain based on ERNIE and adversarial learning
    Song, Yang
    Jia, Yanhe
    Zhang, Jian
    INTERNATIONAL JOURNAL OF WEB INFORMATION SYSTEMS, 2024,
  • [34] Named Entity Recognition Model Based on Feature Fusion
    Sun, Zhen
    Li, Xinfu
    INFORMATION, 2023, 14 (02)
  • [35] DOZEN: Cross-Domain Zero Shot Named Entity Recognition with Knowledge Graph
    Nguyen, Hoang Van
    Gelli, Francesco
    Poria, Soujanya
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1642 - 1646
  • [36] Learning In-context Learning for Named Entity Recognition
    Chen, Jiawei
    Lu, Yaojie
    Lin, Hongyu
    Lou, Jie
    Jia, Wei
    Dai, Dai
    Wu, Hua
    Cao, Boxi
    Han, Xianpei
    Sun, Le
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 13661 - 13675
  • [37] A Survey on Deep Learning for Named Entity Recognition
    Li, Jing
    Sun, Aixin
    Han, Jianglei
    Li, Chenliang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (01) : 50 - 70
  • [38] Named entity recognition based on deep learning
    Ji Z.
    Kong D.
    Liu W.
    Dong W.
    Sang Y.
    Jisuanji Jicheng Zhizao Xitong/Computer Integrated Manufacturing Systems, CIMS, 2022, 28 (06) : 1603 - 1615
  • [39] Turkish Named Entity Recognition with Deep Learning
    Gunes, Asim
    Tantug, A. Cuneyd
    2018 26TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2018,
  • [40] Transfer Learning for Indonesian Named Entity Recognition
    Kosasih, Joshua Aditya
    Khodra, Masayu Leylia
    2018 INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT INFORMATICS (SAIN), 2018, : 173 - 178