Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation

Cited by: 12
Authors
Zan Hongying [1, 3]
Li Wenxin [1]
Zhang Kunli [1, 3]
Ye Yajuan [1]
Chang Baobao [2, 3]
Sui Zhifang [2, 3]
Affiliations
[1] Zhengzhou Univ, Sch Informat Engn, Zhengzhou 450001, Henan, Peoples R China
[2] Peking Univ, Key Lab Computat Linguist, Minist Educ, Beijing 100871, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518052, Guangdong, Peoples R China
Keywords
Medical text; Medical word segmentation; Named entities; Annotation norms; Corpus construction; AGREEMENT;
DOI
10.1007/978-3-030-81197-6_55
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Word segmentation and named entity annotation are essential foundations for information extraction from medical text. This paper focuses on clinical pediatric diseases and takes existing medical named entity and entity-relationship labeling systems as references. Drawing on Chinese word segmentation and named entity labeling, it constructs annotation specifications for pediatric medical texts. A self-developed distributed annotation platform is used to pre-annotate the named entities, which are then manually proofread over multiple rounds. The corpus consists of 38,805 medical entries in nine categories, including 504 entries for common pediatric diseases, 7,085 for body parts, 12,907 for clinical manifestations, and 4,354 for medical procedures. This paper constructs the largest corpus with pediatric medical word segmentation and named entity annotation, providing a data basis for related research.
Pages: 652-664
Page count: 13