Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation

Cited by: 12

Authors:

Zan Hongying [1,3]

Li Wenxin [1]

Zhang Kunli [1,3]

Ye Yajuan [1]

Chang Baobao [2,3]

Sui Zhifang [2,3]

Affiliations:

[1] Zhengzhou Univ, Sch Informat Engn, Zhengzhou 450001, Henan, Peoples R China

[2] Peking Univ, Key Lab Computat Linguist, Minist Educ, Beijing 100871, Peoples R China

[3] Peng Cheng Lab, Shenzhen 518052, Guangdong, Peoples R China

Source:

CHINESE LEXICAL SEMANTICS (CLSW 2020) | 2021 / Vol. 12278

Keywords:

Medical text; Medical word segmentation; Named entities; Annotation norms; Corpus construction; AGREEMENT

DOI:

10.1007/978-3-030-81197-6_55

Chinese Library Classification:

TP18 [Theory of artificial intelligence];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Word segmentation and named entity annotation are essential foundations for information extraction from medical text. This paper focuses on clinical pediatric diseases and takes existing medical named entity and entity-relation labeling systems as references. Under the guidance of Chinese word segmentation and named entity labeling norms, it constructs annotation specifications for pediatric medical texts. A self-developed distributed annotation platform is used to pre-annotate the named entities, which are then manually proofread over multiple rounds. The corpus consists of 38,805 medical entries in nine categories, including 504 entries for common pediatric diseases, 7,085 for body parts, 12,907 for clinical manifestations, and 4,354 for medical procedures. This yields the largest corpus with pediatric medical word segmentation and named entity annotation, which provides a data basis for related research.

Pages: 652-664

Page count: 13
