Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Cited by: 26

Authors:

He, Bin [1]

Dong, Bin [2]

Guan, Yi [1]

Yang, Jinfeng [3]

Jiang, Zhipeng [1]

Yu, Qiubin [4]

Cheng, Jianyi [1]

Qu, Chunyan [1]

Affiliations:

[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Heilongjiang, Peoples R China

[2] Ricoh Software Res Ctr Beijing, Beijing, Peoples R China

[3] Harbin Univ Sci & Technol, Sch Software, Harbin, Heilongjiang, Peoples R China

[4] Harbin Med Univ, Affiliated Hosp 2, Med Records Room, Harbin, Heilongjiang, Peoples R China

Source:

JOURNAL OF BIOMEDICAL INFORMATICS | 2017 / Vol. 69

Keywords:

Chinese clinical texts; Corpus construction; Guideline development; Annotation method; Natural language processing; Named entity recognition; Information

DOI:

10.1016/j.jbi.2017.04.006

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Objective: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, together with corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain. Materials and methods: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus. Results: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality, and the system modules are effective. Discussion: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency. Conclusions: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain. © 2017 Published by Elsevier Inc.


Pages: 203-217

Number of pages: 15
