Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Cited by: 26
Authors
He, Bin [1]
Dong, Bin [2]
Guan, Yi [1]
Yang, Jinfeng [3]
Jiang, Zhipeng [1]
Yu, Qiubin [4]
Cheng, Jianyi [1]
Qu, Chunyan [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Heilongjiang, Peoples R China
[2] Ricoh Software Res Ctr Beijing, Beijing, Peoples R China
[3] Harbin Univ Sci & Technol, Sch Software, Harbin, Heilongjiang, Peoples R China
[4] Harbin Med Univ, Affiliated Hosp 2, Med Records Room, Harbin, Heilongjiang, Peoples R China
Keywords
Chinese clinical texts; Corpus construction; Guideline development; Annotation method; Natural language processing; Named entity recognition; Information
DOI
10.1016/j.jbi.2017.04.006
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: To build a comprehensive corpus of Chinese clinical texts covering syntactic and semantic annotations, together with the corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.
Materials and methods: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus.
Results: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
Discussion: The annotated corpus makes a considerable contribution to natural language processing (NLP) research on Chinese texts in the clinical domain. However, the corpus has a number of limitations: additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency.
Conclusions: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
© 2017 Published by Elsevier Inc.
Pages: 203-217
Page count: 15