Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Cited by: 26
Authors
He, Bin [1]
Dong, Bin [2]
Guan, Yi [1]
Yang, Jinfeng [3]
Jiang, Zhipeng [1]
Yu, Qiubin [4]
Cheng, Jianyi [1]
Qu, Chunyan [1]
Affiliations
[1] Harbin Institute of Technology, School of Computer Science and Technology, Harbin, Heilongjiang, People's Republic of China
[2] Ricoh Software Research Center Beijing, Beijing, People's Republic of China
[3] Harbin University of Science and Technology, School of Software, Harbin, Heilongjiang, People's Republic of China
[4] Harbin Medical University, Second Affiliated Hospital, Medical Records Room, Harbin, Heilongjiang, People's Republic of China
Keywords
Chinese clinical texts; Corpus construction; Guideline development; Annotation method; Natural language processing; Named entity recognition; Information
DOI
10.1016/j.jbi.2017.04.006
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, together with the corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.
Materials and methods: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus.
Results: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
Discussion: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, the corpus has a number of limitations. Additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency.
Conclusions: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
© 2017 Published by Elsevier Inc.
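The abstract reports that inter-annotator agreement (IAA) was calculated but does not name the metric. As a minimal sketch only, and not the authors' method, two commonly used measures are shown here: Cohen's kappa for token-level labels such as POS tags, and pairwise exact-match F1 for entity annotations. The function names, the tag values, and the entity type names are hypothetical and chosen for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pairwise_f1(entities_a, entities_b):
    """Exact-match F1 between two annotators' entity sets.

    Each entity is a (start, end, type) tuple; annotator A is treated
    as the reference.
    """
    set_a, set_b = set(entities_a), set(entities_b)
    if not set_a and not set_b:
        return 1.0
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical POS tags from two annotators for four tokens.
    print(cohen_kappa(["N", "V", "N", "N"], ["N", "V", "V", "N"]))
    # Hypothetical entity spans; the second span differs in its end offset.
    print(pairwise_f1([(0, 2, "disease"), (5, 7, "symptom")],
                      [(0, 2, "disease"), (5, 8, "symptom")]))
```

Kappa corrects for chance agreement and suits tasks where every token receives a label. F1 is often preferred for entity spans, where the number of "negative" items is not well defined.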
Pages: 203-217
Number of pages: 15