Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Cited by: 26
Authors
He, Bin [1]
Dong, Bin [2]
Guan, Yi [1]
Yang, Jinfeng [3]
Jiang, Zhipeng [1]
Yu, Qiubin [4]
Cheng, Jianyi [1]
Qu, Chunyan [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Heilongjiang, Peoples R China
[2] Ricoh Software Res Ctr Beijing, Beijing, Peoples R China
[3] Harbin Univ Sci & Technol, Sch Software, Harbin, Heilongjiang, Peoples R China
[4] Harbin Med Univ, Affiliated Hosp 2, Med Records Room, Harbin, Heilongjiang, Peoples R China
Keywords
Chinese clinical texts; Corpus construction; Guideline development; Annotation method; Natural language processing; Named entity recognition; Information
DOI
10.1016/j.jbi.2017.04.006
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, with corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain. Materials and methods: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus. Results: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective. Discussion: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, the corpus has a number of limitations. Additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency. Conclusions: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain. © 2017 Published by Elsevier Inc.
Pages: 203-217
Number of pages: 15
Related papers
50 in total
  • [31] A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT)
    Campillos, Leonardo
    Deleger, Louise
    Grouin, Cyril
    Hamon, Thierry
    Ligozat, Anne-Laure
    Neveol, Aurelie
    LANGUAGE RESOURCES AND EVALUATION, 2018, 52 (02) : 571 - 601
  • [32] Semantic-Syntactic Word Valence Vectors for Building a Taxonomy
    Marchenko, Oleksandr
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, NLDB 2016, 2016, 9612 : 222 - 229
  • [33] Semantic term weighting for clinical texts
    Matsuo, Ryosuke
    Tu Bao Ho
    EXPERT SYSTEMS WITH APPLICATIONS, 2018, 114 : 543 - 551
  • [34] Syntactic-semantic tagging of medical texts: The multi-TALE project
    Office Line Engineering NV, Zonnegem, Belgium
    Unknown
    Unknown
    Stud. Health Technol. Informatics, (1-178):
  • [35] Developing a Syntactic and Semantic Annotation Tool for Research on Chinese Vocabulary
    Wang, Shan
    Liu, Xiaojun
    Zhou, Jie
    CHINESE LEXICAL SEMANTICS, CLSW 2021, PT II, 2022, 13250 : 272 - 294
  • [36] Brain mechanisms for syntactic and semantic processing by Chinese and English bilinguals
    Shieh, CC
    Luke, KK
    Tan, LH
    Wai, YY
    Wan, YL
    Liu, HL
    NEUROIMAGE, 2001, 13 (06) : S602 - S602
  • [37] Semantic Analysis and Automatic Corpus Construction for Entailment Recognition in Medical Texts
    Ben Abacha, Asma
    Duy Dinh
    Mrabet, Yassine
    ARTIFICIAL INTELLIGENCE IN MEDICINE (AIME 2015), 2015, 9105 : 238 - 242
  • [38] Building a Corpus of Manually Revised Texts from Discourse Perspective
    Iida, Ryu
    Tokunaga, Takenobu
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 936 - 941
  • [39] Exploiting Syntactic and Semantic Information in Coarse Chinese Question Classification
    Kang, Xin
    Wang, Xiaojie
    Ren, Fuji
    IEEE NLP-KE 2008: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2008, : 174 - +
  • [40] A Statistical Approach with Syntactic and Semantic Features for Chinese Textual Entailment
    Tu, Chun
    Day, Min-Yuh
    2012 IEEE 13TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2012, : 59 - 64