Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Cited by: 26
Authors
He, Bin [1]
Dong, Bin [2]
Guan, Yi [1]
Yang, Jinfeng [3]
Jiang, Zhipeng [1]
Yu, Qiubin [4]
Cheng, Jianyi [1]
Qu, Chunyan [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Heilongjiang, Peoples R China
[2] Ricoh Software Res Ctr Beijing, Beijing, Peoples R China
[3] Harbin Univ Sci & Technol, Sch Software, Harbin, Heilongjiang, Peoples R China
[4] Harbin Med Univ, Affiliated Hosp 2, Med Records Room, Harbin, Heilongjiang, Peoples R China
Keywords
Chinese clinical texts; Corpus construction; Guideline development; Annotation method; Natural language processing; Named entity recognition; Information
DOI
10.1016/j.jbi.2017.04.006
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Objective: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts, with corresponding annotation guidelines and methods, and to develop tools trained on the annotated corpus that supply baselines for research on Chinese texts in the clinical domain.
Materials and methods: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on the annotated corpus.
Results: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2,612 full parsing trees, while the semantic corpus includes 992 documents annotated with 39,511 entities, their assertions, and 7,693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
Discussion: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, the corpus has a number of limitations: additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be used to improve annotation efficiency.
Conclusions: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
(C) 2017 Published by Elsevier Inc.
Pages: 203-217
Page count: 15