Developing a corpus of clinical notes manually annotated for part-of-speech

Cited by: 19

Authors:

Pakhomov, Serguei V. [1]

Coden, Anni

Chute, Christopher G.

Affiliations:

[1] Mayo Coll Med, Div Biomed Informat, Rochester, MN 55905 USA

[2] IBM Corp, Thomas J Watson Res Ctr, Hawthorne, NY 10532 USA

Source:

INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS | 2006 / Vol. 75 / Issue 06

Keywords:

natural language processing; statistical part-of-speech tagging; domain adaptation; medical domain; text analysis; manual text annotation;

DOI:

10.1016/j.ijmedinf.2005.08.006

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.

Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general-purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.

Results: We used the Trigrams'n'Tags (TnT) [T. Brants, TnT-a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial for improving the correctness of POS tagging.

Conclusion: Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text. (c) 2005 Elsevier Ireland Ltd. All rights reserved.
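To put the two correctness figures in the abstract side by side, the short Python sketch below computes token-level tagging accuracy and the relative error reduction between a baseline and an adapted tagger. It is only an illustration: the function names and the toy tag sequences are assumptions made here, not the authors' evaluation code or data, and it assumes that "correctness" means the share of tokens tagged the same as the gold standard.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def relative_error_reduction(baseline_acc, adapted_acc):
    """Share of the baseline tagger's errors removed by the adapted tagger."""
    baseline_err = 1.0 - baseline_acc
    adapted_err = 1.0 - adapted_acc
    return (baseline_err - adapted_err) / baseline_err


# Toy, made-up tag sequences (hypothetical; not data from the paper).
gold = ["NN", "VBZ", "JJ", "NN"]
pred = ["NN", "VBZ", "NN", "NN"]
toy_accuracy = tagging_accuracy(gold, pred)  # 3 of 4 tags match

# Accuracies reported in the abstract: 89.79% (general English training data)
# and 94.69% (training data that includes the annotated clinical notes).
reduction = relative_error_reduction(0.8979, 0.9469)

print(f"toy accuracy: {toy_accuracy:.2f}")
print(f"relative error reduction: {reduction:.1%}")
```

Under these assumptions, moving from 89.79% to 94.69% lowers the error rate from 10.21% to 5.31%, which removes roughly 48% of the baseline tagger's errors.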


Pages: 418-429

Page count: 12
