Developing a corpus of clinical notes manually annotated for part-of-speech

Cited by: 19
Authors
Pakhomov, Serguei V. [1]
Coden, Anni
Chute, Christopher G.
Affiliations
[1] Mayo Coll Med, Div Biomed Informat, Rochester, MN 55905 USA
[2] IBM Corp, Thomas J Watson Res Ctr, Hawthorne, NY 10532 USA
Keywords
natural language processing; statistical part-of-speech tagging; domain adaptation; medical domain; text analysis; manual text annotation
DOI
10.1016/j.ijmedinf.2005.08.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.
Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general-purpose English text, and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.
Results: We used the Trigrams'n'Tags (TnT) tagger [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial for improving the correctness of POS tagging.
Conclusion: Our preliminary experimental results indicate the necessity of adapting state-of-the-art POS taggers to the sublanguage domain of clinical text. © 2005 Elsevier Ireland Ltd. All rights reserved.
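The two kinds of figures the abstract reports, tagger correctness and inter-rater agreement, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names are hypothetical, the tag sequences are made-up toy data, and Cohen's kappa is used here only as one common agreement measure, since the abstract does not say which measure the project used.

```python
from collections import Counter


def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators tagging the same tokens.

    Assumption: kappa is an illustrative choice of agreement measure,
    not necessarily the one used in the paper.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(tags_a)
    # Observed agreement: share of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: computed from each annotator's own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical tag throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (invented for illustration; Penn Treebank-style tags).
gold = ["NN", "VBZ", "DT", "NN"]
predicted = ["NN", "VBZ", "DT", "JJ"]
annotator_1 = ["NN", "NN", "VB", "VB"]
annotator_2 = ["NN", "NN", "VB", "NN"]

print(f"accuracy: {tagging_accuracy(gold, predicted):.2%}")
print(f"kappa:    {cohen_kappa(annotator_1, annotator_2):.2f}")
```

Per-token accuracy of this kind is what a comparison such as 89.79% versus 94.69% summarizes; kappa additionally discounts the agreement two annotators would reach by chance given how often each uses each tag.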
Pages: 418-429
Page count: 12