Corpus annotation for mining biomedical events from literature

Cited by: 155
Authors
Kim, Jin-Dong [1]
Ohta, Tomoko [1]
Tsujii, Jun'ichi [1, 2, 3]
Affiliations
[1] Univ Tokyo, Sch Informat Sci & Technol, Dept Comp Sci, Tokyo, Japan
[2] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
[3] Univ Manchester, Natl Ctr Text Min, Manchester, Lancs, England
DOI
10.1186/1471-2105-9-10
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Advanced text mining (TM) tasks, such as semantic enrichment of papers, event or relation extraction, and intelligent question answering, have increasingly attracted attention in the biomedical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation.
Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (parts of speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design an annotation scheme that meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation that reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation.
Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (natural language processing)-based TM in the biomedical domain.
Pages: 25
Related papers
  • [1] Corpus annotation for mining biomedical events from literature
    Jin-Dong Kim
    Tomoko Ohta
    Jun'ichi Tsujii
    BMC Bioinformatics, 9
  • [2] Refining the extraction of relevant documents from biomedical literature to create a corpus for pathway text mining
    Harte, R
    Lu, Y
    Osborn, S
    Dehoney, D
    Chin, D
    PROCEEDINGS OF THE 2003 IEEE BIOINFORMATICS CONFERENCE, 2003: 644-645
  • [3] Examining the Effect of the Ratio of Biomedical Domain to General Domain Data in Corpus in Biomedical Literature Mining
    Zhang, Ziheng
    Han, Feng
    Zhang, Hongjian
    Aoki, Tomohiro
    Ogasawara, Katsuhiko
    APPLIED SCIENCES-BASEL, 2022, 12 (01)
  • [4] Inference Annotation of a Chinese Corpus for Opinion Mining
    Yan, Liyun
    Danni, E.
    Gan, Mei
    Grouin, Cyril
    Valette, Mathieu
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020: 4991-4999
  • [5] Biomedical literature mining
    Hu, Xiaohua
    PROCEEDINGS OF THE 7TH IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, VOLS I AND II, 2007: 1446
  • [6] Mining impactful discoveries from the biomedical literature
    Moreau, Erwan
    Hardiman, Orla
    Heverin, Mark
    O'Sullivan, Declan
    BMC BIOINFORMATICS, 2024, 25 (01)
  • [7] A survey on annotation tools for the biomedical literature
    Neves, Mariana
    Leser, Ulf
    BRIEFINGS IN BIOINFORMATICS, 2014, 15 (02): 327-340
  • [8] PALMER: improving pathway annotation based on the biomedical literature mining with a constrained latent block model
    Jin Hyun Nam
    Daniel Couch
    Willian A. da Silveira
    Zhenning Yu
    Dongjun Chung
    BMC Bioinformatics, 21
  • [9] PALMER: improving pathway annotation based on the biomedical literature mining with a constrained latent block model
    Nam, Jin Hyun
    Couch, Daniel
    da Silveira, Willian A.
    Yu, Zhenning
    Chung, Dongjun
    BMC BIOINFORMATICS, 2020, 21 (01)
  • [10] Annotation of a German Legal Decision Corpus for Argumentation Mining
    Kuhn, Florian
    LEGAL KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 279: 183-184