Annotating the MASC Corpus with BabelNet

Authors
Moro, Andrea [1]
Navigli, Roberto [1]
Tucci, Francesco Maria [1]
Passonneau, Rebecca J. [2]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Informat, I-00185 Rome, Italy
[2] Columbia Univ, Ctr Computat Learning Syst, New York, NY USA
Keywords
Semantic Annotation; Named Entities; Word Senses; Lexical Ambiguity; Semantic Network; Disambiguation;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303 ; 0501 ; 050102 ;
Abstract
In this paper we tackle the problem of automatically annotating, with both word senses and named entities, the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text. We use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory together with its semantic structure, to perform the aforementioned annotation task. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently, Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotations matches our own. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations using both manually tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entity and word sense annotations.
Pages: 4214-4219
Number of pages: 6