Annotating the MASC Corpus with BabelNet

Authors
Moro, Andrea [1]
Navigli, Roberto [1]
Tucci, Francesco Maria [1]
Passonneau, Rebecca J. [2]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Informat, I-00185 Rome, Italy
[2] Columbia Univ, Ctr Computat Learning Syst, New York, NY USA
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
Semantic Annotation; Named Entities; Word Senses; Lexical Ambiguity; Semantic Network; Disambiguation;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification codes
030303 ; 0501 ; 050102 ;
Abstract
In this paper we tackle the problem of automatically annotating the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text, with both word senses and named entities. We use BabelNet 2.0, a multilingual semantic network that integrates lexicographic and encyclopedic knowledge, as our sense and entity inventory, and we exploit its semantic structure to perform the annotation. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently, Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotation matches our own. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations against manually tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entity and word sense annotations.
Pages: 4214-4219
Page count: 6