Annotating the MASC Corpus with BabelNet

Authors
Moro, Andrea [1]
Navigli, Roberto [1]
Tucci, Francesco Maria [1]
Passonneau, Rebecca J. [2]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Informat, I-00185 Rome, Italy
[2] Columbia Univ, Ctr Computat Learning Syst, New York, NY USA
Keywords
Semantic Annotation; Named Entities; Word Senses; Lexical Ambiguity; Semantic Network; Disambiguation
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Subject classification codes
030303 ; 0501 ; 050102 ;
Abstract
In this paper we tackle the problem of automatically annotating the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text, with both word senses and named entities. To perform this annotation we use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory, together with its semantic structure. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. We present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotation matches our own. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations against manually tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entity and word sense annotations.
Pages: 4214-4219
Number of pages: 6