QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

Authors
Otegi, Arantxa [1]
Aranberri, Nora [1]
Branco, Antonio [3]
Hajic, Jan [2]
Neale, Steven [3]
Osenova, Petya [4]
Pereira, Rita [3]
Popel, Martin [2]
Silva, Joao [3]
Simov, Kiril [4]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country, UPV EHU, IXA Grp, Leioa, Spain
[2] Charles Univ Prague, UFAL, Fac Math & Phys, Prague, Czech Republic
[3] Univ Lisbon, Lisbon, Portugal
[4] BAS, IICT, Sofia, Bulgaria
Keywords
annotated parallel corpora; named-entity disambiguation; word sense disambiguation; coreference; coreference resolution
DOI
Not available
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
This work presents parallel corpora automatically annotated with several NLP tools, covering lemmatization and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is common to all the parallel corpora, with translations in five languages: Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, and report annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.
Pages: 3023-3030
Page count: 8