Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus

Authors
Li, Xuansong [1]
Palmer, Martha
Xue, Nianwen
Ramshaw, Lance
Maamouri, Mohamed
Bies, Ann
Conger, Kathryn Summerville
Grimes, Stephen
Strassel, Stephanie
Affiliations
[1] Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA
Keywords
machine translation; parallel aligned Treebank; word alignment; PropBank; co-reference
DOI
Not available
Chinese Library Classification
H [Language, Writing]
Discipline classification code
05
Abstract
High accuracy in automated translation and information retrieval calls for linguistic annotation at various language levels. The plethora of informal internet content has sparked demand for porting state-of-the-art natural language processing (NLP) applications to new social media and for adapting them to diverse languages. An effort launched by the BOLT (Broad Operational Language Translation) program at DARPA (Defense Advanced Research Projects Agency) has successfully addressed this internet content with enhanced NLP systems. BOLT aims for automated translation and linguistic analysis of informal genres of text and speech in online and in-person communication. As part of this program, the Linguistic Data Consortium (LDC) developed valuable linguistic resources in support of the training and evaluation of such new technologies. This paper focuses on the methodologies, infrastructure, and procedures for developing linguistic annotation at various language levels, including Treebank (TB), word alignment (WA), PropBank (PB), and co-reference (CoRef). Inspired by the OntoNotes approach, with adaptations to the tasks to reflect the goals and scope of the BOLT project, this effort has introduced more annotation types for informal and free-style genres in English, Chinese and Egyptian Arabic. The corpus produced is by far the largest multi-lingual, multi-level and multi-genre annotation corpus of informal text and speech.
Pages: 906-913
Page count: 8