A Collection of Comparable Corpora for Under-resourced Languages

Cited by: 6
Authors
Skadina, Inguna
Aker, Ahmet
Giouli, Voula
Tufis, Dan
Gaizauskas, Robert
Mierina, Madara
Mastropavlos, Nikos
Keywords
Comparable corpora; under-resourced languages; comparability; metadata; crawling; statistical machine translation
DOI
10.3233/978-1-60750-641-6-161
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of comparability in order to use them as a starting point in defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly in the case where translation is performed from or into under-resourced languages for which substantial parallel corpora are unavailable. The size of the collected corpora is about 1 million words for each under-resourced language.
Pages: 161-168
Page count: 8