A Collection of Comparable Corpora for Under-resourced Languages

Cited by: 6
Authors
Skadina, Inguna
Aker, Ahmet
Giouli, Voula
Tufis, Dan
Gaizauskas, Robert
Mierina, Madara
Mastropavlos, Nikos
Keywords
Comparable corpora; under-resourced languages; comparability; metadata; crawling; statistical machine translation
DOI
10.3233/978-1-60750-641-6-161
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres, with a similar level of comparability, to serve as a starting point for defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly when translating from or into under-resourced languages for which substantial parallel corpora are unavailable. The collected corpora contain about 1 million words for each under-resourced language.
Pages: 161-168
Page count: 8