A Collection of Comparable Corpora for Under-resourced Languages

Cited by: 6

Authors:

Skadina, Inguna

Aker, Ahmet

Giouli, Voula

Tufis, Dan

Gaizauskas, Robert

Mierina, Madara

Mastropavlos, Nikos

Affiliations:

Source:

HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE | 2010 / Vol. 219

Keywords:

Comparable corpora; under-resourced languages; comparability; metadata; crawling; statistical machine translation

DOI:

10.3233/978-1-60750-641-6-161

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of comparability in order to use them as a starting point in defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly in the case where translation is performed from or into under-resourced languages for which substantial parallel corpora are unavailable. The size of the collected corpora is about 1 million words for each under-resourced language.

Pages: 161-168

Page count: 8

