Human evaluation of web-crawled parallel corpora for machine translation

Cited by: 0
Authors
Ramirez-Sanchez, Gema [1 ]
Banon, Marta [1 ]
Zaragoza-Bernabeu, Jaume [1 ]
Ortiz-Rojas, Sergio [1 ]
Affiliations
[1] Prompsit Language Engineering, Elche, Spain
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Quality assessment has been an ongoing activity throughout the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to obtain parallel data that is good for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included in the project's quality assessment activity. We summarize the methods used to carry out these evaluation tasks on the web-crawled corpora produced, along with their results. We review their advantages and disadvantages with respect to the final goal of the ParaCrawl project and the related ongoing project MaCoCu.
Pages: 32-41
Page count: 10