Human evaluation of web-crawled parallel corpora for machine translation

Citations: 0
Authors
Ramirez-Sanchez, Gema [1]
Banon, Marta [1]
Zaragoza-Bernabeu, Jaume [1]
Ortiz-Rojas, Sergio [1]
Affiliations
[1] Prompsit Language Engineering, Elche, Spain
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Quality assessment has been an ongoing activity across the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to obtain parallel data that is useful for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included in the project's quality assessment activity. We summarize the methods used to carry out these evaluation tasks on the web-crawled corpora produced, together with their results. We also review their advantages and disadvantages with respect to the final goal of the ParaCrawl project and the related ongoing project MaCoCu.
Pages: 32-41
Page count: 10