self-learning;
domain adaptation;
quality estimation;
machine translation;
natural language processing;
DOI:
10.1109/NIGERCON54645.2022.9803137
CLC Classification Number:
TP39 [Applications of Computers];
Discipline Classification Code:
081203 ;
0835 ;
Abstract:
Monolingual data, being readily available in large quantities, has been used to augment the scarcely available parallel data in order to train better models for automatic translation. Self-learning, where a model is made to learn from its own output, is one approach to exploiting such data. However, it has been shown that too much of this data can be detrimental to the performance of the model when the available parallel data is comparatively very scarce. In this study, we investigate whether the monolingual data can also be too little and whether reducing it based on quality has any effect on the performance of the translation model. Experiments on low-resource English-German NMT have shown that it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than to utilize all of the available data.
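The selection strategy the abstract describes can be sketched as follows: score each synthetic sentence pair produced by self-learning and keep only the best-scoring fraction, rather than training on all of them. This is a minimal illustrative sketch, not the paper's actual method; the `quality_score` heuristic below is a hypothetical stand-in for a trained quality-estimation model, and the sample pairs are invented.

```python
def quality_score(source: str, translation: str) -> float:
    """Hypothetical QE proxy: a simple length-ratio heuristic.
    A real system would use a trained quality-estimation model."""
    src_len = len(source.split())
    tgt_len = len(translation.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)

def select_top_pairs(pairs, keep_fraction=0.5):
    """Rank synthetic pairs by quality and keep only the top fraction."""
    ranked = sorted(pairs, key=lambda p: quality_score(*p), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Invented synthetic pairs (e.g. from self-learning on monolingual data).
synthetic = [
    ("the cat sat on the mat", "die Katze sass auf der Matte"),
    ("hello", "hallo hallo hallo hallo hallo hallo"),
    ("good morning", "guten Morgen"),
    ("a b c d", "x"),
]

# Keep only the most useful half of the additional data.
selected = select_top_pairs(synthetic, keep_fraction=0.5)
```

Only the well-matched pairs survive the filter; degenerate outputs (repetitions, truncations) are dropped before the model is retrained on them.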