Generalized Data Augmentation for Low-Resource Translation

Authors
Xia, Mengzhou [1]
Kong, Xiang [1]
Anastasopoulos, Antonios [1]
Neubig, Graham [1]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
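Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract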
Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language (HRL). Specifically, we experiment with a two-step pivoting method to convert high-resource data to the LRL, making use of available resources to better approximate the true data distribution of the LRL. First, we inject LRL words into HRL sentences through an induced bilingual dictionary. Second, we further edit these modified sentences using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to 1.5 to 8 BLEU points compared to supervised back-translation baselines.
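The first pivoting step described in the abstract (injecting LRL words into HRL sentences through a bilingual dictionary) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_lrl_words`, the whitespace tokenization, and the hand-written toy dictionary are all assumptions made for illustration, whereas the paper induces its dictionary automatically and follows this step with an unsupervised machine translation model.

```python
from typing import Dict, List


def inject_lrl_words(hrl_sentence: str, hrl_to_lrl: Dict[str, str]) -> str:
    """Replace each HRL token that has a dictionary entry with its LRL counterpart.

    Hypothetical helper: assumes whitespace tokenization and a one-to-one
    word dictionary, which is a simplification of an induced lexicon.
    """
    tokens: List[str] = hrl_sentence.split()
    # Tokens without a dictionary entry are copied through unchanged.
    return " ".join(hrl_to_lrl.get(tok, tok) for tok in tokens)


# Toy dictionary, written by hand for illustration only (Turkish -> Azerbaijani).
toy_dictionary = {"ben": "mən", "evet": "bəli"}

# "geldim" has no entry, so it stays in its HRL form.
pseudo_lrl = inject_lrl_words("evet ben geldim", toy_dictionary)
print(pseudo_lrl)
```

The output mixes substituted LRL words with untouched HRL words, which is the kind of partially converted sentence the second step is then meant to edit further.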
Pages: 5786-5796
Page count: 11