Pre-Training on Mixed Data for Low-Resource Neural Machine Translation

Cited by: 6
|
Authors
Zhang, Wenbo [1 ,2 ,3 ]
Li, Xiao [1 ,2 ,3 ]
Yang, Yating [1 ,2 ,3 ]
Dong, Rui [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
Keywords
neural machine translation; pre-training; low resource; word translation
DOI
10.3390/info12030133
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The pre-training and fine-tuning paradigm has been shown to be effective for low-resource neural machine translation. In this paradigm, models pre-trained on monolingual data are used to initialize translation models, transferring knowledge from the monolingual data into them. Recent pre-training models typically take sentences with randomly masked words as input and are trained to predict the masked words from the unmasked ones. In this paper, we propose a new pre-training method that still predicts masked words, but randomly replaces some of the unmasked words in the input with their translations in another language. The translations are drawn from bilingual data, so the pre-training data contains both monolingual and bilingual data. We evaluate our method on a Uyghur-Chinese corpus. The experimental results show that our method gives the pre-trained model better generalization ability and helps the translation model achieve better performance. Through a word translation task, we also demonstrate that our method enables the translation model's embeddings to acquire more alignment knowledge.
Pages: 10
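
The replacement scheme described in the abstract is a form of code-switched masked language modelling: some words are masked for prediction, and some of the remaining words are swapped for their translations so that a single pre-training input mixes two languages. Below is a minimal Python sketch of this data construction, assuming a toy word-level lexicon standing in for translation pairs extracted from bilingual data; the function name and the p_mask/p_replace values are illustrative assumptions, not the paper's implementation or settings.

```python
import random

# Toy bilingual lexicon; in the paper, translation pairs come from
# bilingual data (the entries here are illustrative only).
LEXICON = {"city": "城市", "river": "河流", "bank": "岸"}

MASK = "[MASK]"

def make_pretraining_example(tokens, lexicon, p_mask=0.15, p_replace=0.15, seed=None):
    """Build one (input, target) pair for mixed-data pre-training:
    mask some words for prediction, and replace some of the unmasked
    words with their translations from the bilingual lexicon."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        r = rng.random()
        if r < p_mask:
            inputs.append(MASK)
            targets.append(tok)          # the model must predict this word
        elif r < p_mask + p_replace and tok in lexicon:
            inputs.append(lexicon[tok])  # code-switch into the other language
            targets.append(None)         # replaced words are not predicted
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

if __name__ == "__main__":
    sentence = "the city lies on the north bank of the river".split()
    inputs, targets = make_pretraining_example(sentence, LEXICON, seed=3)
    print(inputs)   # mixed-language input containing [MASK] tokens
    print(targets)  # original words at masked positions, None elsewhere
```

Masked positions keep the original word as the prediction target, while replaced words carry no target, matching the abstract's description of predicting masked words from a partially translated context.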
Related Papers
50 records in total (entries [41]-[50] shown)
  • [41] Rubino, Raphael; Marie, Benjamin; Dabre, Raj; Fujita, Atsushi; Utiyama, Masao; Sumita, Eiichiro. Extremely low-resource neural machine translation for Asian languages. Machine Translation, 2020, 34(4): 347-382.
  • [42] Przystupa, Michael; Abdul-Mageed, Muhammad. Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation. Fourth Conference on Machine Translation (WMT 2019), Vol. 3: Shared Task Papers, Day 2, 2019: 224-235.
  • [43] Chowdhury, Koel Dutta; Hasanuzzaman, Mohammed; Liu, Qun. Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data. Deep Learning Approaches for Low-Resource Natural Language Processing (DeepLo), 2018: 33-42.
  • [44] Song, Haiyue; Dabre, Raj; Mao, Zhuoyuan; Cheng, Fei; Kurohashi, Sadao; Sumita, Eiichiro. Pre-training via Leveraging Assisting Languages for Neural Machine Translation. 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020): Student Research Workshop, 2020: 279-285.
  • [45] Yang, Zhen; Hu, Bojie; Han, Ambyera; Huang, Shen; Ju, Qi. CSP: Code-Switching Pre-training for Neural Machine Translation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020: 2624-2636.
  • [46] Pham, Khang; Nguyen, Long; Dinh, Dien. Exploring the Role of Monolingual Data in Cross-Attention Pre-training for Neural Machine Translation. Computational Collective Intelligence (ICCCI 2023), 2023, 14162: 179-190.
  • [47] Ramesh, Akshai; Uhana, Haque Usuf; Parthasarathy, Venkatesh Balavadhani; Haque, Rejwanul; Way, Andy. Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling. 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
  • [48] Kang, Yu; Liu, Tianqiao; Li, Hang; Hao, Yang; Ding, Wenbiao. Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data. Thirty-Sixth AAAI Conference on Artificial Intelligence / Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence / Twelfth Symposium on Educational Advances in Artificial Intelligence, 2022: 10875-10883.
  • [49] Haddow, Barry; Bawden, Rachel; Barone, Antonio Valerio Miceli; Helcl, Jindrich; Birch, Alexandra. Survey of Low-Resource Machine Translation. Computational Linguistics, 2022, 48(3): 673-732.
  • [50] Khayrallah, Huda; Thompson, Brian; Post, Matt; Koehn, Philipp. Simulated Multiple Reference Training Improves Low-Resource Machine Translation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020: 82-89.