Enriching the Transfer Learning with Pre-Trained Lexicon Embedding for Low-Resource Neural Machine Translation

Cited by: 18
|
Authors
Maimaiti, Mieradilijiang [1 ]
Liu, Yang [1 ,2 ]
Luan, Huanbo [1 ]
Sun, Maosong [1 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing Natl Res Ctr Informat Sci & Technol, Inst Artificial Intelligence, Beijing 100084, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing Adv Innovat Ctr Language Resources, Beijing 100084, Peoples R China
Funding
National Research Foundation, Singapore; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
artificial intelligence; natural language processing; neural network; machine translation; low-resource languages; transfer learning;
DOI
10.26599/TST.2020.9010029
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Most State-Of-The-Art (SOTA) Neural Machine Translation (NMT) systems today achieve outstanding results based only on large parallel corpora, which are easily obtainable for high-resource languages. However, the translation quality of NMT for morphologically rich languages is still unsatisfactory, mainly because of the data sparsity problem encountered in Low-Resource Languages (LRLs). In the low-resource NMT paradigm, Transfer Learning (TL) has become one of the most effective methods. However, in the conventional TL setting it is difficult for the model trained on high-resource languages to carry information shared by both the parent and child models, because the initially trained model contains only the lexicon features and word embeddings of the parent languages rather than those of the child languages. In this work, we address this issue by proposing a language-independent Hybrid Transfer Learning (HTL) method for LRLs that shares lexicon embeddings between parent and child languages without leveraging back-translation or manually injecting noise. First, we train the parent model on the High-Resource Languages (HRLs) with their vocabularies. Then, we combine the parent and child language pairs using oversampling and train a hybrid model initialized with the previously trained parent model. Finally, we fine-tune the morphologically rich child model from the hybrid model. We also report several interesting findings about the original TL approach. Experimental results show that our model consistently outperforms five SOTA methods on two languages, Azerbaijani (Az) and Uzbek (Uz). Moreover, our approach is practical and significantly better, achieving improvements of up to 4.94 and 4.84 BLEU points for the low-resource child language pairs Az -> Zh and Uz -> Zh, respectively.
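The abstract outlines a three-stage training pipeline: parent model on the HRL pair, hybrid model on parent data plus oversampled child data, then fine-tuning on the child pair. The following Python sketch only illustrates that data flow under stated assumptions: train_nmt, the toy corpora, and the 1:1 oversampling target are hypothetical placeholders, not the authors' implementation or hyperparameters.

```python
import random

def oversample(child_pairs, target_size, seed=0):
    """Sample the small child corpus with replacement until it roughly
    matches the parent corpus size (assumed ratio; the paper may differ)."""
    rng = random.Random(seed)
    sampled = list(child_pairs)
    while len(sampled) < target_size:
        sampled.append(rng.choice(child_pairs))
    return sampled

def train_nmt(corpus, init_params=None):
    """Toy stand-in for an NMT trainer: the 'parameters' here are just a
    shared (sub)word vocabulary, to make the staged initialization visible."""
    vocab = set(init_params or set())
    for src, tgt in corpus:
        vocab.update(src.split())
        vocab.update(tgt.split())
    return vocab

# Hypothetical corpora: a high-resource parent pair and a low-resource child pair.
parent_corpus = [("parent source sentence", "target sentence")] * 1000
child_corpus = [("child source sentence", "target sentence")] * 50

# Stage 1: train the parent model on the high-resource pair and its vocabulary.
parent_model = train_nmt(parent_corpus)

# Stage 2: train the hybrid model on parent data plus the oversampled child
# data, initialized from the parent model (shared lexicon embedding).
hybrid_data = parent_corpus + oversample(child_corpus, len(parent_corpus))
hybrid_model = train_nmt(hybrid_data, init_params=parent_model)

# Stage 3: fine-tune on the child pair alone, starting from the hybrid model.
child_model = train_nmt(child_corpus, init_params=hybrid_model)
```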
Pages: 150-163
Number of pages: 14
Related Papers
50 records in total
  • [1] Enriching the Transfer Learning with Pre-Trained Lexicon Embedding for Low-Resource Neural Machine Translation
    Mieradilijiang Maimaiti
    Yang Liu
    Huanbo Luan
    Maosong Sun
    [J]. Tsinghua Science and Technology, 2022, 27 (01) : 150 - 163
  • [2] Hierarchical Transfer Learning Architecture for Low-Resource Neural Machine Translation
    Luo, Gongxu
    Yang, Yating
    Yuan, Yang
    Chen, Zhanheng
    Ainiwaer, Aizimaiti
    [J]. IEEE ACCESS, 2019, 7 : 154157 - 154166
  • [3] Reinforced Curriculum Learning on Pre-Trained Neural Machine Translation Models
    Zhao, Mingjun
    Wu, Haijiang
    Niu, Di
    Wang, Xiaoli
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 9652 - 9659
  • [4] A Joint Back-Translation and Transfer Learning Method for Low-Resource Neural Machine Translation
    Luo, Gong-Xu
    Yang, Ya-Ting
    Dong, Rui
    Chen, Yan-Hong
    Zhang, Wen-Bo
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2020, 2020
  • [5] Meta-Learning for Low-Resource Neural Machine Translation
    Gu, Jiatao
    Wang, Yong
    Chen, Yun
    Cho, Kyunghyun
    Li, Victor O. K.
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3622 - 3631
  • [6] DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
    Li, Yu
    Peng, Baolin
    He, Pengcheng
    Galley, Michel
    Yu, Zhou
    Gao, Jianfeng
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1368 - 1386
  • [7] Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
    Lee, En-Shiun Annie
    Thillainathan, Sarubi
    Nayak, Shravan
    Ranathunga, Surangika
    Adelani, David Ifeoluwa
    Su, Ruisi
    McCarthy, Arya D.
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 58 - 67
  • [8] Pashto poetry generation: deep learning with pre-trained transformers for low-resource languages
    Ullah, Imran
    Ullah, Khalil
    Khan, Hamad
    Aurangzeb, Khursheed
    Anwar, Muhammad Shahid
    Syed, Ikram
    [J]. PEERJ COMPUTER SCIENCE, 2024, 10
  • [9] Deep Fusing Pre-trained Models into Neural Machine Translation
    Weng, Rongxiang
    Yu, Heng
    Luo, Weihua
    Zhang, Min
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 11468 - 11476
  • [10] Investigating Unsupervised Neural Machine Translation for Low-resource Language Pair English-Mizo via Lexically Enhanced Pre-trained Language Models
    Lalrempuii, Candy
    Soni, Badal
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (08)