Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

Cited by: 0
Authors
Iyer, Vivek [1 ]
Oncevay, Arturo [1 ]
Birch, Alexandra [1 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh, Scotland
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and an inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), in which contextual, many-to-many word translations are generated using a 'base' NMT model. We conduct experiments on 3 different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the size of the data provided. Lastly, through ablation studies, we highlight the major code-switching aspects (including context, many-to-many substitutions, and code-switching language count) that contribute to the enhanced pretraining of multilingual NMT models.
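To make the noising objective concrete, below is a minimal Python sketch of the lexicon-based code-switched noising that the abstract describes as the baseline. The function name, substitution ratio, and toy English-Spanish lexicon are illustrative assumptions, not the authors' implementation; the paper's CCS approach would replace the one-to-one lexicon lookup with contextual, many-to-many translations produced by a 'base' NMT model.

# Hypothetical sketch of lexicon-based code-switched noising (the
# baseline the paper improves on). All names and the toy lexicon are
# illustrative assumptions, not the authors' code.
import random

# Toy English->Spanish lexicon: non-contextual, one-to-one mappings,
# as used by the lexicon-based approaches the paper critiques.
LEXICON = {
    "the": "el",
    "cat": "gato",
    "sat": "sentó",
    "on": "en",
    "mat": "estera",
}

def lexicon_code_switch(tokens, ratio=0.3, seed=0):
    """Replace a fraction of tokens with their lexicon translation.

    Non-contextual: a polysemous word always receives the same
    translation, and multi-word expressions are substituted word by
    word - exactly the noise sources CCS is designed to avoid.
    """
    rng = random.Random(seed)
    noised = []
    for tok in tokens:
        if tok in LEXICON and rng.random() < ratio:
            noised.append(LEXICON[tok])
        else:
            noised.append(tok)
    return noised

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    # Prints a partially code-switched sentence; the pretraining
    # objective is then to denoise it back to the original sentence.
    print(lexicon_code_switch(sent))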
Pages: 984-998
Page count: 15