Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

Cited by: 0
Authors
Iyer, Vivek [1 ]
Oncevay, Arturo [1 ]
Birch, Alexandra [1 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh, Scotland
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and an inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), in which contextual, many-to-many word translations are generated using a 'base' NMT model. We conduct experiments on 3 different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the size of the data provided. Lastly, through ablation studies, we highlight the major code-switching aspects (including context, many-to-many substitutions, and code-switching language count) that contribute to the enhanced pretraining of multilingual NMT models.
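To make the noising objective concrete, below is a minimal Python sketch of the lexicon-based code-switched noising that the abstract describes as the baseline. The function name, substitution ratio, and toy English-Spanish lexicon are illustrative assumptions, not the authors' implementation; the paper's CCS approach would replace the one-to-one lexicon lookup with contextual, many-to-many translations produced by a 'base' NMT model.

# Hypothetical sketch of lexicon-based code-switched noising (the
# baseline the paper improves on). All names and the toy lexicon are
# illustrative assumptions, not the authors' code.
import random

# Toy English->Spanish lexicon: non-contextual, one-to-one mappings,
# as used by the lexicon-based approaches the paper critiques.
LEXICON = {
    "the": "el",
    "cat": "gato",
    "sat": "sentó",
    "on": "en",
    "mat": "estera",
}

def lexicon_code_switch(tokens, ratio=0.3, seed=0):
    """Replace a fraction of tokens with their lexicon translation.

    Non-contextual: a polysemous word always receives the same
    translation, and multi-word expressions are substituted word by
    word - exactly the noise sources CCS is designed to avoid.
    """
    rng = random.Random(seed)
    noised = []
    for tok in tokens:
        if tok in LEXICON and rng.random() < ratio:
            noised.append(LEXICON[tok])
        else:
            noised.append(tok)
    return noised

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    # Prints a partially code-switched sentence; the pretraining
    # objective is then to denoise it back to the original sentence.
    print(lexicon_code_switch(sent))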
Pages: 984-998
Page count: 15