CSP: Code-Switching Pre-training for Neural Machine Translation

Cited by: 0
Authors:
Yang, Zhen [1]
Hu, Bojie [1]
Han, Ambyera [1]
Huang, Shen [1]
Ju, Qi [1]
Affiliations:
[1] Tencent Minority-Mandarin Translation, Shenzhen, People's Republic of China
Keywords: (none listed)
DOI: not available
CLC Classification: TP18 [Theory of Artificial Intelligence]
Subject Classification: 081104; 0812; 0835; 1405
Abstract
This paper proposes a new pre-training method, called Code-Switching Pre-training (CSP for short), for Neural Machine Translation (NMT). Unlike traditional pre-training methods, which randomly mask fragments of the input sentence, CSP randomly replaces some words in the source sentence with their translations in the target language. Specifically, we first perform lexicon induction with unsupervised word-embedding mapping between the source and target languages, and then randomly replace some words in the input sentence with their translations according to the extracted translation lexicons. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragments of the input sentence. In this way, CSP pre-trains the NMT model by explicitly exploiting the cross-lingual alignment information extracted from the source and target monolingual corpora. Additionally, it alleviates the pre-training/fine-tuning discrepancy caused by artificial symbols such as [mask]. To verify the effectiveness of the proposed method, we conduct extensive experiments on unsupervised and supervised NMT. Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods.
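Since the abstract fully specifies the replacement step, a short sketch may make it concrete. Below is a minimal Python illustration of building one CSP training pair, assuming a tokenized source sentence and a source-to-target translation lexicon already induced by unsupervised embedding mapping; the function name make_csp_example, the replace_ratio value, and the toy English-German lexicon are illustrative assumptions, not details taken from the paper.

```python
import random

def make_csp_example(src_tokens, lexicon, replace_ratio=0.15, seed=None):
    """Build one CSP training pair from a monolingual source sentence.

    src_tokens    : list of source-language tokens
    lexicon       : dict mapping a source word to its induced translation
                    (assumed pre-extracted via unsupervised embedding mapping)
    replace_ratio : fraction of tokens to code-switch (hypothetical default)

    Returns (encoder_input, decoder_target): the code-switched sentence fed
    to the encoder, and the replaced source words the decoder must predict.
    """
    rng = random.Random(seed)
    # Only positions whose word has a lexicon entry can be code-switched.
    candidates = [i for i, w in enumerate(src_tokens) if w in lexicon]
    n_replace = max(1, int(round(replace_ratio * len(src_tokens))))
    chosen = set(rng.sample(candidates, min(n_replace, len(candidates)))) \
        if candidates else set()

    encoder_input, decoder_target = [], []
    for i, w in enumerate(src_tokens):
        if i in chosen:
            encoder_input.append(lexicon[w])  # target-language translation
            decoder_target.append(w)          # decoder recovers the original
        else:
            encoder_input.append(w)
    return encoder_input, decoder_target

# Toy English->German lexicon (illustrative entries only).
lexicon = {"house": "Haus", "green": "gruen", "cat": "Katze"}
enc_in, dec_out = make_csp_example(
    ["the", "green", "cat", "sits", "in", "the", "house"],
    lexicon, replace_ratio=0.3, seed=0)
print(enc_in)   # e.g. ['the', 'gruen', 'cat', 'sits', 'in', 'the', 'Haus']
print(dec_out)  # e.g. ['green', 'house']
```

A real pre-training pipeline would feed such (encoder input, decoder target) pairs into a standard encoder-decoder NMT model; the sketch only conveys the input/output relationship the abstract describes.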
Pages: 2624-2636 (13 pages)