A Data Augmentation Method for English-Vietnamese Neural Machine Translation

Cited by: 4
Authors
Pham, Nghia Luan [1]
Nguyen, Van Vinh [2]
Pham, Thang Viet [3]
Affiliations
[1] Hai Phong Univ, Lib & Informat Ctr, Haiphong, Vietnam
[2] Vietnam Natl Univ, Univ Engn & Technol, Fac Informat Technol, Hanoi, Vietnam
[3] VU, Univ Med Ctr, Amsterdam, Netherlands
Keywords
Machine translation; Data models; Internet; Grammar; Decoding; Symbols; Error correction; Data augmentation; Error analysis; Back-translation; Vietnamese grammatical error correction
DOI
10.1109/ACCESS.2023.3252898
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The translation quality of a machine translation system depends on the quantity and quality of the parallel corpus used for training. However, building a large, high-quality parallel corpus is complex and expensive, particularly for a specific domain. Data augmentation techniques are therefore widely used in machine translation. Back-translation takes monolingual text as input; because such text is available from many sources, the method can be implemented easily and effectively to generate synthetic parallel data. In practice, however, monolingual text collected from websites often contains grammar and spelling errors, mismatched sentences, or free-style writing. These problems reduce the quality of the output translation and lead to a low-quality parallel corpus generated by back-translation. In this study, we propose a method for improving the quality of monolingual texts for back-translation. Moreover, we supplement the data by pruning the translation table. We experimented with English-Vietnamese neural machine translation, using the IWSLT2015 dataset for training and testing in the legal domain. The results show that the proposed method can effectively augment parallel data for machine translation, thereby improving translation quality. In our experiments, the BLEU score increased by 16.37 points over the baseline system.
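The back-translation idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `looks_clean` is a hypothetical heuristic standing in for the paper's monolingual-text improvement step (whose details are not given here), and `target_to_source` is an assumed callable wrapping whatever target-to-source translation model is available.

```python
from typing import Callable, Iterable, List, Tuple


def looks_clean(sentence: str, min_tokens: int = 3, max_tokens: int = 80) -> bool:
    """Hypothetical noise filter (an assumption, not the paper's method)."""
    tokens = sentence.split()
    # Reject fragments and run-on lines by whitespace token count.
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Reject lines dominated by symbols, digits, or markup residue.
    letters = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return letters / len(sentence) >= 0.8


def back_translate(
    target_sentences: Iterable[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic_source, authentic_target) pairs from monolingual text."""
    pairs = []
    for sentence in target_sentences:
        sentence = sentence.strip()
        # Only sentences that pass the filter are translated and kept.
        if sentence and looks_clean(sentence):
            pairs.append((target_to_source(sentence), sentence))
    return pairs
```

The resulting pairs would typically be mixed with the authentic parallel corpus before training the source-to-target model. For a quick check, any function from string to string (for example `str.upper`) can stand in for the translation model.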
Pages: 28034-28044
Number of pages: 11
Related papers
50 in total
  • [21] A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation
    Linh The Nguyen
    Nguyen Luong Tran
    Long Doan
    Manh Luong
    Dat Quoc Nguyen
    INTERSPEECH 2022, 2022, : 1726 - 1730
  • [22] STA: An efficient data augmentation method for low-resource neural machine translation
    Li, Fuxue
    Chi, Chuncheng
    Yan, Hong
    Liu, Beibei
    Shao, Mingzhi
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (01) : 121 - 132
  • [23] A Bilingual Templates Data Augmentation Method for Low-Resource Neural Machine Translation
    Li, Fuxue
    Liu, Beibei
    Yan, Hong
    Shao, Mingzhi
    Xie, Peijun
    Li, Jiarui
    Chi, Chuncheng
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14877 : 40 - 51
  • [24] SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
    Wang, Xinyi
    Pham, Hieu
    Dai, Zihang
    Neubig, Graham
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 856 - 861
  • [25] Syntax-Aware Data Augmentation for Neural Machine Translation
    Duan, Sufeng
    Zhao, Hai
    Zhang, Dongdong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2988 - 2999
  • [26] Data Augmentation Under Scarce Condition for Neural Machine Translation
    Luo, Dan
    Shi, Shumin
    Su, Rihai
    Huang, Heyan
    PROCEEDINGS OF 2019 6TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2019, : 36 - 40
  • [27] Data Augmentation for Low-Resource Neural Machine Translation
    Fadaee, Marzieh
    Bisazza, Arianna
    Monz, Christof
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 567 - 573
  • [28] Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation
    Kondo, Seiichiro
    Hotate, Kengo
    Hirasawa, Tosho
    Kaneko, Masahiro
    Komachi, Mamoru
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 143 - 149
  • [29] CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation
    Kambhatla, Nishant
    Born, Logan
    Sarkar, Anoop
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 201 - 218
  • [30] Robust Data Augmentation for Neural Machine Translation through EVALNET
    Park, Yo-Han
    Choi, Yong-Seok
    Yun, Seung
    Kim, Sang-Hun
    Lee, Kong-Joo
    MATHEMATICS, 2023, 11 (01)