A Data Augmentation Method for English-Vietnamese Neural Machine Translation

Cited by: 4
Authors
Pham, Nghia Luan [1 ]
Nguyen, Van Vinh [2 ]
Pham, Thang Viet [3 ]
Affiliations
[1] Hai Phong Univ, Lib & Informat Ctr, Haiphong, Vietnam
[2] Vietnam Natl Univ, Univ Engn & Technol, Fac Informat Technol, Hanoi, Vietnam
[3] VU, Univ Med Ctr, Amsterdam, Netherlands
Keywords
Machine translation; Data models; Internet; Grammar; Decoding; Symbols; Error correction; Data augmentation; Error analysis; back-translation; machine translation; Vietnamese grammatical error correction;
DOI
10.1109/ACCESS.2023.3252898
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
The translation quality of a machine translation system depends on the quantity and quality of the parallel corpus used for training. However, building a large, high-quality parallel corpus is complex and expensive, particularly for a specific domain. Data augmentation techniques are therefore widely used in machine translation. The back-translation method takes monolingual text as input, which is available from many sources, so it can be implemented easily and effectively to generate synthetic parallel data. In practice, monolingual text is collected from different sources, and text from websites often contains grammar and spelling errors, mismatched sentences, or freestyle writing. These flaws reduce the quality of the output translation, so back-translation generates a low-quality parallel corpus. In this study, we proposed a method for improving the quality of monolingual text for back-translation. Moreover, we supplemented the data by pruning the translation table. We experimented with English-Vietnamese neural machine translation using the IWSLT2015 dataset for training and testing in the legal domain. The results showed that the proposed method can effectively augment parallel data for machine translation, thereby improving translation quality. In our experimental cases, the BLEU score increased by 16.37 points compared with the baseline system.
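The back-translation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the keywords point to Vietnamese grammatical error correction, whereas the hypothetical `looks_clean` filter below simply discards noisy sentences. The function names, the thresholds, and the `translate_to_source` callable (a stand-in for a trained target-to-source model) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def looks_clean(sentence: str, min_tokens: int = 3, max_tokens: int = 80) -> bool:
    """Hypothetical filter: plausible length and mostly letters or spaces."""
    tokens = sentence.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Share of characters that are letters or whitespace (assumed threshold: 0.8).
    letters = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return letters / len(sentence) >= 0.8


def back_translate(
    target_sentences: Iterable[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic source, genuine target) pairs from monolingual target text."""
    pairs = []
    for sentence in target_sentences:
        sentence = sentence.strip()
        # Only sentences that pass the filter are translated back into the source language.
        if sentence and looks_clean(sentence):
            pairs.append((translate_to_source(sentence), sentence))
    return pairs
```

In a real pipeline, `translate_to_source` would wrap a Vietnamese-to-English model, and the resulting pairs would be added to the genuine parallel corpus before training the English-to-Vietnamese system.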
Pages: 28034-28044
Number of pages: 11
Related papers
50 in total
  • [1] Dictionaries for English-Vietnamese Machine Translation
    Hai, Le Manh
    Thanh, Nguyen Chanh
    Hieu, Nguyen Chi
    Tuoi, Phan Thi
    COMPUTER PROCESSING OF ORIENTAL LANGUAGES, PROCEEDINGS: BEYOND THE ORIENT: THE RESEARCH CHALLENGES AHEAD, 2006, 4285 : 363 - +
  • [2] Building an English-Vietnamese Bilingual Corpus for Machine Translation
    Quoc Hung Ngo
    Winiwarter, Werner
    2012 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2012), 2012, : 157 - 160
  • [3] An Enhanced Model for Lexical Gap Processing in English-Vietnamese Machine Translation
    Tuoi Phan Thi
    Hai Le Manh
    2012 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2012), 2012, : 105 - 108
  • [4] English-Vietnamese machine translation model based on sequence to sequence algorithm
    Jiang, Hao
    He, Yue
    Liao, Mengfan
    Jing, Yanmei
    Zhang, Chao
    PROCEEDINGS OF 2020 IEEE 5TH INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC 2020), 2020, : 1086 - 1091
  • [5] An Information Extraction approach to English-Vietnamese weather bulletins Machine Translation
    Son Bao Pham
    Giang Binh Tran
    Dang Duc Pham
    Kien Chi Phung
    Kien Trung Nguyen
    2009 FIRST ASIAN CONFERENCE ON INTELLIGENT INFORMATION AND DATABASE SYSTEMS, 2009, : 161 - +
  • [6] A Classifier-Based Preordering Approach for English-Vietnamese Statistical Machine Translation
    Viet Hong Tran
    Huyen Thuong Vu
    Vinh Van Nguyen
    Minh Le Nguyen
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, (CICLING 2016), PT II, 2018, 9624 : 74 - 87
  • [7] English-Vietnamese Machine Translation of Proper Names Error Analysis and Some Proposed Solutions
    Thi Thanh Thao Phan
    Thomas, Izabella
    TEXT, SPEECH AND DIALOGUE, TSD 2012, 2012, 7499 : 386 - 393
  • [8] Khmer-Vietnamese Neural Machine Translation Improvement Using Data Augmentation Strategies
    Quoc T.N.
    Thanh H.L.
    Van H.P.
    Informatica (Slovenia), 2023, 47 (03): : 349 - 360
  • [9] A Vietnamese-English Neural Machine Translation System
    Thien Hai Nguyen
    Nguyen, Tuan-Duy H.
    Duy Phung
    Duy Tran-Cong Nguyen
    Hieu Minh Tran
    Manh Luong
    Tin Duy Vo
    Hung Hai Bui
    Dinh Phung
    Dat Quoc Nguyen
    INTERSPEECH 2022, 2022, : 5543 - 5544
  • [10] mixSeq: A Simple Data Augmentation Method for Neural Machine Translation
    Wu, Xueqing
    Xia, Yingce
    Zhu, Jinhua
    Wu, Lijun
    Xie, Shufang
    Qin, Tao
    IWSLT 2021: THE 18TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION, 2021, : 192 - 197