A Data Augmentation Method for English-Vietnamese Neural Machine Translation

Cited by: 4
Authors
Pham, Nghia Luan [1]
Nguyen, Van Vinh [2]
Pham, Thang Viet [3]
Affiliations
[1] Hai Phong Univ, Lib & Informat Ctr, Haiphong, Vietnam
[2] Vietnam Natl Univ, Univ Engn & Technol, Fac Informat Technol, Hanoi, Vietnam
[3] VU, Univ Med Ctr, Amsterdam, Netherlands
Keywords
Machine translation; Data models; Internet; Grammar; Decoding; Symbols; Error correction; Data augmentation; Error analysis; Back-translation; Vietnamese grammatical error correction
DOI
10.1109/ACCESS.2023.3252898
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The translation quality of a machine translation system depends on both the quantity and the quality of the parallel corpus used for training. However, building a large, high-quality parallel corpus is complex and expensive, particularly for a domain-specific parallel corpus. Data augmentation techniques are therefore widely used in machine translation. One such technique, back-translation, takes monolingual text as input; because monolingual text is available from many sources, the method can be implemented easily and effectively to generate synthetic parallel data. In practice, however, monolingual text collected from websites often contains grammar and spelling errors, mismatched sentences, or informal free-style writing. These problems reduce the quality of the output translation, so back-translation generates a low-quality parallel corpus. In this study, we propose a method for improving the quality of monolingual texts before back-translation. In addition, we supplement the data by pruning the translation table. We experimented with an English-Vietnamese neural machine translation system using the IWSLT2015 dataset for training and testing in the legal domain. The results show that the proposed method can effectively augment parallel data for machine translation, thereby improving translation quality. In our experimental cases, the BLEU score increased by 16.37 points compared with the baseline system.
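The back-translation workflow described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `looks_clean` and `back_translate` are hypothetical, `translate_to_source` stands in for any trained target-to-source model, and the simple filter heuristic replaces the paper's text-improvement step (which, according to the keywords, involves Vietnamese grammatical error correction). Translation-table pruning is not shown.

```python
from typing import Callable, List, Tuple


def looks_clean(sentence: str, min_tokens: int = 3, max_tokens: int = 80) -> bool:
    """Hypothetical quality heuristic (an assumption, not the paper's method):
    keep sentences of moderate length made up mostly of letters and spaces."""
    if not sentence:
        return False
    tokens = sentence.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    letters = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return letters / len(sentence) >= 0.8


def back_translate(
    target_sentences: List[str],
    translate_to_source: Callable[[str], str],
    keep: Callable[[str], bool] = looks_clean,
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target-side text.

    Noisy sentences are dropped before translation so that they do not
    produce low-quality synthetic pairs.
    """
    pairs = []
    for sentence in target_sentences:
        sentence = " ".join(sentence.split())  # collapse runs of whitespace
        if not keep(sentence):
            continue
        synthetic_source = translate_to_source(sentence)
        pairs.append((synthetic_source, sentence))
    return pairs
```

In use, the returned pairs would be appended to the genuine parallel corpus before training the source-to-target model. Filtering is only the simplest possible stand-in here; correcting noisy sentences rather than discarding them would preserve more data.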
Pages: 28034-28044
Number of pages: 11
Related papers
50 items in total
  • [31] Projecting dependency syntax labels from english into Vietnamese in English-Vietnamese bilingual corpus
    Tran P.
    Duong V.-D.
    Dinh D.
    Vo B.
    Nguyen H.
    Nguyen L.H.B.
    International Journal of Intelligent Information and Database Systems, 2020, 13 (01) : 17 - 32
  • [32] Addressing data scarcity issue for English-Mizo neural machine translation using data augmentation and language model
    Khenglawt, Vanlalmuansangi
    Laskar, Sahinur Rahman
    Pakray, Partha
    Khan, Ajoy Kumar
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (03) : 6313 - 6323
  • [33] Phrase-Based Compressive Summarization for English-Vietnamese
    Tung Le
    Le-Minh Nguyen
    Shimazu, Akira
    Dinh Dien
    INTEGRATED UNCERTAINTY IN KNOWLEDGE MODELLING AND DECISION MAKING, IUKM 2016, 2016, 9978 : 331 - 342
  • [34] Pronouncing English-Vietnamese dictionary Vietnamese-English dictionary - Kong,LB, Khanh,LB
    Caggiano, A
    Charles, J
    Mosley, S
    LIBRARY JOURNAL, 1996, 121 (07) : 78 - 78
  • [35] A Language-Driven Data Augmentation Method for Mongolian-Chinese Neural Machine Translation
    Wei, Xuerong
    Ren, Qing-Dao-Er-Ji
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 297 - 302
  • [36] An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation
    Thi-Vinh Ngo
    Phuong-Thai Nguyen
    Van Vinh Nguyen
    Thanh-Le Ha
    Le-Minh Nguyen
    APPLIED ARTIFICIAL INTELLIGENCE, 2022, 36 (01)
  • [37] Rule based English-Vietnamese bilingual terminology extraction from Vietnamese documents
    Ha Nguyen Tien
    Quyen Ngo The
    Huyen Nguyen Thi Minh
    Linh Ha My
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 56 - 62
  • [38] On the scalability of data augmentation techniques for low-resource machine translation between Chinese and Vietnamese
    Vu, Huan
    Bui, Ngoc Dung
    JOURNAL OF INFORMATION AND TELECOMMUNICATION, 2023, 7 (02) : 241 - 253
  • [39] Paraphrase Based Data Augmentation For Chinese-English Medical Machine Translation
    An Bo
    Long Congjun
    JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2022, 44 (01) : 118 - 126
  • [40] A Mongolian-Chinese Neural Machine Translation Method Based on Semantic-Context Data Augmentation
    Zhang, Huinuan
    Ji, Yatu
    Wu, Nier
    Lu, Min
    APPLIED SCIENCES-BASEL, 2024, 14 (08):