The neural machine translation models for the low-resource Kazakh-English language pair

Cited by: 7
Authors
Karyukin, Vladislav [1 ]
Rakhimova, Diana [1 ,2 ]
Karibayeva, Aidana [1 ]
Turganbayeva, Aliya [1 ]
Turarbek, Asem [1 ]
Affiliations
[1] Al Farabi Kazakh Natl Univ, Dept Informat Syst, Alma Ata, Kazakhstan
[2] Inst Informat & Computat Technol, Alma Ata, Kazakhstan
Keywords
Neural machine translation; Forward translation; Backward translation; Seq2Seq; RNN; BRNN; Transformer; OpenNMT; English; Kazakh;
DOI
10.7717/peerj-cs.1224
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
The development of the machine translation field was driven by people's need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. Neural machine translation has become one of the most significant approaches in recent years. It requires large parallel corpora, which are not available for low-resource languages such as Kazakh, making it difficult to achieve high performance with neural machine translation models. This article explores existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of Kazakh-English machine translation models; these methods are forward translation, backward translation, and transfer learning. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then considered for conducting experiments in training models on parallel corpora. The experimental part focuses on building translation models for high-quality translation of formal social, political, and scientific texts, using synthetic parallel sentences generated from existing monolingual Kazakh data with the forward translation approach and combining them with parallel corpora parsed from official government websites. The resulting corpus of 380,000 parallel Kazakh-English sentences is used to train the recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models of the OpenNMT framework. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are also analyzed. The RNN and BRNN models produced more precise translations than the Transformer model, and the Byte-Pair Encoding tokenization technique yielded better metric scores and translations than word tokenization. The BRNN model with Byte-Pair Encoding showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
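The subword tokenization and evaluation steps mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' exact OpenNMT pipeline: it assumes the sentencepiece, sacrebleu, and jiwer Python packages, a hypothetical vocabulary size, and hypothetical file names (corpus.en, hyps.en, refs.en).

    # Minimal sketch (not the authors' exact pipeline): Byte-Pair Encoding
    # subword tokenization and BLEU/WER/TER scoring. File names and the
    # vocabulary size are illustrative assumptions, not values from the paper.
    import sentencepiece as spm
    import sacrebleu
    import jiwer

    # 1) Train a BPE subword model on one side of the parallel corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.en",        # one sentence per line (hypothetical path)
        model_prefix="bpe_en",
        vocab_size=32000,         # illustrative size
        model_type="bpe",
    )
    bpe = spm.SentencePieceProcessor(model_file="bpe_en.model")
    print(bpe.encode("The parliament approved the budget.", out_type=str))

    # 2) Score model output against references with the metrics used in the paper.
    hyps = [line.strip() for line in open("hyps.en", encoding="utf-8")]
    refs = [line.strip() for line in open("refs.en", encoding="utf-8")]

    bleu = sacrebleu.corpus_bleu(hyps, [refs]).score / 100.0  # rescaled to 0-1
    ter = sacrebleu.corpus_ter(hyps, [refs]).score / 100.0
    wer = jiwer.wer(refs, hyps)                               # word error rate

    print(f"BLEU={bleu:.2f}  WER={wer:.2f}  TER={ter:.2f}")

In practice the BPE-segmented corpus would be fed to OpenNMT for training the RNN, BRNN, and Transformer models; the scoring step above only mirrors how the reported 0-1 metric values could be computed from system output and reference translations.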
Pages: 20
Related Papers (50 records in total)
  • [1] Improving Low-Resource Kazakh-English and Turkish-English Neural Machine Translation Using Transfer Learning and Part of Speech Tags
    Yazar, Bilge Kagan
    Kilic, Erdal
    IEEE ACCESS, 2025, 13 : 32341 - 32356
  • [2] Learning Word Alignment Models for Kazakh-English Machine Translation
    Kartbayev, Amandyk
    INTEGRATED UNCERTAINTY IN KNOWLEDGE MODELLING AND DECISION MAKING, IUKM 2015, 2015, 9376 : 326 - 335
  • [3] Improved neural machine translation for low-resource English-Assamese pair
    Laskar, Sahinur Rahman
    Khilji, Abdullah Faiz Ur Rahman
    Pakray, Partha
    Bandyopadhyay, Sivaji
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 42 (05) : 4727 - 4738
  • [4] Automatic Machine Translation of Poetry and a Low-Resource Language Pair
    Dunder, I.
    Seljan, S.
    Pavlovski, M.
    2020 43RD INTERNATIONAL CONVENTION ON INFORMATION, COMMUNICATION AND ELECTRONIC TECHNOLOGY (MIPRO 2020), 2020, : 1034 - 1039
  • [5] Investigating Unsupervised Neural Machine Translation for Low-resource Language Pair English-Mizo via Lexically Enhanced Pre-trained Language Models
    Lalrempuii, Candy
    Soni, Badal
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (08)
  • [6] Language Model Prior for Low-Resource Neural Machine Translation
    Baziotis, Christos
    Haddow, Barry
    Birch, Alexandra
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7622 - 7634
  • [7] Improving neural machine translation by integrating transliteration for low-resource English-Assamese language
    Nath, Basab
    Sarkar, Sunita
    Mukhopadhyay, Somnath
    Roy, Arindam
    NATURAL LANGUAGE PROCESSING, 2025, 31 (02): : 306 - 327
  • [8] The University of Maryland's Kazakh-English Neural Machine Translation System at WMT19
    Briakou, Eleftheria
    Carpuat, Marine
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), 2019, : 134 - 140
  • [9] Machine Translation into Low-resource Language Varieties
    Kumar, Sachin
    Anastasopoulos, Antonios
    Wintner, Shuly
    Tsvetkov, Yulia
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 110 - 121
  • [10] A Survey on Low-Resource Neural Machine Translation
    Wang, Rui
    Tan, Xu
    Luo, Renqian
    Qin, Tao
    Liu, Tie-Yan
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 4636 - 4643