The neural machine translation models for the low-resource Kazakh-English language pair

Cited by: 7
Authors
Karyukin, Vladislav [1 ]
Rakhimova, Diana [1 ,2 ]
Karibayeva, Aidana [1 ]
Turganbayeva, Aliya [1 ]
Turarbek, Asem [1 ]
Affiliations
[1] Al Farabi Kazakh Natl Univ, Dept Informat Syst, Alma Ata, Kazakhstan
[2] Inst Informat & Computat Technol, Alma Ata, Kazakhstan
Keywords
Neural machine translation; Forward translation; Backward translation; Seq2Seq; RNN; BRNN; Transformer; OpenNMT; English; Kazakh;
DOI
10.7717/peerj-cs.1224
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The development of the machine translation field has been driven by people's need to communicate globally by automatically translating words, sentences, and texts from one language into another. Neural machine translation has become one of the most significant approaches in recent years. It requires large parallel corpora, which are not available for low-resource languages such as Kazakh, making it difficult to achieve high performance with neural machine translation models. This article explores existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of Kazakh-English machine translation models: forward translation, backward translation, and transfer learning. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then examined as the basis for experiments in training models on parallel corpora. The experimental part focuses on building models for high-quality translation of formal social, political, and scientific texts, generating synthetic parallel sentences from existing monolingual Kazakh data with the forward translation approach and combining them with parallel corpora parsed from official government websites. The resulting corpus of 380,000 parallel Kazakh-English sentences is used to train the recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models of the OpenNMT framework. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are analyzed. The RNN and BRNN models produced more accurate translations than the Transformer model, and Byte-Pair Encoding (BPE) tokenization yielded better metric scores and translations than word-level tokenization. The BRNN model with BPE showed the best performance: 0.49 BLEU, 0.51 WER, and 0.45 TER.
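The abstract rests on two reproducible mechanics: forward translation (machine-translating monolingual Kazakh text to synthesize the English side of new training pairs) and scoring with BLEU, WER, and TER. The Python sketch below is a minimal illustration under stated assumptions, not the authors' code: the file names (kk_en_model.pt, mono.kk, synthetic.en, pred.en, test.en) are hypothetical placeholders, it assumes OpenNMT-py's onmt_translate command plus the sacrebleu and jiwer packages, and CLI flags can vary across OpenNMT-py versions.

    import subprocess
    import sacrebleu
    from jiwer import wer

    # Forward translation: run an existing Kazakh->English model over
    # monolingual Kazakh text to produce a synthetic English side.
    # All file names here are hypothetical placeholders.
    subprocess.run(
        ["onmt_translate",
         "-model", "kk_en_model.pt",   # trained OpenNMT-py checkpoint
         "-src", "mono.kk",            # monolingual Kazakh sentences
         "-output", "synthetic.en"],   # synthetic English translations
        check=True,
    )
    # Pairing mono.kk with synthetic.en yields extra (kk, en) training pairs.

    # Evaluation on a held-out test set: system output vs. references.
    hyps = open("pred.en", encoding="utf-8").read().splitlines()
    refs = open("test.en", encoding="utf-8").read().splitlines()

    bleu = sacrebleu.corpus_bleu(hyps, [refs])  # reported on a 0-100 scale
    ter = sacrebleu.corpus_ter(hyps, [refs])    # also 0-100
    word_error = wer(refs, hyps)                # 0-1 scale

    print(f"BLEU {bleu.score / 100:.2f}  TER {ter.score / 100:.2f}  WER {word_error:.2f}")

Note that sacrebleu reports BLEU and TER on a 0-100 scale; dividing by 100 maps them onto the 0-1 scale used in the abstract, where the best model (BRNN with BPE) reaches 0.49 BLEU.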
Pages: 20
Related Papers (50 records)
  • [41] A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation
    Li, Yu
    Li, Xiao
    Yang, Yating
    Dong, Rui
    INFORMATION, 2020, 11 (05)
  • [42] Semantic Perception-Oriented Low-Resource Neural Machine Translation
    Wu, Nier
    Hou, Hongxu
    Li, Haoran
    Chang, Xin
    Jia, Xiaoning
    MACHINE TRANSLATION, CCMT 2021, 2021, 1464 : 51 - 62
  • [43] A Content Word Augmentation Method for Low-Resource Neural Machine Translation
    Li, Fuxue
    Zhao, Zhongchao
    Chi, Chuncheng
    Yan, Hong
    Zhang, Zhen
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT IV, 2023, 14089 : 720 - 731
  • [44] Understanding and Improving Low-Resource Neural Machine Translation with Shallow Features
    Sun, Yanming
    Liu, Xuebo
    Wong, Derek F.
    Lin, Yuchu
    Li, Bei
    Zhan, Runzhe
    Chao, Lidia S.
    Zhang, Min
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 227 - 239
  • [45] Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
    Duh, Kevin
    McNamee, Paul
    Post, Matt
    Thompson, Brian
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2667 - 2675
  • [46] Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings
    Kalimuthu, Marimuthu
    Barz, Michael
    Sonntag, Daniel
    FOURTH ARABIC NATURAL LANGUAGE PROCESSING WORKSHOP (WANLP 2019), 2019, : 1 - 10
  • [47] An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
    Mueller, Aaron
    Nicolai, Garrett
    McCarthy, Arya D.
    Lewis, Dylan
    Wu, Winston
    Yarowsky, David
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3710 - 3718
  • [48] Towards a Low-Resource Neural Machine Translation for Indigenous Languages in Canada
    Le, Ngoc Tan
    Sadat, Fatiha
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2021, 62 (03): 39 - 63
  • [49] Neural machine translation for low-resource languages without parallel corpora
    Karakanta, Alina
    Dehdari, Jon
    van Genabith, Josef
    MACHINE TRANSLATION, 2018, 32 (1-2) : 167 - 189
  • [50] Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation
    Unanue, I. J.
    Borzeshi, E. Z.
    Piccardi, M.
    IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2023, 4 (03): 450 - 463