Transformer-based Machine Translation for Low-resourced Languages embedded with Language Identification

Cited by: 5
Authors
Sefara, Tshephisho J. [1]
Zwane, Skhumbuzo G. [2]
Gama, Nelisiwe [3]
Sibisi, Hlawulani [4]
Senoamadi, Phillemon N. [5]
Marivate, Vukosi [6]
Affiliations
[1] CSIR, Next Generat Enterprises & Inst, Pretoria, South Africa
[2] Univ Zululand, Dept Comp Sci, Richards Bay, South Africa
[3] Univ Witwatersrand, Sch Comp Sci & Appl Math, Johannesburg, South Africa
[4] Univ Johannesburg, Dept Comp Sci, Johannesburg, South Africa
[5] Univ Zululand, Dept Math, Richards Bay, South Africa
[6] Univ Pretoria, Dept Comp Sci, Pretoria, South Africa
Keywords
machine translation; low-resourced languages; neural network; language identification
DOI
10.1109/ICTAS50802.2021.9394996
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Recent research on the development of machine translation (MT) models has resulted in state-of-the-art performance for many well-resourced European languages. However, there has been little focus on applying these MT services to low-resourced languages. This paper presents the development of neural machine translation (NMT) for low-resourced languages of South Africa. Two MT models, JoeyNMT and a transformer NMT with self-attention, are trained and evaluated using the BLEU score. The transformer NMT with self-attention obtained state-of-the-art performance on isiNdebele, SiSwati, Setswana, Tshivenda, isiXhosa, and Sepedi, while JoeyNMT performed well on isiZulu. The MT models are embedded with a language identification (LID) model that presets the language for the translation models. The LID models are trained using logistic regression and multinomial naive Bayes (MNB). The MNB classifier obtained an accuracy of 99%, outperforming logistic regression, which obtained the lower accuracy of 97%.
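The abstract describes an LID model that presets the language before a translation model is chosen. The following is a minimal sketch of that idea, not the authors' implementation: it assumes character trigram counts as features (the abstract does not state the features used), trains a small multinomial naive Bayes classifier on four toy phrases, and routes the input to a placeholder translator. The names `NaiveBayesLID`, `route`, and `translators`, and the language codes `zul` (isiZulu) and `tsn` (Setswana), are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLID:
    """Multinomial naive Bayes over character n-gram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            denom = sum(grams.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / total_docs)
            for g in char_ngrams(text):
                score += math.log((grams[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training phrases for illustration only (assumption, not the paper's data).
train = [
    ("sawubona unjani", "zul"),
    ("ngiyabonga kakhulu", "zul"),
    ("dumela o kae", "tsn"),
    ("ke a leboga thata", "tsn"),
]
lid = NaiveBayesLID().fit([t for t, _ in train], [l for _, l in train])

# Hypothetical stand-ins for the per-language translation models.
translators = {
    "zul": lambda s: f"[zul->en] {s}",
    "tsn": lambda s: f"[tsn->en] {s}",
}


def route(text, lid, translators):
    """Identify the language first, then call the matching translator."""
    lang = lid.predict(text)
    return lang, translators[lang](text)
```

Calling `route("ke a leboga", lid, translators)` returns the predicted language code together with the placeholder translation. In a real system each entry in `translators` would wrap a trained NMT model for that language pair.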
Pages: 127-132
Number of pages: 6