Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model

Cited: 0
Authors
Nguyen, Xuan-Phi [1 ,3 ]
Joty, Shafiq [1 ,2 ]
Kui, Wu [3 ]
Aw, Ai Ti [3 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Salesforce Res, Palo Alto, CA USA
[3] A*STAR, Inst Infocomm Res (I2R), Singapore, Singapore
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Numerous recent works on unsupervised machine translation (UMT) imply that competent unsupervised translation of low-resource and unrelated languages, such as Nepali or Sinhala, is only possible if the model is trained in a massive multilingual environment, where these low-resource languages are mixed with high-resource counterparts. Nonetheless, while the high-resource languages greatly help kick-start the target low-resource translation tasks, the language discrepancy between them may hinder further improvement. In this work, we propose a simple refinement procedure to separate languages from a pre-trained multilingual UMT model so that it can focus on only the target low-resource task. Our method achieves the state of the art in the fully unsupervised translation tasks of English to Nepali, Sinhala, Gujarati, Latvian, Estonian and Kazakh, with BLEU score gains of 3.5, 3.5, 3.3, 4.1, 4.2, and 3.3, respectively. Our codebase is available at github.com/nxphi47/refine_unsup_multilingual_mt.
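The authors' actual refinement procedure is in the paper and the linked codebase. As a rough, hedged illustration of the general idea stated in the abstract, namely restricting a pre-trained multilingual UMT model to the single target language pair and continuing unsupervised training (e.g., one back-translation round) on that pair alone, a minimal self-contained Python sketch follows. All function names, the data layout, and the translate callable are assumptions made for illustration, not the authors' implementation.

# Minimal sketch, NOT the authors' method: keep only the target language pair
# from a multilingual monolingual corpus, then build synthetic parallel data
# via back-translation with an existing (pre-trained) translation callable.
from typing import Callable, Iterable, List, Tuple

def filter_to_pair(tagged_mono: Iterable[Tuple[str, str]],
                   keep_langs: set) -> List[Tuple[str, str]]:
    """Keep only monolingual sentences whose language tag is in keep_langs."""
    return [(lang, sent) for lang, sent in tagged_mono if lang in keep_langs]

def backtranslation_round(translate: Callable[[str, str, str], str],
                          mono: List[Tuple[str, str]],
                          src: str, tgt: str) -> List[Tuple[str, str]]:
    """Translate target-side monolingual text back into the source language,
    producing synthetic (source, target) pairs to train src -> tgt on."""
    pairs = []
    for lang, sent in mono:
        if lang == tgt:
            synthetic_src = translate(sent, tgt, src)  # hypothetical tgt -> src call
            pairs.append((synthetic_src, sent))
    return pairs

if __name__ == "__main__":
    # Dummy translator standing in for the pre-trained multilingual model.
    dummy_translate = lambda sent, src, tgt: f"<{tgt}> {sent}"
    mono = [("ne", "a Nepali sentence"), ("hi", "a Hindi sentence"),
            ("en", "an English sentence")]
    en_ne_only = filter_to_pair(mono, {"en", "ne"})  # drop the helper language
    print(backtranslation_round(dummy_translate, en_ne_only, "en", "ne"))

In practice the filtering and back-translation would run over the multilingual model's full monolingual corpora inside its training loop; refer to the repository above for the authors' procedure.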
Pages: 13