Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Cited by: 7
Authors
Hai-Long Trieu [1]
Duc-Vu Tran [1]
Ittoo, Ashwin [2]
Le-Minh Nguyen [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, Asahidai 1-1, Nomi, Ishikawa, Japan
[2] Univ Liege, QUANTOM Ctr Quantitat Methods & Operat Management, HEC Liege, Rue Louvrex 14, B-4000 Liege, Belgium
Keywords
Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages;
DOI
10.1145/3314936
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs studied here (Japanese, Indonesian, and Malay, each paired with Vietnamese) are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although the languages are widely used, there is no prior work on MT for these pairs, which hinders the development of MT for these languages. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these pairs, namely translation from Japanese, Indonesian, and Malay into Vietnamese. We propose an approach built on two strategies: building bilingual corpora from comparable data, and phrase pivot translation over existing bilingual corpora of these languages paired with English. Bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data for the low-resource languages. Additionally, we introduce a combined model over the additional resources as an effective way to improve MT for the Asian low-resource languages. Experimental results show the effectiveness of our systems, with improvements of +2 to +7 BLEU points. This work contributes to the development of MT for low-resource languages and opens a promising direction for MT on these Asian language pairs.
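The phrase pivot strategy mentioned in the abstract is commonly realized as phrase-table triangulation: a source-to-target phrase probability is obtained by summing over shared pivot (here English) phrases. The following is only a minimal sketch of that general idea, not the authors' implementation. The function name `triangulate`, the dictionary layout of the phrase tables, and all probabilities are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables through a shared pivot language.

    Each table maps a phrase to {translation: probability}.
    The result maps each source phrase to
    {target: sum over pivot phrases of p(pivot|src) * p(target|pivot)}.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot * p_tgt
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy Japanese-English and English-Vietnamese tables (made-up probabilities).
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = triangulate(ja_en, en_vi)
for target, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"{target}\t{prob:.2f}")
```

A real system would also need to handle lexical weights, pruning of low-probability entries, and pivot phrases that appear in only one table; those details are outside this sketch.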
Pages: 22
Related papers
50 in total
  • [1] Extremely low-resource neural machine translation for Asian languages
    Rubino, Raphael
    Marie, Benjamin
    Dabre, Raj
    Fujita, Atsushi
    Utiyama, Masao
    Sumita, Eiichiro
    MACHINE TRANSLATION, 2020, 34 (04) : 347 - 382
  • [2] Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
    Duh, Kevin
    McNamee, Paul
    Post, Matt
    Thompson, Brian
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2667 - 2675
  • [3] OCR Improves Machine Translation for Low-Resource Languages
    Ignat, Oana
    Maillard, Jean
    Chaudhary, Vishrav
    Guzman, Francisco
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1164 - 1174
  • [4] Neural Machine Translation for Low-resource Languages: A Survey
    Ranathunga, Surangika
    Lee, En-Shiun Annie
    Skenduli, Marjana Prifti
    Shekhar, Ravi
    Alam, Mehreen
    Kaur, Rishemjit
    ACM COMPUTING SURVEYS, 2023, 55 (11)
  • [5] Neighbors helping the poor: improving low-resource machine translation using related languages
    Pourdamghani, Nima
    Knight, Kevin
    MACHINE TRANSLATION, 2019, 33 (03) : 239 - 258
  • [6] Introduction to the second issue on machine translation for low-resource languages
    Liu, Chao-Hong
    Karakanta, Alina
    Tong, Audrey N.
    Aulov, Oleg
    Soboroff, Ian M.
    Washington, Jonathan
    Zhao, Xiaobing
    MACHINE TRANSLATION, 2021, 35 (01) : 1 - 2
  • [7] Machine Translation in Low-Resource Languages by an Adversarial Neural Network
    Sun, Mengtao
    Wang, Hao
    Pasquine, Mark
    Hameed, Ibrahim A.
    APPLIED SCIENCES-BASEL, 2021, 11 (22):
  • [8] Decoding Strategies for Improving Low-Resource Machine Translation
    Park, Chanjun
    Yang, Yeongwook
    Park, Kinam
    Lim, Heuiseok
    ELECTRONICS, 2020, 9 (10) : 1 - 15
  • [9] Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
    Przystupa, Michael
    Abdul-Mageed, Muhammad
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2, 2019, : 224 - 235
  • [10] Introduction to the Special Issue on Machine Translation for Low-Resource Languages
    Liu, Chao-Hong
    Karakanta, Alina
    Tong, Audrey N.
    Aulov, Oleg
    Soboroff, Ian M.
    Washington, Jonathan
    Zhao, Xiaobing
    MACHINE TRANSLATION, 2020, 34 (04) : 247 - 249