Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Cited by: 7
Authors
Hai-Long Trieu [1]
Duc-Vu Tran [1]
Ashwin Ittoo [2]
Le-Minh Nguyen [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, Asahidai 1-1, Nomi, Ishikawa, Japan
[2] Univ Liege, QUANTOM Ctr Quantitat Methods & Operat Management, HEC Liege, Rue Louvrex 14, B-4000 Liege, Belgium
Keywords
Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages
DOI
10.1145/3314936
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Phrase-based machine translation (MT) systems require large bilingual corpora for training. However, such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs considered here, Japanese, Indonesian, and Malay paired with Vietnamese, are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, there is no prior work on MT for them, which further hinders the development of MT on these languages. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs: translation from Japanese, Indonesian, and Malay to Vietnamese. Our approach combines two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data for the low-resource languages. Additionally, we introduce a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show improvements of +2 to +7 BLEU points. This work contributes to the development of MT for low-resource languages and opens a promising direction for MT on Asian language pairs.
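The phrase pivot strategy mentioned in the abstract can be illustrated with a minimal sketch of phrase-table triangulation, a common formulation in which a source-to-target probability is obtained by summing over shared pivot (here, English) phrases: p(t|s) = Σ_e p(t|e) · p(e|s). The abstract does not state the exact formulation the authors used, so the function name `triangulate`, the dictionary layout, and the toy probabilities below are all assumptions made for illustration only.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine two phrase tables through a shared pivot language.

    src_pivot: {source_phrase: {pivot_phrase: p(pivot | source)}}
    pivot_tgt: {pivot_phrase: {target_phrase: p(target | pivot)}}
    Returns    {source_phrase: {target_phrase: p(target | source)}}
    (Hypothetical data layout; real phrase tables carry more scores.)
    """
    src_tgt = {}
    for src, pivots in src_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot_given_src * p_tgt_given_pivot
        if scores:
            src_tgt[src] = dict(scores)
    return src_tgt


# Toy Japanese-English and English-Vietnamese tables (invented values).
ja_en = {"本": {"book": 0.75, "origin": 0.25}}
en_vi = {"book": {"sách": 1.0}, "origin": {"nguồn gốc": 0.5, "gốc": 0.5}}

ja_vi = triangulate(ja_en, en_vi)
for tgt, prob in sorted(ja_vi["本"].items(), key=lambda kv: -kv[1]):
    print(f"本 -> {tgt}: {prob:.3f}")
```

In this sketch, source phrases whose pivot translations never appear in the second table are dropped, which is one reason pivot-derived tables are often combined with tables trained on directly aligned data.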
Pages: 22
Related papers
50 in total
  • [31] Optimization of data analysis models for low-resource Eurasian languages using machine translation
    Chen, Hongyan
    Yee, Kim Kyung
    INTERNET TECHNOLOGY LETTERS, 2024,
  • [32] Machine Translation into Low-resource Language Varieties
    Kumar, Sachin
    Anastasopoulos, Antonios
    Wintner, Shuly
    Tsvetkov, Yulia
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 110 - 121
  • [33] A Survey on Low-Resource Neural Machine Translation
    Wang, Rui
    Tan, Xu
    Luo, Renqian
    Qin, Tao
    Liu, Tie-Yan
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 4636 - 4643
  • [34] A Survey on Low-resource Neural Machine Translation
    Li H.-Z.
    Feng C.
    Huang H.-Y.
    Huang, He-Yan (hhy63@bit.edu.cn), 1600, Science Press (47): : 1217 - 1231
  • [35] Transformers for Low-resource Neural Machine Translation
    Gezmu, Andargachew Mekonnen
    Nuernberger, Andreas
    ICAART: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 1, 2022, : 459 - 466
  • [36] IMPROVING CAPTIONING FOR LOW-RESOURCE LANGUAGES BY CYCLE CONSISTENCY
    Wu, Yike
    Zhao, Shiwan
    Chen, Jia
    Zhang, Ying
    Yuan, Xiaojie
    Su, Zhong
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 362 - 367
  • [37] The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation
    Ahia, Orevaoghene
    Kreutzer, Julia
    Hooker, Sara
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3316 - 3333
  • [38] Lesan - Machine Translation for Low Resource Languages
    Hadgu, Asmelash Teka
    Aregawi, Abel
    Beaudoin, Adam
    NEURIPS 2021 COMPETITIONS AND DEMONSTRATIONS TRACK, VOL 176, 2021, 176 : 297 - +
  • [39] Neural Machine Translation for Low-Resource Languages from a Chinese-centric Perspective: A Survey
    Zhang, Jinyi
    Su, Ke
    Li, Haowei
    Mao, Jiannan
    Tian, Ye
    Wen, Feng
    Guo, Chong
    Matsumoto, Tadahiro
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (06)
  • [40] Statistical Machine Translation for Bilingually Low-Resource Scenarios: A Round-Tripping Approach
    Ahmadnia, Benyamin
    Haffari, Gholamreza
    Serrano, Javier
    2018 IEEE 5TH INTERNATIONAL CONGRESS ON INFORMATION SCIENCE AND TECHNOLOGY (IEEE CIST'18), 2018, : 261 - 265