Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Cited by: 7
Authors
Hai-Long Trieu [1]
Duc-Vu Tran [1]
Ashwin Ittoo [2]
Le-Minh Nguyen [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, Asahidai 1-1, Nomi, Ishikawa, Japan
[2] Univ Liege, QUANTOM Ctr Quantitat Methods & Operat Management, HEC Liege, Rue Louvrex 14, B-4000 Liege, Belgium
Keywords
Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages
DOI
10.1145/3314936
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs of Japanese, Indonesian, and Malay with Vietnamese are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, there is no prior work on MT for these pairs, which further hinders progress. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs: translation from Japanese, Indonesian, and Malay to Vietnamese. We propose an approach built on two strategies: building bilingual corpora from comparable data, and phrase pivot translation over existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to enhance the bilingual data for the low-resource languages. Additionally, we introduce a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show that our systems improve translation quality by +2 to +7 BLEU points. This work contributes to the development of MT on low-resource languages and opens a promising direction for MT on Asian language pairs.
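The abstract names phrase pivot translation through English but does not spell out the computation. As a rough illustration only, the sketch below shows the commonly used triangulation idea, in which a source-to-target phrase probability is estimated by summing over shared pivot phrases. The function name `triangulate`, the dictionary layout, and the toy probabilities are all assumptions made for this example; they are not taken from the article and may differ from the authors' actual method.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine a source->pivot and a pivot->target phrase table.

    Each table maps a phrase to {translation: probability}.
    Assumed estimate: p(t | s) = sum over pivot phrases e of p(e | s) * p(t | e).
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, prob_src_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, prob_pivot_tgt in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += prob_src_pivot * prob_pivot_tgt
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Hypothetical toy tables: Japanese->English and English->Vietnamese.
ja_en = {"猫": {"cat": 0.75, "kitty": 0.25}}
en_vi = {
    "cat": {"con mèo": 1.0},
    "kitty": {"mèo con": 0.5, "con mèo": 0.5},
}

ja_vi = triangulate(ja_en, en_vi)
print(ja_vi)
```

In this toy case both English pivot phrases lead to "con mèo", so their contributions add up, while "mèo con" is reachable only through "kitty". A real system would also need to handle lexical weights and pruning, which this sketch leaves out.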
Pages: 22
Related papers
50 in total
  • [21] Improving neural machine translation for low-resource Indian languages using rule-based feature extraction
    Muskaan Singh
    Ravinder Kumar
    Inderveer Chana
    Neural Computing and Applications, 2021, 33 : 1103 - 1122
  • [22] Multilingual neural machine translation for low-resource languages by twinning important nodes
    Qorbani, Abouzar
    Ramezani, Reza
    Baraani, Ahmad
    Kazemi, Arefeh
    NEUROCOMPUTING, 2025, 634
  • [23] Survey of Low-Resource Machine Translation
    Haddow, Barry
    Bawden, Rachel
    Barone, Antonio Valerio Miceli
    Helcl, Jindrich
    Birch, Alexandra
    COMPUTATIONAL LINGUISTICS, 2022, 48 (03) : 673 - 732
  • [24] Improving neural machine translation for low-resource Indian languages using rule-based feature extraction
    Singh, Muskaan
    Kumar, Ravinder
    Chana, Inderveer
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (04) : 1103 - 1122
  • [25] DRA: dynamic routing attention for neural machine translation with low-resource languages
    Wang, Zhenhan
    Song, Ran
    Yu, Zhengtao
    Mao, Cunli
    Gao, Shengxiang
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024
  • [26] Understanding and Improving Low-Resource Neural Machine Translation with Shallow Features
    Sun, Yanming
    Liu, Xuebo
    Wong, Derek F.
    Lin, Yuchu
    Li, Bei
    Zhan, Runzhe
    Chao, Lidia S.
    Zhang, Min
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 227 - 239
  • [27] Translation Memories as Baselines for Low-Resource Machine Translation
    Knowles, Rebecca
    Littell, Patrick
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022 : 6759 - 6767
  • [28] Morpheme-Based Neural Machine Translation Models for Low-Resource Fusion Languages
    Gezmu, Andargachew Mekonnen
    Nuernberger, Andreas
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (09)
  • [29] Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation
    Elbayad, Maha
    Sun, Anna
    Bhosale, Shruti
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023 : 14237 - 14253
  • [30] Neural machine translation of low-resource languages using SMT phrase pair injection
    Sen, Sukanta
    Hasanuzzaman, Mohammed
    Ekbal, Asif
    Bhattacharyya, Pushpak
    Way, Andy
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (03) : 271 - 292