Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Cited by: 7
Authors
Hai-Long Trieu [1]
Duc-Vu Tran [1]
Ashwin Ittoo [2]
Le-Minh Nguyen [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, Asahidai 1-1, Nomi, Ishikawa, Japan
[2] Univ Liege, QUANTOM Ctr Quantitat Methods & Operat Management, HEC Liege, Rue Louvrex 14, B-4000 Liege, Belgium
Keywords
Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages
DOI
10.1145/3314936
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Phrase-based machine translation (MT) systems require large bilingual corpora for training. However, such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs considered here, Japanese, Indonesian, and Malay paired with Vietnamese, are no exception: no large bilingual corpora exist for them. Furthermore, although these languages are widely used, there is no prior work on MT for these pairs, which further hinders its development. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs, translating from Japanese, Indonesian, and Malay to Vietnamese. Our approach combines two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora of these languages paired with English. Bilingual corpora were built from Wikipedia bilingual titles to enhance the bilingual data for the low-resource languages. Additionally, we introduce a model that combines the additional resources to provide an effective solution for improving MT on the Asian low-resource languages. Experimental results show the effectiveness of our systems, with improvements of +2 to +7 BLEU points. This work contributes to the development of MT for low-resource languages and opens a promising direction for progress in MT on these Asian language pairs.
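To make the phrase pivot idea concrete, the following is a minimal sketch of phrase-table triangulation through English, a common way to realize pivot translation. It is not taken from the article, and the authors' actual method may differ. The function name `triangulate`, the dictionary-based table layout, and the toy probabilities are all assumptions made for illustration only.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine a source->pivot and a pivot->target phrase table (hypothetical layout).

    Each table maps a phrase to a dict {translated phrase: probability}.
    The source->target score is estimated by marginalizing over pivot phrases:
        p(tgt | src) ~= sum over pivot of p(pivot | src) * p(tgt | pivot)
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy Japanese->English and English->Vietnamese tables (invented numbers).
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = triangulate(ja_en, en_vi)
for tgt, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"猫 -> {tgt}: {prob:.2f}")
```

In this sketch, source phrases whose pivot translations never appear in the second table simply receive no target entries, which is one reason pivot tables are often combined with a directly trained table when one is available.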
Pages: 22