Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Cited by: 7

Authors:

Hai-Long Trieu [1]

Duc-Vu Tran [1]

Ashwin Ittoo [2]

Le-Minh Nguyen [1]

Affiliations:

[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, Asahidai 1-1, Nomi, Ishikawa, Japan

[2] Univ Liege, QUANTOM Ctr Quantitat Methods & Operat Management, HEC Liege, Rue Louvrex 14, B-4000 Liege, Belgium

Source:

ACM Transactions on Asian and Low-Resource Language Information Processing | 2019, Vol. 18, No. 3

Keywords:

Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages

DOI:

10.1145/3314936

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Phrase-based machine translation (MT) systems require large bilingual corpora for training. However, such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs considered here, Japanese, Indonesian, and Malay each paired with Vietnamese, are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, no prior work on MT exists for them, which further hinders progress. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs: translation from Japanese, Indonesian, and Malay into Vietnamese. Our approach combines two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data for the low-resource languages. Additionally, we introduce a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show improvements of +2 to +7 BLEU points. This work contributes to the development of MT on low-resource languages and opens a promising direction for MT on Asian language pairs.
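
The phrase pivot strategy named in the abstract can be illustrated with a minimal sketch of phrase-table triangulation through English. This is a generic textbook formulation, not necessarily the authors' exact method: it assumes each phrase table is a nested dictionary of conditional probabilities and estimates p(target | source) by summing p(pivot | source) * p(target | pivot) over shared pivot phrases. The function name `triangulate` and the toy entries and probabilities are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine source->pivot and pivot->target phrase tables.

    Each table maps a phrase to {translation: probability}.
    The result maps each source phrase to {target: probability},
    marginalizing over every pivot phrase the two tables share.
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Toy Japanese->English and English->Vietnamese tables (invented values).
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = triangulate(ja_en, en_vi)
for target, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"{target}\t{prob:.2f}")
```

In practice a pivoted table of this kind would typically be pruned and then combined with a table trained on direct bilingual data, which is the role the abstract assigns to the combined model.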

Pages: 22
