Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language

Cited by: 0
Authors
Saxena, Shefali [1 ]
Gupta, Ayush [1 ]
Daniel, Philemon [1 ]
Affiliations
[1] Natl Inst Technol Hamirpur, Dept Elect & Commun Engn, Hamirpur, India
Keywords
Data Augmentation; Low-resource language; Machine Translation; Evaluation;
DOI
10.1007/s11042-023-18086-8
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812
Abstract
With the rapid advancement of AI technology in recent years, many effective Data Augmentation (DA) approaches have been investigated to increase data efficiency in Natural Language Processing (NLP). NLP models rely on large amounts of labelled data, yet labelling enormous amounts of textual data requires substantial time, money, and human resources; hence, building a better model typically demands more data than is available. Text DA techniques address this by extending the existing data, enhancing the model's accuracy and resilience. A novel lexical-based matching approach is the cornerstone of this work; it is used to improve the quality of the Machine Translation (MT) system. This study includes resource-rich Indic languages (i.e., from the Indo-Aryan and Dravidian language families) to examine the proposed techniques. Extensive experiments on a range of language pairs show that the proposed method significantly improves scores on the enhanced dataset compared to the baseline system's BLEU, METEOR, and ROUGE evaluation scores.
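The abstract does not specify how the lexical matching is performed, but the general idea of lexical-overlap-based data augmentation for MT can be sketched as follows. This is a minimal illustration under stated assumptions: the Jaccard token-overlap measure, the `lexical_match`/`augment` names, and the 0.5 threshold are all hypothetical choices, not the authors' actual method.

```python
def lexical_match(sent_a: str, sent_b: str) -> float:
    """Jaccard overlap between the token sets of two sentences (0.0 to 1.0)."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def augment(parallel, monolingual, threshold=0.5):
    """Build synthetic parallel pairs: for each monolingual source sentence,
    reuse the target side of the lexically closest existing parallel pair
    when the overlap exceeds the threshold."""
    augmented = []
    for mono in monolingual:
        best_src, best_tgt = max(parallel,
                                 key=lambda pair: lexical_match(mono, pair[0]))
        if lexical_match(mono, best_src) >= threshold:
            augmented.append((mono, best_tgt))
    return augmented
```

In such a scheme the augmented pairs would be concatenated with the original bitext before training the SMT system; a stricter threshold trades augmentation volume for pair quality.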
Pages: 64255-64269
Page count: 15