RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation

Cited by: 0
Authors
Wang, Yan [1 ]
Zeng, Yawen [2 ]
Liang, Junjie [1 ]
Xing, Xiaofen [1 ]
Xu, Jin [1 ]
Xu, Xiangmin [1 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] ByteDance AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
multi-modal machine translation; multi-modal prompt learning; multi-modal dictionary;
DOI
10.1145/3652583.3658018
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As an extension of machine translation, multi-modal machine translation aims to make the best use of visual information. Technically, image information is integrated as an auxiliary modality into multi-modal fusion and alignment through concepts or latent semantics, typically within a Transformer framework. However, current approaches often neglect one modality while designing numerous handcrafted features (e.g., visual concept extraction), and they require training all parameters of their framework. It is therefore worthwhile to explore multi-modal concepts or features that enhance performance, as well as an efficient way to incorporate visual information at minimal cost. Meanwhile, despite their powerful capabilities, multi-modal large language models (MLLMs) suffer from visual hallucination, which compromises performance. Inspired by pioneering techniques in the multi-modal field, such as prompt learning and MLLMs, this paper explores applying multi-modal prompt learning to the multi-modal machine translation task. Our framework offers three key advantages: it establishes a robust connection between visual concepts and the translation process, requires training as few as 1.46M parameters, and can be seamlessly integrated into any existing framework by retrieving from a multi-modal dictionary. Specifically, we propose two prompt-guided strategies: a learnable prompt-refinement module and a heuristic prompt-refinement module. The learnable strategy utilizes off-the-shelf pre-trained models, while the heuristic strategy mitigates the hallucination problem via concept retrieval. Experiments on two real-world benchmark datasets demonstrate that our proposed method outperforms all competitors.
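The retrieval-constrained idea in the abstract — look up visual concepts in a multi-modal dictionary and splice them into the translation prompt — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy 4-dimensional embeddings, the dictionary entries, and the function names (`retrieve_concepts`, `build_prompt`) are all hypothetical stand-ins for a real vision encoder and concept vocabulary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def retrieve_concepts(image_emb, dictionary, k=2):
    """Return the k concept words whose embeddings best match the image embedding."""
    ranked = sorted(dictionary, key=lambda c: cosine(image_emb, dictionary[c]),
                    reverse=True)
    return ranked[:k]

def build_prompt(source_sentence, concepts):
    """Prepend retrieved concepts as a soft visual constraint on translation."""
    return f"concepts: {', '.join(concepts)} | translate: {source_sentence}"

# Toy 4-dimensional embedding space standing in for a real vision encoder.
dictionary = {
    "dog":  [1.0, 0.1, 0.0, 0.0],
    "ball": [0.0, 1.0, 0.1, 0.0],
    "car":  [0.0, 0.0, 1.0, 0.1],
}
image_emb = [0.9, 0.8, 0.05, 0.0]  # e.g., an image of a dog playing with a ball

concepts = retrieve_concepts(image_emb, dictionary, k=2)
print(build_prompt("Ein Hund spielt mit einem Ball.", concepts))
```

Because the retrieved concepts come from a fixed dictionary rather than free-form generation, constraining the prompt this way is one plausible reading of how the heuristic strategy limits hallucinated visual content.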
Pages: 860-868
Page count: 9