RetrievalMMT: Retrieval-Constrained Multi-Modal Prompt Learning for Multi-Modal Machine Translation

Cited by: 0
Authors
Wang, Yan [1 ]
Zeng, Yawen [2 ]
Liang, Junjie [1 ]
Xing, Xiaofen [1 ]
Xu, Jin [1 ]
Xu, Xiangmin [1 ]
Affiliations
[1] South China University of Technology, Guangzhou, China
[2] ByteDance AI Lab, Beijing, China
Funding
National Natural Science Foundation of China
Keywords
multi-modal machine translation; multi-modal prompt learning; multi-modal dictionary;
DOI
10.1145/3652583.3658018
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
As an extension of machine translation, multi-modal machine translation aims above all to make the best use of visual information. Technically, image information is integrated as an auxiliary modality into multi-modal fusion and alignment through concepts or latent semantics, typically within a Transformer-based framework. However, current approaches often neglect one modality while designing numerous handcrafted features (e.g., visual concept extraction), and they require training all parameters of their framework. It is therefore worthwhile to explore multi-modal concepts or features that enhance performance, as well as an efficient way to incorporate visual information at minimal cost. Meanwhile, although multi-modal large language models (MLLMs) have become increasingly capable, they suffer from visual hallucination, which compromises performance. Inspired by pioneering techniques in the multi-modal field such as prompt learning and MLLMs, this paper explores applying multi-modal prompt learning to the multi-modal machine translation task. Our framework offers three key advantages: it establishes a robust connection between visual concepts and the translation process, requires training as few as 1.46M parameters, and can be seamlessly integrated into any existing framework by retrieving from a multi-modal dictionary. Specifically, we propose two prompt-guided strategies: a learnable prompt-refined module and a heuristic prompt-refined module. The learnable strategy utilizes off-the-shelf pre-trained models, while the heuristic strategy constrains the hallucination problem via concept retrieval. Experiments on two real-world benchmark datasets demonstrate that our proposed method outperforms all competitors.
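The retrieval-constrained, heuristic idea sketched in the abstract can be pictured with a minimal example (not the authors' implementation): an off-the-shelf vision-language model scores a small multi-modal dictionary of concept words against the input image, and the top-scoring concepts are prepended to the source sentence as a textual prompt for the translation model. The concept list, the CLIP checkpoint, and the prompt format below are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical multi-modal "dictionary" of visual concept words;
# the paper's actual dictionary construction is not reproduced here.
CONCEPT_DICTIONARY = ["dog", "ball", "grass", "child", "bicycle", "beach"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_concepts(image: Image.Image, top_k: int = 3) -> list[str]:
    """Score each dictionary entry against the image and keep the top-k concepts."""
    inputs = processor(text=CONCEPT_DICTIONARY, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (1, num_concepts) image-text similarity scores
    scores = outputs.logits_per_image.squeeze(0)
    top = torch.topk(scores, k=top_k).indices.tolist()
    return [CONCEPT_DICTIONARY[i] for i in top]

def build_prompted_source(src_sentence: str, image: Image.Image) -> str:
    """Prepend retrieved concepts as a textual prompt to the source sentence."""
    concepts = retrieve_concepts(image)
    return "concepts: " + ", ".join(concepts) + " | " + src_sentence
```

Constraining the prompt to dictionary entries retrieved for the specific image is what keeps the added context grounded, in contrast to free-form MLLM captions that may hallucinate objects absent from the scene.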
Pages: 860-868
Page count: 9