Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering

Cited by: 0
Authors
Qian, Zi [1 ,2 ]
Wang, Xin [1 ]
Duan, Xuguang [1 ]
Qin, Pengda [2 ]
Li, Yuhong [2 ]
Zhu, Wenwu [1 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, BNRist, Beijing, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
LANGUAGE
DOI
10.1109/ICCV51070.2023.00276
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In the real world, a desirable Visual Question Answering model is expected to provide correct answers to new questions and images in a continual setting (referred to as CL-VQA). However, existing works formulate CL-VQA from a vision-only or language-only perspective and directly apply uni-modal continual learning (CL) strategies to this multi-modal task, which is improper and suboptimal. On the one hand, such a partial formulation may result in limited evaluations. On the other hand, neglecting the interactions between modalities leads to poor performance. To tackle these challenges, we propose a comprehensive formulation for CL-VQA from the perspective of multi-modal vision-language fusion. Based on this formulation, we further propose MulTi-Modal PRompt LearnIng with DecouPLing bEfore InTeraction (TRIPLET), a novel approach that builds on a pre-trained vision-language model and consists of decoupled prompts and prompt interaction strategies to capture the complex interactions between modalities. In particular, the decoupled prompts contain learnable parameters that are decoupled w.r.t. different aspects, and the prompt interaction strategies are in charge of modeling interactions between inputs and prompts. Additionally, we build two CL-VQA benchmarks for a more comprehensive evaluation. Extensive experiments demonstrate that TRIPLET outperforms state-of-the-art methods in both uni-modal and multi-modal continual settings for CL-VQA.
Pages: 2941-2950 (10 pages)
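To make the "decouple before interact" idea from the abstract concrete, below is a minimal PyTorch sketch of modality-decoupled prompts prepended to the token streams of a frozen vision-language backbone. This is not the authors' implementation: the class name DecoupledPrompts, the three prompt pools, the shapes, and the naive concatenation used as a fusion stand-in are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecoupledPrompts(nn.Module):
    """Illustrative sketch (not TRIPLET's actual code): separate learnable
    prompt pools for the visual, textual, and fused streams, prepended to
    the token sequences of a frozen vision-language backbone."""

    def __init__(self, prompt_len: int = 8, dim: int = 768):
        super().__init__()
        # One prompt pool per aspect, kept decoupled from the others.
        self.visual_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.text_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.fusion_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    @staticmethod
    def attach(tokens: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # Prepend the prompt to every token sequence in the batch.
        batch = tokens.size(0)
        return torch.cat([prompt.unsqueeze(0).expand(batch, -1, -1), tokens], dim=1)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Decouple: each modality first interacts only with its own prompt.
        v = self.attach(img_tokens, self.visual_prompt)
        t = self.attach(txt_tokens, self.text_prompt)
        # Interact: a fusion-level prompt is attached to the joint sequence,
        # which a frozen cross-modal encoder would then process.
        fused = torch.cat([v, t], dim=1)
        return self.attach(fused, self.fusion_prompt)
```

In a continual setting, only these prompt parameters (plus a lightweight answer head) would typically be updated across tasks while the pre-trained vision-language model stays frozen, which is the usual way prompt-based CL methods limit catastrophic forgetting.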