Efficient Multimodal Fusion via Interactive Prompting

Cited by: 12
Authors
Li, Yaowei [1]
Quan, Ruijie [2]
Zhu, Linchao [2]
Yang, Yi [2]
Affiliations
[1] Univ Technol Sydney, ReLER, AAII, Sydney, NSW, Australia
[2] Zhejiang Univ, CCAI, Hangzhou, Peoples R China
Funding
Australian Research Council
DOI
10.1109/CVPR52729.2023.00256
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing into a new era. Following this trend, the size of multimodal learning models is constantly increasing, leading to an urgent need to reduce the massive computational cost of finetuning these models for downstream tasks. In this paper, we propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pretrained transformers. Specifically, we first present a modular multimodal fusion framework that exhibits high flexibility and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types in order to learn different optimization objectives for multimodal learning. It is also worth noting that we propose to add prompt vectors only to the deep layers of the unimodal transformers, thus significantly reducing the training memory usage. Experimental results show that our proposed method achieves performance comparable to several other multimodal finetuning methods with fewer than 3% trainable parameters and up to 66% savings in training memory usage.
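As a rough illustration of why attaching prompt vectors only to the deep layers keeps the trainable parameter count small, the following sketch counts prompt parameters for a frozen transformer. It is not the authors' implementation: the function names, the layer count, prompt length, hidden size, and backbone size are all hypothetical values chosen for the example, and the count ignores any other trainable modules (such as a classifier head) that a real system would have.

```python
def prompt_param_count(num_layers, first_prompted_layer, prompt_len,
                       hidden_dim, prompt_types=3):
    """Count prompt parameters when prompts are attached only to layers
    first_prompted_layer .. num_layers - 1 (0-indexed).

    prompt_types=3 mirrors the abstract's "three types" of prompts; the
    assumption that every type has the same length is ours.
    """
    if not 0 <= first_prompted_layer <= num_layers:
        raise ValueError("first_prompted_layer must lie in [0, num_layers]")
    prompted_layers = num_layers - first_prompted_layer
    return prompted_layers * prompt_types * prompt_len * hidden_dim


def trainable_fraction(prompt_params, frozen_params):
    """Share of all parameters that are trainable when the backbone is frozen."""
    return prompt_params / (prompt_params + frozen_params)


# Hypothetical configuration: 12 layers, hidden size 768, prompts of length 4.
deep_only = prompt_param_count(12, 10, 4, 768)   # prompts on the last 2 layers
all_layers = prompt_param_count(12, 0, 4, 768)   # prompts on every layer

# Hypothetical frozen backbone size, for illustration only.
frozen = 86_000_000
print(deep_only, all_layers)
print(trainable_fraction(deep_only, frozen))
```

The parameter count is only part of the picture. The memory saving reported in the abstract is attributed to placing prompts on the deep layers, which this arithmetic does not model.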
Pages: 2604-2613
Page count: 10