Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

Cited by: 0
Authors
Wu, Qiong [1 ,2 ]
Yu, Wei [1 ,2 ]
Zhou, Yiyi [1 ,2 ]
Huang, Shubin [1 ]
Sun, Xiaoshuai [1 ,2 ]
Ji, Rongrong [1 ,2 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Co, Minist Educ China, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
With ever-increasing parameters and computation, vision-language pre-trained (VLP) models incur prohibitive expenditure in downstream task adaptation. Recent endeavors mainly focus on parameter-efficient transfer learning (PETL) for VLP models by updating only a small number of parameters. However, excessive computational overhead still plagues the application of VLP models. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significance of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to a set of representative VLP models and conduct extensive experiments on several VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g., -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is available at https://github.com/DoubtedSteam/DAS.
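To make the mechanism concrete, below is a minimal sketch of the adapter-based layer replacement the abstract describes. The Adapter bottleneck design and the skip_redundant_layers helper are illustrative assumptions, not the authors' implementation, and the RL-based significance search that produces the set of redundant layer indices is omitted; see https://github.com/DoubtedSteam/DAS for the actual code.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Lightweight bottleneck (hypothetical design) standing in for a
    # skipped VLP layer: down-project, nonlinearity, up-project, residual.
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the skipped layer's input signal.
        return x + self.up(self.act(self.down(x)))

def skip_redundant_layers(layers: nn.ModuleList, redundant: set, dim: int) -> nn.ModuleList:
    # Replace the layers judged redundant (e.g., by the RL-based search
    # described in the abstract) with adapters: inference FLOPs drop
    # because a two-layer bottleneck is far cheaper than a full
    # transformer block, while few trainable parameters are added.
    return nn.ModuleList(
        Adapter(dim) if i in redundant else layer
        for i, layer in enumerate(layers)
    )

For example, if the search marked layers 9 and 11 of a 12-layer encoder as redundant, skip_redundant_layers(encoder.layers, {9, 11}, dim=768) would swap those two blocks for adapters while leaving the rest of the backbone intact.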
Pages: 17
Related Papers
50 records in total
  • [1] Universal Adversarial Perturbations for Vision-Language Pre-trained Models
    Zhang, Peng-Fei
    Huang, Zi
    Bai, Guangdong
    [J]. PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 862 - 871
  • [2] Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation
    Li, Qi
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 1308 - 1317
  • [3] MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
    Manas, Oscar
    Rodriguez, Pau
    Ahmadi, Saba
    Nematzadeh, Aida
    Goyal, Yash
    Agrawal, Aishwarya
    [J]. 17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2523 - 2548
  • [4] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    [J]. 2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [5] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    [J]. AI OPEN, 2024, 5 : 30 - 38
  • [6] p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
    Wu, Haoyuan
    Zhang, Xinyun
    Xu, Peng
    Liao, Peiyu
    Yao, Xufeng
    Yu, Bei
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6003 - 6011
  • [7] Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
    Kawaharazuka, Kento
    Obinata, Yoshiki
    Kanazawa, Naoaki
    Okada, Kei
    Inaba, Masayuki
    [J]. 2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023,
  • [8] VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
    Yin, Ziyi
    Ye, Muchao
    Zhang, Tianrong
    Du, Tianyu
    Zhu, Jinguo
    Liu, Han
    Chen, Jinghui
    Wang, Ting
    Ma, Fenglong
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [9] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
    Zheng, Kecheng
    Wu, Wei
    Feng, Ruili
    Zhu, Kai
    Liu, Jiawei
    Zhao, Deli
    Zha, Zheng-Jun
    Chen, Wei
    Shen, Yujun
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11629 - 11639
  • [10] Open-World Object Manipulation using Pre-Trained Vision-Language Models
    Stone, Austin
    Xiao, Ted
    Lu, Yao
    Gopalakrishnan, Keerthana
    Lee, Kuang-Huei
    Vuong, Quan
    Wohlhart, Paul
    Kirmani, Sean
    Zitkovich, Brianna
    Xia, Fei
    Finn, Chelsea
    Hausman, Karol
    [J]. CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229