MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning

Cited by: 0
Authors:
Farina, Matteo [1]
Mancini, Massimiliano [1]
Cunegatti, Elia [1]
Liu, Gaowen [2]
Iacca, Giovanni [1]
Ricci, Elisa [1,3]
Affiliations:
[1] University of Trento, Trento, Italy
[2] Cisco Research, Research Triangle Park, NC, USA
[3] Fondazione Bruno Kessler, Povo, Italy
DOI: 10.1109/CVPR52733.2024.01532
Chinese Library Classification: TP18 (Artificial Intelligence Theory)
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first gradient-free pruning framework for TA-VLP in which: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent, more sophisticated combinatorial competitors in the vast majority of cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.
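The two ingredients summarized in the abstract can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration derived only from the abstract, not the authors' implementation: all function names are assumptions, neuron saliency is approximated by aggregated weight magnitude, and the uniform per-tensor threshold is a simplification of the per-modality budgeting that the paper derives from the pretrained parameter distribution.

import torch

def neuron_saliency(weight):
    # Proxy for the information flow through each neuron: the mean
    # magnitude of the weights attached to it (an assumption here).
    out_sal = weight.abs().mean(dim=1)  # one value per output neuron
    in_sal = weight.abs().mean(dim=0)   # one value per input neuron
    return out_sal, in_sal

def multiflow_like_scores(weight):
    # Score(w_ij) = |w_ij| * saliency(output neuron i) * saliency(input
    # neuron j); broadcasting forms the outer product of the two vectors.
    out_sal, in_sal = neuron_saliency(weight)
    return weight.abs() * out_sal.unsqueeze(1) * in_sal.unsqueeze(0)

def prune(layers, ratio):
    # Build binary masks at the requested sparsity. A uniform per-tensor
    # ratio is used for brevity; the paper instead lets the emergent
    # multimodal parameter distribution set each modality's budget.
    masks = {}
    for name, weight in layers.items():
        scores = multiflow_like_scores(weight)
        k = max(1, int(scores.numel() * ratio))       # weights to remove
        thresh = scores.flatten().kthvalue(k).values  # k-th smallest score
        masks[name] = (scores > thresh).float()
    return masks

Because the scores depend only on weight magnitudes, no gradients or calibration data are required, matching the gradient-free, task-agnostic setting described above; applying weight.mul_(masks[name]) then zeroes the pruned connections.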
Pages: 16185-16195 (11 pages)
Related papers
49 in total (items [31]-[40] shown)
  • [31] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
    Dai, Wenliang
    Li, Junnan
    Li, Dongxu
    Tiong, Anthony Meng Huat
    Zhao, Junqi
    Wang, Weisheng
    Li, Boyang
    Fung, Pascale
    Hoi, Steven
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [32] Towards Multimodal Vision-Language Models Generating Non-generic Text
    Robbins, Wes
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 13138 - 13139
  • [33] EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
    Wang, Tiannan
    Zhou, Wangchunshu
    Zeng, Yan
    Zhang, Xinsong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 13899 - 13913
  • [34] Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
    Long, Sifan
    Zhao, Zhen
    Yuan, Junkun
    Tan, Zichang
    Liu, Jiangjiang
    Zhou, Luping
    Wang, Shengsheng
    Wang, Jingdong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21902 - 21912
  • [35] Multi-task prompt tuning with soft context sharing for vision-language models
    Ding, Kun
    Wang, Ying
    Liu, Pengzhang
    Yu, Qiang
    Zhang, Haojian
    Xiang, Shiming
    Pan, Chunhong
    NEUROCOMPUTING, 2024, 603
  • [36] Align vision-language semantics by multi-task learning for multi-modal summarization
    Cui, C.
    Liang, X.
    Wu, S.
    Li, Z.
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (25) : 15653 - 15666
  • [37] Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
    Li, Rongjie
    Wu, Yu
    He, Xuming
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13428 - 13437
  • [38] TAGGAR: General-Purpose Task Guidance from Natural Language in Augmented Reality using Vision-Language Models
    Stover, Daniel
    Bowman, Doug A.
    PROCEEDINGS OF THE 2024 ACM SYMPOSIUM ON SPATIAL USER INTERACTION, SUI 2024, 2024,
  • [39] Experiential Views: Towards Human Experience Evaluation of Designed Spaces using Vision-Language Models
    Aseniero, Bon Adriel
    Lee, Michael
    Wang, Yi
    Zhou, Qian
    Shahmansouri, Nastaran
    Goldstein, Rhys
    EXTENDED ABSTRACTS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024,
  • [40] CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
    Zhu, Hongguang
    Wei, Yunchao
    Liang, Xiaodan
    Zhang, Chunjie
    Zhao, Yao
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22200 - 22210