MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning

Cited by: 0

Authors:

Farina, Matteo [1]

Mancini, Massimiliano [1]

Cunegatti, Elia [1]

Liu, Gaowen [2]

Iacca, Giovanni [1]

Ricci, Elisa [1,3]

Affiliations:

[1] Univ Trento, Trento, Italy

[2] Cisco Res, Res Triangle Pk, NC USA

[3] Fdn Bruno Kessler, Povo, Italy

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01532

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.


Pages: 16185 - 16195

Page count: 11

49 items in total

[31] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Dai, Wenliang
Li, Junnan
Li, Dongxu
Tiong, Anthony Meng Huat
Zhao, Junqi
Wang, Weisheng
Li, Boyang
Fung, Pascale
Hoi, Steven
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[32] Towards Multimodal Vision-Language Models Generating Non-generic Text
Robbins, Wes
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 13138 - 13139
[33] EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Wang, Tiannan
Zhou, Wangchunshu
Zeng, Yan
Zhang, Xinsong
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 13899 - 13913
[34] Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
Long, Sifan
Zhao, Zhen
Yuan, Junkun
Tan, Zichang
Liu, Jiangjiang
Zhou, Luping
Wang, Shengsheng
Wang, Jingdong
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21902 - 21912
[35] Multi-task prompt tuning with soft context sharing for vision-language models
Ding, Kun
Wang, Ying
Liu, Pengzhang
Yu, Qiang
Zhang, Haojian
Xiang, Shiming
Pan, Chunhong
NEUROCOMPUTING, 2024, 603
[36] Align vision-language semantics by multi-task learning for multi-modal summarization
Cui C.
Liang X.
Wu S.
Li Z.
Neural Computing and Applications, 2024, 36 (25) : 15653 - 15666
[37] Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Li, Rongjie
Wu, Yu
He, Xuming
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13428 - 13437
[38] TAGGAR: General-Purpose Task Guidance from Natural Language in Augmented Reality using Vision-Language Models
Stover, Daniel
Bowman, Doug A.
PROCEEDINGS OF THE 2024 ACM SYMPOSIUM ON SPATIAL USER INTERACTION, SUI 2024, 2024,
[39] Experiential Views: Towards Human Experience Evaluation of Designed Spaces using Vision-Language Models
Aseniero, Bon Adriel
Lee, Michael
Wang, Yi
Zhou, Qian
Shahmansouri, Nastaran
Goldstein, Rhys
EXTENDED ABSTRACTS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024,
[40] CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Zhu, Hongguang
Wei, Yunchao
Liang, Xiaodan
Zhang, Chunjie
Zhao, Yao
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22200 - 22210
