EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Cited by: 0
Authors
Wang, Tiannan [1 ]
Zhou, Wangchunshu [2 ]
Zeng, Yan [3 ]
Zhang, Xinsong [3 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] ByteDance AI Lab, Beijing, Peoples R China
Keywords: Not available
DOI: Not available
Abstract
Pre-trained vision-language models (VLMs) have achieved impressive results on a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which poses challenges for fine-tuning and deployment in real-world applications due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. Then we propose a modal-adaptive pruning algorithm to automatically infer the importance of the vision and language modalities for different downstream tasks and adaptively remove redundant structures and neurons in the different encoders with a controllable target sparsity.
Pages: 13899-13913
Page count: 15
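
The abstract describes a two-stage distill-then-prune pipeline: task-agnostic knowledge distillation during vision-language pre-training, followed by modal-adaptive pruning toward a controllable target sparsity. The PyTorch sketch below is an illustrative reconstruction of how such a pipeline could be wired together, not the authors' released implementation; the specific loss terms, the head-level gate parameterization, and names such as `distillation_loss` and `ModalAdaptiveGates` are assumptions made for the example.

```python
# Minimal sketch of the two stages outlined in the abstract.
# Illustrative only: module names, loss terms, and gate parameterization
# are assumptions, not the EfficientVLM implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, temperature=2.0):
    """Stage 1 (assumed form): distill a large teacher VLM into a smaller
    student during vision-language pre-training, combining a soft-label
    loss with a hidden-state matching loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # In practice the student's hidden size may differ from the teacher's,
    # so a learned projection would be needed before this MSE term.
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    return soft + hidden


class ModalAdaptiveGates(nn.Module):
    """Stage 2 (assumed form): learnable gates over attention heads in the
    vision and language encoders. Training the gates on a downstream task
    lets each modality keep a different fraction of heads, subject to a
    controllable target sparsity."""

    def __init__(self, num_vision_heads: int, num_text_heads: int):
        super().__init__()
        self.vision_logits = nn.Parameter(torch.zeros(num_vision_heads))
        self.text_logits = nn.Parameter(torch.zeros(num_text_heads))

    def forward(self):
        # Relaxed gates in [0, 1]; hard 0/1 masks would be derived at the
        # end of training to actually remove heads.
        return torch.sigmoid(self.vision_logits), torch.sigmoid(self.text_logits)

    def sparsity_penalty(self, target_sparsity: float):
        # Penalize deviation of the expected kept ratio from the target.
        v_gate, t_gate = self.forward()
        kept_ratio = torch.cat([v_gate, t_gate]).mean()
        return (kept_ratio - (1.0 - target_sparsity)) ** 2


# Usage sketch: the sparsity penalty is added to the downstream task loss so
# the pruning pattern is learned jointly with fine-tuning.
gates = ModalAdaptiveGates(num_vision_heads=144, num_text_heads=144)
task_loss = torch.tensor(0.0)  # placeholder for the downstream task loss
total_loss = task_loss + 1.0 * gates.sparsity_penalty(target_sparsity=0.5)
```

Because the gates for the vision and language encoders are trained jointly against a shared sparsity budget, the two modalities can end up retaining different fractions of their heads, which captures the modal-adaptive idea at a high level under the stated assumptions.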