EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Cited by: 0
Authors
Wang, Tiannan [1]
Zhou, Wangchunshu [2]
Zeng, Yan [3]
Zhang, Xinsong [3]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Bytedance AI Lab, Beijing, Peoples R China
Abstract
Pre-trained vision-language models (VLMs) have achieved impressive results on a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and deployment in real-world applications due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. Then we propose a modal-adaptive pruning algorithm to automatically infer the importance of the vision and language modalities for different downstream tasks and adaptively remove redundant structures and neurons in the different encoders with controllable target sparsity.
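The abstract outlines two ingredients: a distillation objective applied during vision-language pre-training, and modal-adaptive pruning that lets the vision and language encoders keep different amounts of structure under one overall sparsity budget. The following is a minimal, hypothetical PyTorch-style sketch of those two ideas only; the function names, the importance-scoring inputs, and the head-count figures are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label knowledge distillation: the compact student matches the
    # teacher's softened output distribution (standard KD formulation).
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def modal_adaptive_prune(importance_vision, importance_language, target_sparsity):
    # Rank prunable units (e.g. attention heads) from both encoders in a single
    # pool and keep the top (1 - target_sparsity) fraction. Because the two
    # modalities compete in one ranking, the modality that matters more for a
    # given downstream task retains more of its structure.
    scores = torch.cat([importance_vision, importance_language])
    keep = max(1, int(round(scores.numel() * (1.0 - target_sparsity))))
    threshold = torch.topk(scores, keep).values.min()
    mask_vision = (importance_vision >= threshold).float()
    mask_language = (importance_language >= threshold).float()
    return mask_vision, mask_language

# Example: prune to 50% overall sparsity given per-head importance scores
# (hypothetical sizes: 12x12 heads for vision, 6x12 heads for language).
imp_v = torch.rand(144)
imp_l = torch.rand(72)
mask_v, mask_l = modal_adaptive_prune(imp_v, imp_l, target_sparsity=0.5)

In this sketch, pooling the two encoders' importance scores before thresholding is what makes the pruning "modal-adaptive": for a vision-heavy task the vision encoder would clear the threshold more often and thus be pruned less, and vice versa, while the overall sparsity stays at the requested target.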
Pages: 13899 - 13913
Page count: 15
Related Papers
50 records in total
  • [1] Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
    Dai, Wenliang
    Hou, Lu
    Shang, Lifeng
    Jiang, Xin
    Liu, Qun
    Fung, Pascale
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 2383 - 2395
  • [2] Adapting Vision-Language Models via Learning to Inject Knowledge
    Xuan, Shiyu
    Yang, Ming
    Zhang, Shiliang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 5798 - 5809
  • [3] Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
    Ye, Shuquan
    Xie, Yujia
    Chen, Dongdong
    Xu, Yichong
    Yuan, Lu
    Zhu, Chenguang
    Liao, Jing
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2634 - 2645
  • [4] cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
    Gupta, Kshitij
    Gautam, Devansh
    Mamidi, Radhika
    Proceedings - International Conference on Pattern Recognition, 2022, 2022-August : 1734 - 1741
  • [5] cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
    Gupta, Kshitij
    Gautam, Devansh
    Mamidi, Radhika
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1734 - 1741
  • [6] DC-CLIP: Multilingual CLIP Compression via vision-language distillation and vision-language alignment
    Zhang, Wenbo
    Zhang, Yifan
    Lin, Jianfeng
    Huang, Binqiang
    Zhang, Jinlu
    Yu, Wenhao
    PATTERN RECOGNITION, 2025, 164
  • [7] Layerwised multimodal knowledge distillation for vision-language pretrained model
    Wang, Jin
    Liao, Dawei
    Zhang, You
    Xu, Dan
    Zhang, Xuejie
    NEURAL NETWORKS, 2024, 175
  • [8] Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval
    Sun, Lina
    Li, Yewen
    Dong, Yumin
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 499 - 507
  • [9] MMA: Multi-Modal Adapter for Vision-Language Models
    Yang, Lingxiao
    Zhang, Ru-Yuan
    Wang, Yanchen
    Xie, Xiaohua
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 23826 - +
  • [10] Multi-Modal Attribute Prompting for Vision-Language Models
    Liu, Xin
    Wu, Jiamin
    Yang, Wenfei
    Zhou, Xu
    Zhang, Tianzhu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11579 - 11591