EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Cited by: 0
Authors
Wang, Tiannan [1 ]
Zhou, Wangchunshu [2 ]
Zeng, Yan [3 ]
Zhang, Xinsong [3 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Bytedance AI LAB, Beijing, Peoples R China
Keywords: (none listed)
DOI: not available
Abstract
Pre-trained vision-language models (VLMs) have achieved impressive results on a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which poses challenges for fine-tuning and deployment in real-world applications due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. We then propose a modal-adaptive pruning algorithm that automatically infers the importance of the vision and language modalities for different downstream tasks and adaptively removes redundant structures and neurons from the different encoders with controllable target sparsity.
Pages: 13899-13913
Page count: 15
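
The two stages summarized in the abstract lend themselves to a short illustration. Below is a minimal PyTorch-style sketch, not the authors' released implementation: the loss formulation, the importance scores, the toy tensor sizes, and all hyper-parameter values (temperature, alpha, target sparsity) are illustrative assumptions, and EfficientVLM's actual objectives and pruning criteria may differ.

```python
# Minimal sketch of the "distill then prune" recipe summarized in the abstract.
# NOT the authors' code: the loss formulation, the importance scores, and every
# hyper-parameter below are illustrative assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Stage 1 (task-agnostic KD during pre-training): blend a softened KL term
    against the frozen teacher with the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def modal_adaptive_masks(vision_scores, language_scores, target_sparsity=0.5):
    """Stage 2 (modal-adaptive pruning): rank per-unit importance scores from BOTH
    encoders jointly, so the modality that matters less for a given downstream task
    absorbs more of the pruning budget while the overall sparsity stays fixed."""
    all_scores = torch.cat([vision_scores.flatten(), language_scores.flatten()])
    k = max(1, int(target_sparsity * all_scores.numel()))  # units to prune overall
    threshold = torch.kthvalue(all_scores, k).values       # global cut-off score
    vision_mask = (vision_scores > threshold).float()      # 1 = keep, 0 = prune
    language_mask = (language_scores > threshold).float()
    return vision_mask, language_mask


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy KD step: 4 examples, 10 classes (shapes are arbitrary).
    s, t = torch.randn(4, 10), torch.randn(4, 10)
    y = torch.randint(0, 10, (4,))
    print("KD loss:", distillation_loss(s, t, y).item())

    # Toy pruning step: importance scores (e.g. accumulated |grad x activation|)
    # for 12 layers x 64 neurons per modality (sizes are arbitrary).
    v_scores, l_scores = torch.rand(12, 64), torch.rand(12, 64)
    v_mask, l_mask = modal_adaptive_masks(v_scores, l_scores, target_sparsity=0.5)
    print("kept in vision encoder:  ", int(v_mask.sum().item()))
    print("kept in language encoder:", int(l_mask.sum().item()))
```

Because the threshold is computed over the pooled scores of both encoders, the less important modality for a given task absorbs a larger share of the pruning budget, which is the modal-adaptive behavior the abstract describes, while the total fraction of removed units is still set by the single target-sparsity knob.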