EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Cited by: 0
Authors
Wang, Tiannan [1 ]
Zhou, Wangchunshu [2 ]
Zeng, Yan [3 ]
Zhang, Xinsong [3 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] Bytedance AI LAB, Beijing, Peoples R China
DOI: Not available
Abstract
Pre-trained vision-language models (VLMs) have achieved impressive results on a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which makes fine-tuning and deployment in real-world applications challenging due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. Then we propose a modal-adaptive pruning algorithm to automatically infer the importance of the vision and language modalities for different downstream tasks and adaptively remove redundant structures and neurons in the different encoders, with a controllable target sparsity.
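The compression recipe described in the abstract combines two ingredients: knowledge distillation from a large teacher VLM into a shrunken student during pre-training, and a modal-adaptive pruning step that learns how much of each modality's encoder to keep under a controllable target sparsity. The sketch below is a minimal, hypothetical PyTorch illustration of those two losses on a toy two-encoder student; the names (TinyVLM, GatedEncoderLayer, head_gates, target_sparsity) and the coarse per-layer gating are assumptions made for illustration, not EfficientVLM's actual architecture or code.

```python
# Minimal sketch (not the authors' implementation) of a distill-then-prune objective:
# (1) knowledge distillation from a frozen large teacher VLM to a smaller student,
# (2) modal-adaptive pruning via learnable gates on vision / language attention heads,
#     driven toward a controllable target sparsity by a penalty term.
# All names (TinyVLM, GatedEncoderLayer, head_gates, target_sparsity, ...) are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class GatedEncoderLayer(nn.Module):
    """Transformer layer whose attention heads carry learnable importance gates."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.head_gates = nn.Parameter(torch.ones(heads))  # heads pruned when gates -> 0

    def forward(self, x):
        h, _ = self.attn(x, x, x, need_weights=False)
        # Scale the attention output by the mean gate as a coarse stand-in for
        # per-head masking (a real implementation would mask heads individually).
        x = x + h * self.head_gates.sigmoid().mean()
        return x + self.ffn(x)


class TinyVLM(nn.Module):
    """Toy student with separate vision and language encoders."""

    def __init__(self, dim=256, layers=2):
        super().__init__()
        self.vision = nn.ModuleList(GatedEncoderLayer(dim) for _ in range(layers))
        self.language = nn.ModuleList(GatedEncoderLayer(dim) for _ in range(layers))
        self.head = nn.Linear(2 * dim, 2)  # e.g. image-text matching logits

    def forward(self, img_feats, txt_feats):
        for blk in self.vision:
            img_feats = blk(img_feats)
        for blk in self.language:
            txt_feats = blk(txt_feats)
        fused = torch.cat([img_feats.mean(1), txt_feats.mean(1)], dim=-1)
        return self.head(fused)

    def gates(self):
        return torch.cat([blk.head_gates for blk in list(self.vision) + list(self.language)])


def train_step(student, teacher_logits, img_feats, txt_feats, labels,
               target_sparsity=0.5, temp=2.0, alpha=0.5, lam=1.0):
    logits = student(img_feats, txt_feats)
    task_loss = F.cross_entropy(logits, labels)
    # Distillation: match the frozen teacher's soft predictions.
    distill_loss = F.kl_div(F.log_softmax(logits / temp, dim=-1),
                            F.softmax(teacher_logits / temp, dim=-1),
                            reduction="batchmean") * temp ** 2
    # Modal-adaptive pruning: push overall gate density toward 1 - target_sparsity;
    # gradients decide per modality how many heads each encoder keeps.
    density = student.gates().sigmoid().mean()
    sparsity_loss = (density - (1.0 - target_sparsity)) ** 2
    return alpha * task_loss + (1 - alpha) * distill_loss + lam * sparsity_loss


if __name__ == "__main__":
    student = TinyVLM()
    img = torch.randn(8, 16, 256)        # dummy patch features
    txt = torch.randn(8, 12, 256)        # dummy token features
    labels = torch.randint(0, 2, (8,))
    teacher_logits = torch.randn(8, 2)   # would come from the frozen large VLM
    loss = train_step(student, teacher_logits, img, txt, labels)
    loss.backward()
    print("toy loss:", float(loss))
```

In the paper's setting the pruning decision is made per structure (e.g. attention heads and neurons) within each encoder, so different downstream tasks can keep different proportions of the vision and language components; the single mean-gate scaling above only stands in for that idea under the stated assumptions.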
Pages: 13899-13913
Page count: 15
Related papers (50 in total)
  • [41] UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge
    Li, Chuanhao
    Li, Zhen
    Jing, Chenchen
    Liu, Shuo
    Shao, Wenqi
    Wu, Yuwei
    Luo, Ping
    Qiao, Yu
    Zhang, Kaipeng
    arXiv,
  • [42] A multi-modal vision-language pipeline strategy for contour quality assurance and adaptive optimization
    Luan, Shunyao
    Jun, Ou-yang
    Yang, Xiaofei
    Wei, Wei
    Xue, Xudong
    Zhu, Benpeng
    PHYSICS IN MEDICINE AND BIOLOGY, 2024, 69 (06):
  • [43] Grand: A Fast and Accurate Graph Retrieval Framework via Knowledge Distillation
    Lan, Lin
    Wang, Pinghui
    Shi, Rui
    Liu, Tingqing
    Zeng, Juxiang
    Sun, Feiyang
    Ren, Yang
    Tao, Jing
    Guan, Xiaohong
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 1639 - 1648
  • [44] Boosting adversarial transferability in vision-language models via multimodal feature heterogeneity
    Chen, Long
    Chen, Yuling
    Ouyang, Zhi
    Dou, Hui
    Zhang, Yangwen
    Sang, Haiwei
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [45] Concept-Based Analysis of Neural Networks via Vision-Language Models
    Mangal, Ravi
    Narodytska, Nina
    Gopinath, Divya
    Hu, Boyue Caroline
    Roy, Anirban
    Jha, Susmit
    Pasareanu, Corina S.
    AI VERIFICATION, SAIV 2024, 2024, 14846 : 49 - 77
  • [46] PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning
    Hussein, Noor
    Shamshad, Fahad
    Naseer, Muzammal
    Nandakumar, Karthik
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 698 - 708
  • [47] VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
    Wang, Teng
    Jiang, Wenhao
    Lu, Zhichao
    Zheng, Feng
    Cheng, Ran
    Yin, Chengguo
    Luo, Ping
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [48] Transformer vision-language tracking via proxy token guided cross-modal fusion
    Zhao, Haojie
    Wang, Xiao
    Wang, Dong
    Lu, Huchuan
    Ruan, Xiang
    PATTERN RECOGNITION LETTERS, 2023, 168 : 10 - 16
  • [49] Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models via Diffusion Models
    Guo, Qi
    Pang, Shanmin
    Jia, Xiaojun
    Liu, Yang
    Guo, Qing
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 1333 - 1348
  • [50] ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling
    Oezsoy, Ege
    Pellegrini, Chantal
    Keicher, Matthias
    Navab, Nassir
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT VI, 2024, 15006 : 455 - 465