Grid-Attention: Enhancing Computational Efficiency of Large Vision Models Without Fine-Tuning

Cited: 0
Authors
Li, Pengyu [1]
Guo, Tianchu [1]
Wang, Biao [1]
Hua, Xian-Sheng [1]
Affiliation
[1] Terminus Labs, Terminus Group, Beijing, People's Republic of China
DOI
10.1007/978-3-031-72973-7_4
Chinese Library Classification
TP18 (Artificial Intelligence Theory)
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, transformer-based large vision models, e.g., the Segment Anything Model (SAM) and Stable Diffusion (SD), have achieved remarkable success in the computer vision field. However, the complexity of the transformer's Multi-Head Attention (MHA), which is quadratic in the number of tokens and therefore quartic in spatial resolution, leads to substantial computational costs in these models, whose inputs and outputs are high-resolution. Although several prior works have attempted to alleviate this challenge, none have reduced the complexity and latency of large vision models while preserving their capabilities without requiring enormous effort and GPU hours to re-train or fine-tune the models. To address this challenge, we propose a simple yet effective plug-and-play transformer block called Grid-Attention (GridAttn). GridAttn integrates the proposed Grid Clustering module, Grid Distributing strategies, and Grid Recovering module with common MHA to enhance the computational efficiency of large vision models while preserving their performance, without re-training or fine-tuning their parameters. We conduct extensive experiments on recent high-resolution tasks, including zero-shot instance segmentation (SAM, Expedit-SAM), text-to-image generation (Stable Diffusion V2.1), and semantic segmentation (SegFormer B0-B5). The experiments demonstrate that, without any training or fine-tuning, GridAttn reduces GFLOPs by 4.6% to 16.1% and GPU inference latency by 8.2% to 21.4%, while achieving equivalent performance (the performance deviation is less than 1%). The experiments also show that GridAttn can be trained from scratch or fine-tuned at very low cost, resulting in a significantly improved performance-efficiency trade-off. We therefore encourage the community to incorporate GridAttn when deploying a well-trained transformer directly, fine-tuning a pre-trained one, or training a new one from scratch. The source code will be released at https://github.com/pengyuLPY/GridAttn.
Pages: 54-70 (17 pages)
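The abstract describes GridAttn's cluster-attend-recover pipeline only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the general idea: shrink the token set on a coarse spatial grid before a frozen, pre-trained MHA layer, then redistribute the result back to full resolution. The GridAttnWrapper class, the average-pooling used for "clustering", and the nearest-neighbor upsampling used for "recovering" are assumptions made here for illustration, not the authors' released implementation.

```python
# Hypothetical sketch (assumed, not the authors' released code): a plug-and-play
# wrapper that lowers the cost of a pre-trained Multi-Head Attention layer by
# (1) clustering tokens on a coarse spatial grid, (2) attending over the cluster
# tokens with the original frozen MHA, and (3) recovering a full-resolution output.
# The average-pool clustering and nearest-neighbor recovering are assumptions; the
# abstract only names the Grid Clustering / Distributing / Recovering modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GridAttnWrapper(nn.Module):
    def __init__(self, mha: nn.MultiheadAttention, grid_size: int = 2):
        super().__init__()
        self.mha = mha              # pre-trained attention layer (batch_first=True), weights untouched
        self.grid_size = grid_size  # each g x g cell of tokens becomes one cluster token

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) token sequence laid out on an H x W grid, with N = H * W
        B, N, C = x.shape
        H, W = hw
        g = self.grid_size
        assert H * W == N and H % g == 0 and W % g == 0

        # "Grid Clustering" (assumed): average-pool each g x g neighborhood into one token.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = F.avg_pool2d(feat, kernel_size=g, stride=g)    # (B, C, H/g, W/g)
        tokens = pooled.flatten(2).transpose(1, 2)               # (B, N/g^2, C)

        # Frozen MHA on the reduced token set: attention cost falls by roughly g^4.
        attn_out, _ = self.mha(tokens, tokens, tokens, need_weights=False)

        # "Grid Recovering" (assumed): broadcast each cluster output back to its g x g cell.
        out = attn_out.transpose(1, 2).reshape(B, C, H // g, W // g)
        out = F.interpolate(out, size=(H, W), mode="nearest")
        return out.flatten(2).transpose(1, 2)                    # (B, N, C)


if __name__ == "__main__":
    # Usage: wrap an existing attention layer without re-training it.
    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
    block = GridAttnWrapper(mha, grid_size=2)
    y = block(torch.randn(1, 64 * 64, 256), hw=(64, 64))
    print(y.shape)  # torch.Size([1, 4096, 256])
```

Under these assumptions the quadratic attention cost drops by roughly a factor of g^4, while the surrounding network (residual connections, FFNs, and all pre-trained weights) is left untouched, which is consistent with the training-free, plug-and-play claim in the abstract.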
Related Papers (showing 10 of 50)
  • [1] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    arXiv, 2024,
  • [2] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    Proceedings of Machine Learning Research, 2024, 235 : 62867 - 62891
  • [3] Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning
    Yuan, Yuhui
    Liang, Weicong
    Ding, Henghui
    Liang, Zhanhao
    Zhang, Chao
    Hu, Han
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (01) : 250 - 266
  • [4] Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
    Liang, Weicong
    Yuan, Yuhui
    Ding, Henghui
    Luo, Xiao
    Lin, Weihong
    Jia, Ding
    Zhang, Zheng
    Zhang, Chao
    Hu, Han
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [5] Phased Instruction Fine-Tuning for Large Language Models
    Pang, Wei
    Zhou, Chuan
    Zhou, Xiao-Hua
    Wang, Xiaojie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5735 - 5748
  • [6] HackMentor: Fine-Tuning Large Language Models for Cybersecurity
    Zhang, Jie
    Wen, Hui
    Deng, Liting
    Xin, Mingfeng
    Li, Zhi
    Li, Lun
    Zhu, Hongsong
    Sun, Limin
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 452 - 461
  • [7] PETALS: Collaborative Inference and Fine-tuning of Large Models
    Borzunov, Alexander
    Baranchuk, Dmitry
    Dettmers, Tim
    Ryabinin, Max
    Belkada, Younes
    Chumachenko, Artem
    Samygin, Pavel
    Raffel, Colin
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-DEMO 2023, VOL 3, 2023, : 558 - 568
  • [8] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
    Zhu, Beier
    Niu, Yulei
    Lee, Saeil
    Hur, Minhoe
    Zhang, Hanwang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
  • [9] Robust Fine-Tuning of Vision-Language Models for Domain Generalization
    Vogt-Lowell, Kevin
    Lee, Noah
    Tsiligkaridis, Theodoros
    Vaillant, Marc
    2023 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE, HPEC, 2023,
  • [10] Demystifying Instruction Mixing for Fine-tuning Large Language Models
    Wang, Renxi
    Li, Haonan
    Wu, Minghao
    Wang, Yuxia
    Han, Xudong
    Zhang, Chiyu
    Baldwin, Timothy
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024, : 86 - 93