Grid-Attention: Enhancing Computational Efficiency of Large Vision Models Without Fine-Tuning

Cited: 0
Authors
Li, Pengyu [1]
Guo, Tianchu [1]
Wang, Biao [1]
Hua, Xian-Sheng [1]
Affiliation
[1] Terminus Labs, Terminus Group, Beijing, People's Republic of China
DOI
10.1007/978-3-031-72973-7_4
Chinese Library Classification
TP18 (Artificial Intelligence Theory)
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, transformer-based large vision models, e.g., the Segment Anything Model (SAM) and Stable Diffusion (SD), have achieved remarkable success in the computer vision field. However, the complexity of the transformer's Multi-Head Attention (MHA), which is quadratic in the number of tokens and therefore quartic in spatial resolution, leads to substantial computational costs in these models, whose inputs and outputs are high-resolution. Although several prior works have attempted to alleviate this challenge, none have reduced the complexity and latency of large vision models while preserving their capabilities without requiring enormous effort and GPU hours to re-train or fine-tune the models. To address this challenge, we propose a simple yet effective plug-and-play transformer block called Grid-Attention (GridAttn). GridAttn integrates the proposed Grid Clustering module, Grid Distributing strategies, and Grid Recovering module with common MHA to enhance the computational efficiency of large vision models while preserving their performance, without re-training or fine-tuning their parameters. We conduct extensive experiments on recent high-resolution tasks, including zero-shot instance segmentation (SAM, Expedit-SAM), text-to-image generation (Stable Diffusion V2.1), and semantic segmentation (SegFormer B0-B5). The experiments demonstrate that, without any training or fine-tuning, GridAttn reduces GFLOPs by 4.6% to 16.1% and GPU inference latency by 8.2% to 21.4%, while achieving equivalent performance (the performance deviation is less than 1%). The experiments also show that GridAttn can be trained from scratch or fine-tuned at very low cost, resulting in a significantly improved performance-efficiency trade-off. We therefore encourage the community to incorporate GridAttn when deploying a well-trained transformer directly, fine-tuning a pre-trained one, or training a new one from scratch. The source code will be released at https://github.com/pengyuLPY/GridAttn.
Pages: 54-70 (17 pages)
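The abstract describes GridAttn's cluster-attend-recover pipeline only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the general idea: shrink the token set on a coarse spatial grid before a frozen, pre-trained MHA layer, then redistribute the result back to full resolution. The GridAttnWrapper class, the average-pooling used for "clustering", and the nearest-neighbor upsampling used for "recovering" are assumptions made here for illustration, not the authors' released implementation.

```python
# Hypothetical sketch (assumed, not the authors' released code): a plug-and-play
# wrapper that lowers the cost of a pre-trained Multi-Head Attention layer by
# (1) clustering tokens on a coarse spatial grid, (2) attending over the cluster
# tokens with the original frozen MHA, and (3) recovering a full-resolution output.
# The average-pool clustering and nearest-neighbor recovering are assumptions; the
# abstract only names the Grid Clustering / Distributing / Recovering modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GridAttnWrapper(nn.Module):
    def __init__(self, mha: nn.MultiheadAttention, grid_size: int = 2):
        super().__init__()
        self.mha = mha              # pre-trained attention layer (batch_first=True), weights untouched
        self.grid_size = grid_size  # each g x g cell of tokens becomes one cluster token

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) token sequence laid out on an H x W grid, with N = H * W
        B, N, C = x.shape
        H, W = hw
        g = self.grid_size
        assert H * W == N and H % g == 0 and W % g == 0

        # "Grid Clustering" (assumed): average-pool each g x g neighborhood into one token.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = F.avg_pool2d(feat, kernel_size=g, stride=g)    # (B, C, H/g, W/g)
        tokens = pooled.flatten(2).transpose(1, 2)               # (B, N/g^2, C)

        # Frozen MHA on the reduced token set: attention cost falls by roughly g^4.
        attn_out, _ = self.mha(tokens, tokens, tokens, need_weights=False)

        # "Grid Recovering" (assumed): broadcast each cluster output back to its g x g cell.
        out = attn_out.transpose(1, 2).reshape(B, C, H // g, W // g)
        out = F.interpolate(out, size=(H, W), mode="nearest")
        return out.flatten(2).transpose(1, 2)                    # (B, N, C)


if __name__ == "__main__":
    # Usage: wrap an existing attention layer without re-training it.
    mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
    block = GridAttnWrapper(mha, grid_size=2)
    y = block(torch.randn(1, 64 * 64, 256), hw=(64, 64))
    print(y.shape)  # torch.Size([1, 4096, 256])
```

Under these assumptions the quadratic attention cost drops by roughly a factor of g^4, while the surrounding network (residual connections, FFNs, and all pre-trained weights) is left untouched, which is consistent with the training-free, plug-and-play claim in the abstract.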
Related Papers (showing 10 of 50)
  • [1] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    arXiv, 2024,
  • [2] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Zong, Yongshuo
    Bohdal, Ondrej
    Yu, Tingyang
    Yang, Yongxin
    Hospedales, Timothy
    Proceedings of Machine Learning Research, 2024, 235 : 62867 - 62891
  • [3] Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning
    Yuan, Yuhui
    Liang, Weicong
    Ding, Henghui
    Liang, Zhanhao
    Zhang, Chao
    Hu, Han
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (01) : 250 - 266
  • [4] Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
    Liang, Weicong
    Yuan, Yuhui
    Ding, Henghui
    Luo, Xiao
    Lin, Weihong
    Jia, Ding
    Zhang, Zheng
    Zhang, Chao
    Hu, Han
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [5] Phased Instruction Fine-Tuning for Large Language Models
    Pang, Wei
    Zhou, Chuan
    Zhou, Xiao-Hua
    Wang, Xiaojie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5735 - 5748
  • [6] HackMentor: Fine-Tuning Large Language Models for Cybersecurity
    Zhang, Jie
    Wen, Hui
    Deng, Liting
    Xin, Mingfeng
    Li, Zhi
    Li, Lun
    Zhu, Hongsong
    Sun, Limin
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 452 - 461
  • [7] PETALS: Collaborative Inference and Fine-tuning of Large Models
    Borzunov, Alexander
    Baranchuk, Dmitry
    Dettmers, Tim
    Ryabinin, Max
    Belkada, Younes
    Chumachenko, Artem
    Samygin, Pavel
    Raffel, Colin
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-DEMO 2023, VOL 3, 2023, : 558 - 568
  • [8] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
    Zhu, Beier
    Niu, Yulei
    Lee, Saeil
    Hur, Minhoe
    Zhang, Hanwang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
  • [9] Robust Fine-Tuning of Vision-Language Models for Domain Generalization
    Vogt-Lowell, Kevin
    Lee, Noah
    Tsiligkaridis, Theodoros
    Vaillant, Marc
    2023 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE, HPEC, 2023,
  • [10] Demystifying Instruction Mixing for Fine-tuning Large Language Models
    Wang, Renxi
    Li, Haonan
    Wu, Minghao
    Wang, Yuxia
    Han, Xudong
    Zhang, Chiyu
    Baldwin, Timothy
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024, : 86 - 93