Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Cited by: 0
Authors
Kong, Zhenglun [1 ]
Ma, Haoyu [2 ]
Yuan, Geng [1 ]
Sun, Mengshu [1 ]
Xie, Yanyue [1 ]
Dong, Peiyan [1 ]
Meng, Xin [3 ]
Shen, Xuan [1 ]
Tang, Hao [4 ]
Qin, Minghai [5 ]
Chen, Tianlong [6 ]
Ma, Xiaolong [7 ]
Xie, Xiaohui [2 ]
Wang, Zhangyang [6 ]
Wang, Yanzhi [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Univ Calif Irvine, Irvine, CA USA
[3] Peking Univ, Beijing, Peoples R China
[4] Swiss Fed Inst Technol, CVL, Zurich, Switzerland
[5] Western Digital Res, San Jose, CA USA
[6] Univ Texas Austin, Austin, TX USA
[7] Clemson Univ, Clemson, SC USA
Funding
US National Science Foundation;
Keywords
DOI
N/A
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Vision transformers (ViTs) have recently achieved success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so the time-consuming training remains unavoidable. In contrast, this paper points out that the million-scale training data is redundant, and that this redundancy is the fundamental cause of the tedious training. To address the issue, this paper introduces sparsity into the data and proposes an end-to-end efficient training framework built on three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme that explores sparsity at three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens encoded in the attention weights. With extensive experiments, we demonstrate that the proposed technique noticeably accelerates training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve ViT accuracy rather than compromise it. For example, we achieve a 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and a 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This demonstrates the existence of data redundancy in ViT training. Our code is released at https://github.com/ZLKong/Tri-Level-ViT
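The three sparsity levels described in the abstract are easiest to see in code. Below is a minimal PyTorch sketch of the hierarchy; it is not the authors' implementation (see the linked repository for that). The function names keep_examples, keep_tokens, and sparse_attention are hypothetical, and the importance criteria used here (an arbitrary per-example score, token L2 norms, top-k attention logits) are illustrative stand-ins for whatever measures Tri-Level E-ViT actually employs.

```python
import torch

def keep_examples(indices, scores, keep_ratio):
    """Level 1 (example sparsity): keep only the highest-scoring fraction
    of training examples. `scores` is a per-example importance value,
    e.g. recent loss -- an illustrative stand-in, not the paper's criterion."""
    k = max(1, int(len(indices) * keep_ratio))
    return indices[torch.topk(scores, k).indices]

def keep_tokens(tokens, keep_ratio):
    """Level 2 (token sparsity): keep the most salient patch tokens of each
    example, always preserving the class token at position 0.
    Saliency here is the token L2 norm (again a placeholder criterion).
    tokens: (batch, 1 + num_patches, dim)."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    k = max(1, int(patches.size(1) * keep_ratio))
    saliency = patches.norm(dim=-1)                       # (batch, num_patches)
    idx = torch.topk(saliency, k, dim=1).indices          # (batch, k)
    kept = torch.gather(patches, 1,
                        idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return torch.cat([cls_tok, kept], dim=1)

def sparse_attention(q, k, v, keep_ratio):
    """Level 3 (attention sparsity): drop weak query-key connections by
    keeping only the top-k logits per query before the softmax.
    Single head, no projections -- the structural idea only."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, n, n)
    n_keep = max(1, int(logits.size(-1) * keep_ratio))
    cutoff = torch.topk(logits, n_keep, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < cutoff, float("-inf"))
    return logits.softmax(dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    # Level 1: pick half of 8 toy examples by a random importance score.
    subset = keep_examples(torch.arange(8), torch.rand(8), keep_ratio=0.5)
    # Level 2: DeiT-T-sized tokens (196 patches, dim 192), keep 70%.
    tokens = keep_tokens(torch.randn(2, 197, 192), keep_ratio=0.7)
    # Level 3: self-attention over the kept tokens with 50% of connections.
    out = sparse_attention(tokens, tokens, tokens, keep_ratio=0.5)
    print(subset.tolist(), tokens.shape, out.shape)
```

Note how the levels nest: pruning examples reduces the number of training steps, pruning tokens shrinks every attention matrix (whose cost grows quadratically in sequence length), and the top-k mask removes the remaining weak token-to-token connections.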
Pages: 8360-8368
Page count: 9