Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Cited by: 0
Authors
Kong, Zhenglun [1 ]
Ma, Haoyu [2 ]
Yuan, Geng [1 ]
Sun, Mengshu [1 ]
Xie, Yanyue [1 ]
Dong, Peiyan [1 ]
Meng, Xin [3 ]
Shen, Xuan [1 ]
Tang, Hao [4 ]
Qin, Minghai [5 ]
Chen, Tianlong [6 ]
Ma, Xiaolong [7 ]
Xie, Xiaohui [2 ]
Wang, Zhangyang [6 ]
Wang, Yanzhi [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Univ Calif Irvine, Irvine, CA USA
[3] Peking Univ, Beijing, Peoples R China
[4] Swiss Fed Inst Technol, CVL, Zurich, Switzerland
[5] Western Digital Res, San Jose, CA USA
[6] Univ Texas Austin, Austin, TX USA
[7] Clemson Univ, Clemson, SC USA
Funding
US National Science Foundation;
Keywords
DOI
N/A
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Vision transformers (ViTs) have recently achieved success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so the time-consuming training remains unavoidable. In contrast, this paper points out that the million-scale training data is redundant, and that this redundancy is the fundamental cause of the tedious training. To address the issue, this paper introduces sparsity into the data and proposes an end-to-end efficient training framework built on three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme that explores sparsity at three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens encoded in the attention weights. With extensive experiments, we demonstrate that the proposed technique noticeably accelerates training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve ViT accuracy rather than compromise it. For example, we achieve a 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and a 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This demonstrates the existence of data redundancy in ViT training. Our code is released at https://github.com/ZLKong/Tri-Level-ViT
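The three sparsity levels described in the abstract are easiest to see in code. Below is a minimal PyTorch sketch of the hierarchy; it is not the authors' implementation (see the linked repository for that). The function names keep_examples, keep_tokens, and sparse_attention are hypothetical, and the importance criteria used here (an arbitrary per-example score, token L2 norms, top-k attention logits) are illustrative stand-ins for whatever measures Tri-Level E-ViT actually employs.

```python
import torch

def keep_examples(indices, scores, keep_ratio):
    """Level 1 (example sparsity): keep only the highest-scoring fraction
    of training examples. `scores` is a per-example importance value,
    e.g. recent loss -- an illustrative stand-in, not the paper's criterion."""
    k = max(1, int(len(indices) * keep_ratio))
    return indices[torch.topk(scores, k).indices]

def keep_tokens(tokens, keep_ratio):
    """Level 2 (token sparsity): keep the most salient patch tokens of each
    example, always preserving the class token at position 0.
    Saliency here is the token L2 norm (again a placeholder criterion).
    tokens: (batch, 1 + num_patches, dim)."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    k = max(1, int(patches.size(1) * keep_ratio))
    saliency = patches.norm(dim=-1)                       # (batch, num_patches)
    idx = torch.topk(saliency, k, dim=1).indices          # (batch, k)
    kept = torch.gather(patches, 1,
                        idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return torch.cat([cls_tok, kept], dim=1)

def sparse_attention(q, k, v, keep_ratio):
    """Level 3 (attention sparsity): drop weak query-key connections by
    keeping only the top-k logits per query before the softmax.
    Single head, no projections -- the structural idea only."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, n, n)
    n_keep = max(1, int(logits.size(-1) * keep_ratio))
    cutoff = torch.topk(logits, n_keep, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < cutoff, float("-inf"))
    return logits.softmax(dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    # Level 1: pick half of 8 toy examples by a random importance score.
    subset = keep_examples(torch.arange(8), torch.rand(8), keep_ratio=0.5)
    # Level 2: DeiT-T-sized tokens (196 patches, dim 192), keep 70%.
    tokens = keep_tokens(torch.randn(2, 197, 192), keep_ratio=0.7)
    # Level 3: self-attention over the kept tokens with 50% of connections.
    out = sparse_attention(tokens, tokens, tokens, keep_ratio=0.5)
    print(subset.tolist(), tokens.shape, out.shape)
```

Note how the levels nest: pruning examples reduces the number of training steps, pruning tokens shrinks every attention matrix (whose cost grows quadratically in sequence length), and the top-k mask removes the remaining weak token-to-token connections.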
Pages: 8360-8368
Page count: 9