Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Cited: 0
Authors
Kong, Zhenglun [1 ]
Ma, Haoyu [2 ]
Yuan, Geng [1 ]
Sun, Mengshu [1 ]
Xie, Yanyue [1 ]
Dong, Peiyan [1 ]
Meng, Xin [3 ]
Shen, Xuan [1 ]
Tang, Hao [4 ]
Qin, Minghai [5 ]
Chen, Tianlong [6 ]
Ma, Xiaolong [7 ]
Xie, Xiaohui [2 ]
Wang, Zhangyang [6 ]
Wang, Yanzhi [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Univ Calif Irvine, Irvine, CA USA
[3] Peking Univ, Beijing, Peoples R China
[4] Swiss Fed Inst Technol, CVL, Zurich, Switzerland
[5] Western Digital Res, San Jose, CA USA
[6] Univ Texas Austin, Austin, TX USA
[7] Clemson Univ, Clemson, SC USA
Funding
National Science Foundation (USA);
Keywords
DOI
(none)
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so time-consuming training remains unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme by exploring the sparsity under three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens that lie in attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve ViT accuracy rather than compromise it. For example, we achieve a 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and a 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This proves the existence of data redundancy in ViT. Our code is released at https://github.com/ZLKong/Tri-Level-ViT
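The three-level scheme described in the abstract can be illustrated with a toy numpy sketch. This is not the authors' implementation (see the linked repository for that); the scoring heuristics here (token-norm importance, top-k attention thresholding) are stand-in assumptions chosen only to make the hierarchy concrete: prune examples, then patches per example, then attention connections per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 8 training examples, each with 16 patch tokens of dim 4.
tokens = rng.normal(size=(8, 16, 4))

# Level 1: example-level sparsity -- keep a subset of the training
# examples, scored here by a stand-in importance (mean token norm).
example_scores = np.linalg.norm(tokens, axis=-1).mean(axis=1)
keep = np.argsort(example_scores)[-6:]            # keep 6 of 8 examples
tokens = tokens[keep]

# Level 2: token (patch) sparsity -- keep the top-k patches per example.
token_scores = np.linalg.norm(tokens, axis=-1)    # (6, 16)
topk = np.argsort(token_scores, axis=1)[:, -12:]  # keep 12 of 16 tokens
tokens = np.take_along_axis(tokens, topk[..., None], axis=1)  # (6, 12, 4)

# Level 3: attention-connection sparsity -- in a toy attention matrix,
# zero out all but the k strongest connections per query token.
attn = tokens @ tokens.transpose(0, 2, 1)         # (6, 12, 12)
k = 4
row_thresh = np.sort(attn, axis=-1)[..., [-k]]    # k-th largest per row
attn_sparse = np.where(attn >= row_thresh, attn, 0.0)
```

Each level shrinks a different axis of the training workload: fewer examples per epoch, fewer tokens per forward pass, and fewer attention entries per layer.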
Pages: 8360-8368
Page count: 9
Related Papers
26 items in total
  • [21] Reduction of the size of the learning data in a probabilistic neural network by hierarchical clustering. Application to the discrimination of seeds by artificial vision
    Chtioui, Y
    Bertrand, D
    Barba, D
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 1996, 35 (02) : 175 - 186
  • [23] An Efficient Voxel-Based Segmentation Algorithm Based on Hierarchical Clustering to Extract LIDAR Power Equipment Data in Transformer Substations
    Guo, Jianlong
    Feng, Weixia
    Xue, Jiang
    Xiong, Shan
    Hao, Tengfei
    Li, Ruiheng
    Mao, Huben
    IEEE ACCESS, 2020, 8 : 227482 - 227496
  • [24] An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data-Index Decoupling Workflow
    Meng, Yishuo
    Yang, Chen
    Xiang, Siwei
    Wang, Jianfei
    Mei, Kuizhi
    Geng, Li
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2023, 31 (10) : 1537 - 1550
  • [26] Two-level energy-efficient data reduction strategies based on SAX-LZW and hierarchical clustering for minimizing the huge data conveyed on the internet of things networks
    Al-Qurabat, Ali Kadhum M.
    Abdulzahra, Suha Abdulhussein
    Idrees, Ali Kadhum
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (16): : 17844 - 17890