ReViT: Enhancing vision transformers feature diversity with attention residual connections

Cited: 3
Authors
Diko, Anxhelo [1 ]
Avola, Danilo [1 ]
Cascio, Marco [1 ,2 ]
Cinque, Luigi [1 ]
Institutions
[1] Sapienza Univ Rome, Dept Comp Sci, Via Salaria 113, I-00198 Rome, Italy
[2] Univ Rome UnitelmaSapienza, Dept Law & Econ, Piazza Sassari 4, I-00161 Rome, Italy
Keywords
Vision transformer; Feature collapse; Self-attention mechanism; Residual attention learning; Visual recognition;
DOI
10.1016/j.patcog.2024.110853
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Vision Transformer (ViT) self-attention mechanism suffers from feature collapse in deeper layers, causing low-level visual features to vanish. However, such features can help to accurately represent and identify elements within an image, increasing the accuracy and robustness of vision-based recognition systems. Following this rationale, we propose a novel residual attention learning method for improving ViT-based architectures, increasing their visual feature diversity and model robustness. In this way, the proposed network can capture and preserve significant low-level features, providing more details about the elements within the scene being analyzed. The effectiveness and robustness of the presented method are evaluated on five image classification benchmarks, including ImageNet1k, CIFAR10, CIFAR100, Oxford Flowers-102, and Oxford-IIIT Pet, achieving improved performance. Additionally, experiments on the COCO2017 dataset show that the devised approach discovers and incorporates semantic and spatial relationships for object detection and instance segmentation when implemented into spatial-aware transformer models.
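The abstract describes adding residual connections across self-attention maps so that low-level attention patterns from earlier layers persist into deeper ones. A minimal single-head sketch of this general idea follows; the mixing coefficient `alpha`, the function names, and the convex-combination form are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_attn=None, alpha=0.5):
    """Single-head scaled dot-product attention where the attention map
    from the previous layer is mixed in via a residual connection, so
    earlier (lower-level) attention patterns propagate to deeper layers.
    NOTE: alpha and the mixing form are assumptions for illustration."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    if prev_attn is not None:
        # Residual mix of current and previous attention maps; a convex
        # combination keeps each row a valid probability distribution.
        attn = alpha * attn + (1.0 - alpha) * prev_attn
    return attn @ v, attn

# Toy usage: two stacked layers sharing the attention residual.
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
out1, a1 = residual_attention(x, x, x)                # layer 1: no residual
out2, a2 = residual_attention(out1, out1, out1, a1)   # layer 2: reuses a1
```

Because both attention maps are row-stochastic, their convex combination remains a valid attention distribution while retaining the earlier layer's pattern.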
Pages: 13
Related Papers
50 records
  • [1] Vision Transformers with Hierarchical Attention
    Liu, Yun
    Wu, Yu-Huan
    Sun, Guolei
    Zhang, Le
    Chhatkuli, Ajad
    Van Gool, Luc
    MACHINE INTELLIGENCE RESEARCH, 2024, 21 (04) : 670 - 683
  • [2] Constituent Attention for Vision Transformers
    Li, Haoling
    Xue, Mengqi
    Song, Jie
    Zhang, Haofei
    Huang, Wenqi
    Liang, Lingyu
    Song, Mingli
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [3] Robustifying Token Attention for Vision Transformers
    Guo, Yong
    Stutz, David
    Schiele, Bernt
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 17511 - 17522
  • [4] Efficient Vision Transformers with Partial Attention
    Vo, Xuan-Thuy
    Nguyen, Duy-Linh
    Priadana, Adri
    Jo, Kang-Hyun
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 298 - 317
  • [5] Fast Vision Transformers with HiLo Attention
    Pan, Zizheng
    Cai, Jianfei
    Zhuang, Bohan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [6] DaViT: Dual Attention Vision Transformers
    Ding, Mingyu
    Xiao, Bin
    Codella, Noel
    Luo, Ping
    Wang, Jingdong
    Yuan, Lu
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 74 - 92
  • [7] AttnZero: Efficient Attention Discovery for Vision Transformers
    Li, Lujun
    Wei, Zimian
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 20 - 37
  • [8] Rethinking the Self-Attention in Vision Transformers
    Kim, Kyungmin
    Wu, Bichen
    Dai, Xiaoliang
    Zhang, Peizhao
    Yan, Zhicheng
    Vajda, Peter
    Kim, Seon
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 3065 - 3069
  • [9] Multi-Manifold Attention for Vision Transformers
    Konstantinidis, Dimitrios
    Papastratis, Ilias
    Dimitropoulos, Kosmas
    Daras, Petros
    IEEE ACCESS, 2023, 11 : 123433 - 123444
  • [10] KVT: k-NN Attention for Boosting Vision Transformers
    Wang, Pichao
    Wang, Xue
    Wang, Fan
    Lin, Ming
    Chang, Shuning
    Li, Hao
    Jin, Rong
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 285 - 302