How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1 ]
Wang, Junpeng [2 ]
Dai, Xin [2 ]
Wang, Liang [2 ]
Yeh, Chin-Chia Michael [2 ]
Zheng, Yan [2 ]
Zhang, Wei [2 ]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics
DOI
10.1109/TVCG.2023.3261935
CLC number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to interpreting ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify the more important heads in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
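The pipeline the abstract describes (patchify an image into a token sequence, apply multi-head self-attention between patches, then score heads with a pruning-based metric) can be illustrated with a minimal sketch. The code below is an illustration under stated assumptions, not the paper's implementation: random projections stand in for learned Q/K/V weights, and head_importance is one plausible pruning-style proxy (output distortion when a head is zeroed out), not the specific metrics the paper introduces.

    import numpy as np

    rng = np.random.default_rng(0)

    def patchify(image, patch=16):
        """Split an (H, W, C) image into a sequence of flattened patches."""
        H, W, C = image.shape
        rows, cols = H // patch, W // patch
        grid = image[:rows * patch, :cols * patch].reshape(
            rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
        return grid.reshape(rows * cols, patch * patch * C)  # (N tokens, D)

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(tokens, n_heads=4):
        """Return per-head attention maps (n_heads, N, N) and the
        concatenated head outputs (N, D)."""
        N, D = tokens.shape
        d_h = D // n_heads
        maps, outs = [], []
        for _ in range(n_heads):
            # Random projections stand in for learned Q/K/V weights.
            Wq, Wk, Wv = (rng.standard_normal((D, d_h)) / np.sqrt(D)
                          for _ in range(3))
            Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
            A = softmax(Q @ K.T / np.sqrt(d_h))  # attention between patches
            maps.append(A)
            outs.append(A @ V)
        return np.stack(maps), np.concatenate(outs, axis=1)

    def head_importance(head_outputs, n_heads):
        """Pruning-style proxy: how much the layer output changes
        when one head's slice of the concatenation is zeroed out."""
        N, D = head_outputs.shape
        d_h = D // n_heads
        scores = []
        for h in range(n_heads):
            pruned = head_outputs.copy()
            pruned[:, h * d_h:(h + 1) * d_h] = 0.0
            scores.append(np.linalg.norm(head_outputs - pruned))
        return np.array(scores)

    image = rng.random((224, 224, 3))   # toy image
    tokens = patchify(image)            # (196, 768) with 16x16 patches
    attn_maps, outputs = multi_head_self_attention(tokens)
    print(head_importance(outputs, n_heads=4))

The per-head maps in attn_maps are what the paper's visual analytics then profile: the spatial distribution of attention strengths within a head comes from inspecting each row of A against the patch grid, and the paper additionally summarizes such maps with an autoencoder, a step omitted from this sketch.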
Pages: 2888-2900
Page count: 13