How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1 ]
Wang, Junpeng [2 ]
Dai, Xin [2 ]
Wang, Liang [2 ]
Yeh, Chin-Chia Michael [2 ]
Zheng, Yan [2 ]
Zhang, Wei [2 ]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics
DOI
10.1109/TVCG.2023.3261935
CLC Number
TP31 [Computer Software]
Subject Classification Codes
081202; 0835
Abstract
The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. Examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
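The analysis the abstract describes rests on two measurable quantities: the per-head attention weights between patches, and how far across the patch grid each head attends. The following minimal PyTorch sketch (not the authors' implementation; the function names, the random untrained Q/K projections, and the 8x8 patch grid are illustrative assumptions) shows how such quantities can be computed from a patch sequence. The paper's pruning-based head-importance metrics and autoencoder-based pattern summarization operate on attention maps of exactly this shape.

```python
import torch

def multi_head_attention_maps(x, num_heads):
    """Per-head self-attention weights over a patch sequence.

    x: (batch, num_patches, dim) patch embeddings (CLS token omitted
    for simplicity). Returns (batch, num_heads, num_patches, num_patches).
    The Q/K projections below are random stand-ins for trained weights.
    """
    b, n, d = x.shape
    head_dim = d // num_heads
    wq = torch.randn(d, d) / d ** 0.5   # hypothetical query projection
    wk = torch.randn(d, d) / d ** 0.5   # hypothetical key projection
    q = (x @ wq).view(b, n, num_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, n, num_heads, head_dim).transpose(1, 2)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return scores.softmax(dim=-1)       # each query row sums to 1

def mean_attention_distance(attn, grid):
    """Average spatial distance (in patch units) each head attends over.

    attn: (batch, heads, n, n) with n == grid * grid. A small value
    means a head mostly attends to its spatial neighbors; a large value
    means it aggregates context globally.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid),
                            indexing="ij")
    coords = torch.stack((ys, xs), dim=-1).reshape(-1, 2).float()  # (n, 2)
    dist = torch.cdist(coords, coords)                             # (n, n)
    # Weight each pairwise patch distance by its attention strength,
    # then average over the batch and all query patches: one value per head.
    return (attn * dist).sum(dim=-1).mean(dim=(0, 2))

# Usage: an 8x8 patch grid (64 patches), 128-dim embeddings, 4 heads.
x = torch.randn(1, 64, 128)
attn = multi_head_attention_maps(x, num_heads=4)
print(mean_attention_distance(attn, grid=8))  # 4 per-head mean distances
```

A head whose mean attention distance stays small behaves like a local, convolution-like operator, while a large value indicates global context aggregation; profiling this value per head and per layer is one way to realize the spatial summary of attention strengths that the abstract describes.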
Pages: 2888-2900
Page count: 13
Related Papers
50 records in total
  • [1] Vision Transformers with Hierarchical Attention
    Liu, Yun
    Wu, Yu-Huan
    Sun, Guolei
    Zhang, Le
    Chhatkuli, Ajad
    Van Gool, Luc
    MACHINE INTELLIGENCE RESEARCH, 2024, 21 (04) : 670 - 683
  • [2] Constituent Attention for Vision Transformers
    Li, Haoling
    Xue, Mengqi
    Song, Jie
    Zhang, Haofei
    Huang, Wenqi
    Liang, Lingyu
    Song, Mingli
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [3] Robustifying Token Attention for Vision Transformers
    Guo, Yong
    Stutz, David
    Schiele, Bernt
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 17511 - 17522
  • [4] How does the cerebral cortex work? Learning, attention, and grouping by the laminar circuits of visual cortex
    Grossberg, S
    SPATIAL VISION, 1999, 12 (02): : 163 - 185
  • [5] Efficient Vision Transformers with Partial Attention
    Vo, Xuan-Thuy
    Nguyen, Duy-Linh
    Priadana, Adri
    Jo, Kang-Hyun
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 298 - 317
  • [6] Fast Vision Transformers with HiLo Attention
    Pan, Zizheng
    Cai, Jianfei
    Zhuang, Bohan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [7] DaViT: Dual Attention Vision Transformers
    Ding, Mingyu
    Xiao, Bin
    Codella, Noel
    Luo, Ping
    Wang, Jingdong
    Yuan, Lu
    COMPUTER VISION, ECCV 2022, PT XXIV, 2022, 13684 : 74 - 92
  • [8] Prioritization in Visual Attention Does Not Work the Way You Think It Does
    Ng, Gavin J. P.
    Buetti, Simona
    Patel, Trisha N.
    Lleras, Alejandro
    JOURNAL OF EXPERIMENTAL PSYCHOLOGY-HUMAN PERCEPTION AND PERFORMANCE, 2021, 47 (02) : 252 - 268
  • [9] How Does Fearful Emotion Affect Visual Attention?
    Shang, Zhe
    Wang, Yingying
    Bi, Taiyong
    FRONTIERS IN PSYCHOLOGY, 2021, 11
  • [10] AttnZero: Efficient Attention Discovery for Vision Transformers
    Li, Lujun
    Wei, Zimian
    Dong, Peijie
    Luo, Wenhan
    Xue, Wei
    Liu, Qifeng
    Guo, Yike
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 20 - 37