How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1 ]
Wang, Junpeng [2 ]
Dai, Xin [2 ]
Wang, Liang [2 ]
Yeh, Chin-Chia Michael [2 ]
Zheng, Yan [2 ]
Zhang, Wei [2 ]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics;
DOI
10.1109/TVCG.2023.3261935
Chinese Library Classification (CLC)
TP31 [Computer software];
Discipline codes
081202 ; 0835 ;
Abstract
The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
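The pipeline sketched in the abstract (patchify an image, compute per-head attention maps, then measure how strongly each patch attends to its spatial neighbors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection weights are random stand-ins for learned parameters, and the `neighbor_attention_strength` helper is a hypothetical simplification of the paper's spatial attention-strength profiling, using 4-connected neighbors on the patch grid.

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattening each patch into a token vector.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)  # (num_patches, p*p*C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_maps(tokens, num_heads, d_head, rng):
    # Random Q/K projections stand in for learned weights (illustration only).
    n, d = tokens.shape
    maps = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Q, K = tokens @ Wq, tokens @ Wk
        maps.append(softmax(Q @ K.T / np.sqrt(d_head)))
    return np.stack(maps)  # (num_heads, n, n); each row sums to 1

def neighbor_attention_strength(attn, grid):
    # Mean attention a patch pays to its 4-connected spatial neighbors.
    # A high value suggests the head focuses on local structure.
    n = grid * grid
    total, count = 0.0, 0
    for i in range(n):
        r, c = divmod(i, grid)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid and 0 <= cc < grid:
                total += attn[i, rr * grid + cc]
                count += 1
    return total / count

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
tokens = patchify(img, 8)  # 4x4 grid of 8x8 patches -> 16 tokens
attn = multi_head_attention_maps(tokens, num_heads=4, d_head=16, rng=rng)
for h in range(attn.shape[0]):
    print(f"head {h}: neighbor strength "
          f"{neighbor_attention_strength(attn[h], grid=4):.4f}")
```

Comparing this per-head neighbor strength across layers gives a crude version of the paper's second analysis: heads whose strength stays high attend locally, while low values indicate more global attention.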
Pages: 2888-2900
Page count: 13