How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Cited by: 9
Authors
Li, Yiran [1 ]
Wang, Junpeng [2 ]
Dai, Xin [2 ]
Wang, Liang [2 ]
Yeh, Chin-Chia Michael [2 ]
Zheng, Yan [2 ]
Zhang, Wei [2 ]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] Visa Res, Palo Alto, CA 94301 USA
Keywords
Head; Transformers; Visual analytics; Task analysis; Measurement; Heating systems; Deep learning; explainable artificial intelligence; multi-head self-attention; vision transformer; visual analytics;
DOI
10.1109/TVCG.2023.3261935
CLC Classification Number
TP31 [Computer Software];
Discipline Codes
081202; 0835
Abstract
Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which one is more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
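To make the pipeline described in the abstract concrete, the sketch below shows, in PyTorch, how an image can be split into a patch sequence and how multi-head self-attention produces the per-head patch-to-patch attention maps that this kind of analysis inspects. It is an illustrative sketch under simplifying assumptions, not the authors' implementation: the names `patchify` and `MultiHeadSelfAttention` are hypothetical, and a real ViT would additionally apply a learned patch projection, prepend a class token, and add positional embeddings.

```python
# Minimal sketch (not the paper's code) of the two steps named in the abstract:
# split an image into a patch sequence, then apply multi-head self-attention so
# that each head yields its own patch-to-patch attention map.
import torch
import torch.nn as nn


def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into a sequence of flattened patches (B, N, C*p*p)."""
    B, C, H, W = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
    return patches


class MultiHeadSelfAttention(nn.Module):
    """One multi-head self-attention block; also returns the per-head attention maps."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)                            # (B, heads, N, N) attention per head
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out), attn


# Toy usage: 224x224 RGB images, 16x16 patches -> 196 patches of dimension 768.
imgs = torch.randn(2, 3, 224, 224)
tokens = patchify(imgs)                                        # (2, 196, 768)
mhsa = MultiHeadSelfAttention(dim=tokens.shape[-1], num_heads=12)
_, attn_maps = mhsa(tokens)                                    # (2, 12, 196, 196)
```

From `attn_maps`, one could, for instance, average how much each patch attends to its spatial neighbors to profile per-head attention strength, or zero out one head's output and measure the change in the model's prediction as a rough pruning-style importance proxy; the metrics actually used in the paper are defined in its full text.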
Pages: 2888-2900
Page count: 13
Related Papers
(50 records in total)
  • [31] RAWAtten: Reconfigurable Accelerator for Window Attention in Hierarchical Vision Transformers
    Li, Wantong
    Luo, Yandong
    Yu, Shimeng
    2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2023,
  • [32] Focal Attention for Long-Range Interactions in Vision Transformers
    Yang, Jianwei
    Li, Chunyuan
    Zhang, Pengchuan
    Dai, Xiyang
    Xiao, Bin
    Yuan, Lu
    Gao, Jianfeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [33] Visual field and road traffic. How does peripheral vision function?
    Lachenmayr, B
    OPHTHALMOLOGE, 2006, 103 (05): 373+
  • [34] Tutorial: How Does Your HMI Design Affect the Visual Attention of the Driver
    Feuerstack, Sebastian
    Wortelen, Bertram
    AUTOMOTIVEUI'17: PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON AUTOMOTIVE USER INTERFACES AND INTERACTIVE VEHICULAR APPLICATIONS, 2017, : 28 - 32
  • [35] How Does Visual Attention Differ Between Experts and Novices on Physics Problems?
    Carmichael, Adrian
    Larson, Adam
    Gire, Elizabeth
    Loschky, Lester
    Rebello, N. Sanjay
    2010 PHYSICS EDUCATION RESEARCH CONFERENCE, 2010, 1289: 93+
  • [36] VITALITY: Promoting Serendipitous Discovery of Academic Literature with Transformers & Visual Analytics
    Narechania, Arpit
    Karduni, Alireza
    Wesslen, Ryan
    Wall, Emily
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2022, 28 (01) : 486 - 496
  • [37] Vision and attention. I: Current models of visual attention
    Steinman, SB
    Steinman, BA
    OPTOMETRY AND VISION SCIENCE, 1998, 75 (02) : 146 - 155
  • [39] How RF transformers work - Part 1
    [Anonymous]
    ELECTRONIC ENGINEERING, 1998, 70 (863): 13+
  • [40] How does it work?
    Hyde, T. P.
    BRITISH DENTAL JOURNAL, 2006, 200 (09) : 477 - 477