Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited: 0
Authors
Kim, Hyun-Woo [1 ]
Choi, Yong-Suk [2 ]
Affiliations
[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea
[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea
Funding
National Research Foundation, Singapore
Keywords
action recognition; fusion attention; temporal redundancy;
DOI
10.3390/s24216842
CLC Number
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Conventional approaches to video action recognition perform global attention over all video patches, which may be ineffective due to the temporal redundancy of video frames. Recent works on masked video modeling adopt high-ratio tube masking and reconstruction as a pre-training strategy to mitigate the problem that models capture spatial features well but temporal features poorly. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention to fuse global and sparse-dense attention; in the sparse-dense attention, patches are divided into tube-shaped groups to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier exploits spatio-temporal features from the different patch groups to improve performance. The proposed method uses weight parameters pre-trained with VideoMAE and MVD and achieves higher accuracy (+0.1-0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Moreover, qualitative comparisons show that FAR captures temporal features well even in highly redundant video frames. FAR thus demonstrates improved action recognition with efficient computation, and exploring its adaptability across different pre-training methods is an interesting direction for future research.
Pages: 18
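Since the abstract's core mechanism is the head split, a minimal PyTorch sketch of head-split sparse-dense attention (HSDA) may help: a subset of attention heads attends densely over all spatio-temporal patch tokens, while the remaining heads attend only within tube-shaped patch groups. The class name, the 50/50 head split, the group count, and the assumption that tokens are ordered so contiguous chunks form tubes are all illustrative assumptions, not the paper's exact configuration; the token-group interaction and group-averaged classifier are only hinted at in comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadSplitSparseDenseAttention(nn.Module):
    """Illustrative HSDA sketch: some heads attend globally, the
    remaining heads attend only within tube-shaped patch groups."""

    def __init__(self, dim=768, num_heads=12, global_heads=6, num_groups=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.global_heads = num_heads, global_heads
        self.num_groups = num_groups
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) flattened spatio-temporal patch tokens; this sketch
        # assumes N is divisible by num_groups and that tokens are ordered
        # so each contiguous chunk of N // num_groups tokens is one tube.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, H, N, d)

        # Global heads: dense attention over all N tokens.
        g = self.global_heads
        out_g = F.scaled_dot_product_attention(q[:, :g], k[:, :g], v[:, :g])

        # Sparse-dense heads: attention restricted to tube-shaped groups,
        # biasing these heads toward temporal features.
        G, n, h = self.num_groups, N // self.num_groups, self.num_heads - g
        def to_groups(t):                     # (B, h, N, d) -> (B*G, h, n, d)
            return (t.reshape(B, h, G, n, self.head_dim)
                     .permute(0, 2, 1, 3, 4)
                     .reshape(B * G, h, n, self.head_dim))
        out_s = F.scaled_dot_product_attention(to_groups(q[:, g:]),
                                               to_groups(k[:, g:]),
                                               to_groups(v[:, g:]))
        out_s = (out_s.reshape(B, G, h, n, self.head_dim)
                      .permute(0, 2, 1, 3, 4)
                      .reshape(B, h, N, self.head_dim))
        # (Token-group interaction would exchange information across the
        # G groups at this point, which this sketch omits.)

        # Fuse the two attention patterns by concatenating heads.
        out = torch.cat([out_g, out_s], dim=1)            # (B, H, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

# Example: 8 frames x 14 x 14 patches = 1568 tokens at ViT-Base width.
attn = HeadSplitSparseDenseAttention()
tokens = torch.randn(2, 1568, 768)
print(attn(tokens).shape)                   # torch.Size([2, 1568, 768])
```

A group-averaged classifier in this spirit would pool the output features per tube-shaped group, apply the classification head to each pooled vector, and average the resulting logits across groups, so the prediction draws on every group's spatio-temporal features.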