Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited by: 0
Authors
Kim, Hyun-Woo [1 ]
Choi, Yong-Suk [2 ]
Affiliations
[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea
[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea
Funding
National Research Foundation of Singapore
Keywords
action recognition; fusion attention; temporal redundancy
DOI
10.3390/s24216842
CLC classification
O65 [Analytical Chemistry]
Subject classification
070302; 081704
Abstract
Conventional approaches to video action recognition perform global attention over all video patches, which can be ineffective because of the temporal redundancy of video frames. Recent work on masked video modeling adopts a high-ratio tube masking and reconstruction strategy as a pre-training method to mitigate the problem that models capture spatial features well but temporal features poorly. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention to fuse global and sparse-dense attention; the sparse-dense attention divides the patches into groups of tube-shaped patches to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier uses spatio-temporal features from the different patch groups to improve performance. The proposed method uses weight parameters pre-trained with VideoMAE and MVD, and achieves higher accuracy (+0.1-0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Qualitative comparisons further show that FAR captures temporal features well in highly redundant video frames. FAR thus demonstrates improved action recognition with efficient computation, and exploring its adaptability across different pre-training methods is an interesting direction for future research.
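To make the head-split idea concrete, the sketch below renders HSDA as described in the abstract. It is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes an even 50/50 split of the attention heads between the global (dense) branch and the tube-grouped (sparse) branch, and it assumes tokens are pre-ordered so that each contiguous chunk of N/num_groups tokens forms one tube-shaped group. The class name and the num_groups parameter are illustrative assumptions, and the token-group interaction and group-averaged classifier components are omitted.

```python
# Minimal sketch of head-split sparse-dense attention (HSDA), assuming:
# half of the heads attend globally over all spatio-temporal tokens,
# the other half attend only within tube-shaped token groups.
# This is an illustration of the idea, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadSplitSparseDenseAttention(nn.Module):
    def __init__(self, dim, num_heads, num_groups):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.num_groups = num_groups  # hypothetical: number of tube groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) flattened spatio-temporal tokens; N must be divisible
        # by num_groups, and tokens are assumed pre-ordered by tube group.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, H, N, d)
        h = self.num_heads // 2

        # Dense branch: standard global attention on the first half of heads.
        dense = F.scaled_dot_product_attention(q[:, :h], k[:, :h], v[:, :h])

        # Sparse branch: restrict attention to tube-shaped token groups by
        # folding the group axis into the batch axis.
        G, n = self.num_groups, N // self.num_groups

        def group(t):  # (B, h, N, d) -> (B*G, h, n, d)
            return (t.reshape(B, h, G, n, self.head_dim)
                     .permute(0, 2, 1, 3, 4)
                     .reshape(B * G, h, n, self.head_dim))

        sparse = F.scaled_dot_product_attention(
            group(q[:, h:]), group(k[:, h:]), group(v[:, h:]))
        sparse = (sparse.reshape(B, G, h, n, self.head_dim)
                        .permute(0, 2, 1, 3, 4)
                        .reshape(B, h, N, self.head_dim))

        # Fuse the two head groups back into one token representation.
        out = torch.cat([dense, sparse], dim=1)  # (B, H, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Folding the group axis into the batch axis lets both branches reuse the same attention kernel; under this sketch, the sparse branch's cost scales with the group size n rather than the full token count N, which is consistent with the abstract's claim of higher accuracy at lower computation.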
Pages: 18