Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited by: 0
Authors
Kim, Hyun-Woo [1 ]
Choi, Yong-Suk [2 ]
Affiliations
[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea
[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea
Funding
National Research Foundation, Singapore;
Keywords
action recognition; fusion attention; temporal redundancy;
DOI
10.3390/s24216842
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline codes
070302; 081704;
Abstract
Conventional approaches to video action recognition apply global attention over all video patches, which can be ineffective due to the temporal redundancy of video frames. Recent work on masked video modeling adopts a high-ratio tube masking and reconstruction strategy as a pre-training method to mitigate the tendency of models to capture spatial features well while neglecting temporal features. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention to fuse global and sparse-dense attention; the sparse-dense attention divides patches into tube-shaped groups to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier combines spatio-temporal features from different patch groups to improve performance. The proposed method uses weights pre-trained with VideoMAE and MVD, and achieves higher accuracy (+0.1-0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Moreover, qualitative comparisons show that FAR captures temporal features well even in highly redundant video frames. FAR thus demonstrates improved action recognition with efficient computation, and exploring its adaptability across different pre-training methods is an interesting direction for future research.
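The head-split idea in the abstract can be illustrated with a toy sketch. The code below is NOT the paper's implementation: it uses identity Q/K/V projections and models "tube-shaped groups" simply as token indices modulo the group count (standing in for the same spatial slot across frames), both of which are simplifying assumptions. The first half of the attention heads attend globally over all tokens; the second half attend only within their group, and the head outputs are concatenated, as in standard multi-head self-attention.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(scores, v))
                    for c in range(d)])
    return out

def head_split_attention(x, num_heads, num_groups):
    """Toy HSDA-style fusion (illustrative only): the first half of the
    heads attend globally; the second half attend within tube-shaped
    token groups, modeled here as index modulo num_groups."""
    n, d = len(x), len(x[0])
    hd = d // num_heads  # per-head feature dimension
    fused = [[] for _ in range(n)]
    for h in range(num_heads):
        # identity Q/K/V projections for illustration (real models learn these)
        head_in = [row[h * hd:(h + 1) * hd] for row in x]
        if h < num_heads // 2:
            out = attention(head_in, head_in, head_in)  # global head
        else:
            out = [None] * n
            for g in range(num_groups):  # sparse-dense head: per-group attention
                idx = [i for i in range(n) if i % num_groups == g]
                grp = [head_in[i] for i in idx]
                for i, r in zip(idx, attention(grp, grp, grp)):
                    out[i] = r
        for i in range(n):  # concatenate head outputs channel-wise
            fused[i] = fused[i] + out[i]
    return fused
```

For example, `head_split_attention(x, num_heads=4, num_groups=2)` on 8 tokens of dimension 8 returns 8 fused tokens of dimension 8, where heads 0-1 mixed information globally and heads 2-3 mixed it only within each tube group.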
Pages: 18
Related papers
50 items total
  • [31] Metric-Based Attention Feature Learning for Video Action Recognition
    Kim, Dae Ha
    Anvarov, Fazliddin
    Lee, Jun Min
    Song, Byung Cheol
    IEEE ACCESS, 2021, 9 : 39218 - 39228
  • [32] Human Action Recognition Based on Improved Fusion Attention CNN and RNN
    Zhao, Han
    Jin, Xinyu
    2020 5TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND APPLICATIONS (ICCIA 2020), 2020, : 108 - 112
  • [33] Fusion Attention Graph Convolutional Network with Hyperskeleton for UAV Action Recognition
    Liu, Fang
    Huang, Sheng
    Dai, Qin
    Liu, Cuiwei
    Shi, Xiangbin
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XII, ICIC 2024, 2024, 14873 : 90 - 102
  • [34] A Video Action Recognition Method via Dual-Stream Feature Fusion Neural Network with Attention
    Han, Jianmin
    Li, Jie
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2024, 32 (04) : 673 - 694
  • [35] Dense Dilated Network for Video Action Recognition
    Xu, Baohan
    Ye, Hao
    Zheng, Yingbin
    Wang, Heng
    Luwang, Tianyu
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (10) : 4941 - 4953
  • [36] LGANet: Local and global attention are both you need for action recognition
    Wang, Hao
    Zhao, Bin
    Zhang, Wenjia
    Liu, Guohua
    IET IMAGE PROCESSING, 2023, 17 (12) : 3453 - 3463
  • [37] Global and Local Knowledge-Aware Attention Network for Action Recognition
    Zheng, Zhenxing
    An, Gaoyun
    Wu, Dapeng
    Ruan, Qiuqi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (01) : 334 - 347
  • [38] Algorithm for Skeleton Action Recognition by Integrating Attention Mechanism and Convolutional Neural Networks
    Liu, Jianhua
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (08) : 604 - 613
  • [39] Deep Attention Network for Egocentric Action Recognition
    Lu, Minlong
    Li, Ze-Nian
    Wang, Yueming
    Pan, Gang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (08) : 3703 - 3713
  • [40] Temporal Cross-Attention for Action Recognition
    Hashiguchi, Ryota
    Tamaki, Toru
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848 : 283 - 294