Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited by: 0
Authors
Kim, Hyun-Woo [1 ]
Choi, Yong-Suk [2 ]
Affiliations
[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea
[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea
Funding
National Research Foundation, Singapore
Keywords
action recognition; fusion attention; temporal redundancy
DOI
10.3390/s24216842
CLC Number
O65 [Analytical Chemistry]
Subject Classification Numbers
070302; 081704
Abstract
Conventional approaches to video action recognition apply global attention over all video patches, which can be ineffective because video frames are highly redundant in time. Recent work on masked video modeling adopts a high-ratio tube masking and reconstruction strategy for pre-training to mitigate the tendency of such models to capture spatial features well but temporal features poorly. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention so that global and sparse-dense attention are fused within a single layer; the sparse-dense heads attend within groups of tube-shaped patches to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier aggregates spatio-temporal features from the different patch groups to improve performance. Using weight parameters pre-trained with VideoMAE and MVD, the proposed method achieves higher accuracy (+0.1-0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Qualitative comparisons further show that FAR captures temporal features well in highly redundant video frames. FAR thus improves action recognition with efficient computation, and exploring its adaptability across different pre-training methods is an interesting direction for future research.
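To make the head-split idea in the abstract concrete, the sketch below splits the heads of a standard multi-head self-attention layer so that some heads attend globally over all video tokens while the remaining heads attend only within token groups, then concatenates the two sets of heads before the output projection. This is a minimal PyTorch sketch, not the authors' released implementation: the class name, the even head split, and the contiguous token grouping are illustrative assumptions (a faithful version would build each group as a temporal tube from the (T, H, W) patch grid, as the abstract describes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadSplitAttention(nn.Module):
    """Hypothetical head-split attention: global heads plus grouped heads."""
    def __init__(self, dim=768, num_heads=12, global_heads=6, num_groups=8):
        super().__init__()
        assert dim % num_heads == 0 and 0 < global_heads < num_heads
        self.h, self.hg, self.g = num_heads, global_heads, num_groups
        self.dh = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, C), N = T*H*W tokens
        B, N, C = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.h, self.dh).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]          # each: (B, h, N, dh)

        # Global heads: ordinary full self-attention over all N tokens.
        out_g = F.scaled_dot_product_attention(
            q[:, :self.hg], k[:, :self.hg], v[:, :self.hg])

        # Sparse-dense heads: attention restricted to token groups. A real
        # model would index the (T, H, W) patch grid so each group forms a
        # temporal tube; contiguous chunks of N are used here for brevity.
        hl = self.h - self.hg
        ql, kl, vl = (t[:, self.hg:].reshape(B, hl, self.g, N // self.g, self.dh)
                      for t in (q, k, v))
        out_l = F.scaled_dot_product_attention(ql, kl, vl)
        out_l = out_l.reshape(B, hl, N, self.dh)

        # Fuse the global and grouped heads along the head axis.
        out = torch.cat([out_g, out_l], dim=1)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

# Example: 8 frames x 14x14 patches = 1568 tokens at ViT-Base width.
attn = HeadSplitAttention()
tokens = torch.randn(2, 1568, 768)
print(attn(tokens).shape)                         # torch.Size([2, 1568, 768])
```

The grouped heads pay the attention cost of g smaller sequences of length N/g instead of one sequence of length N, which is where the computation savings over full global attention come from; only the global heads retain the quadratic cost in N.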
Pages: 18
Related Papers (50 in total; entries [41]-[50] shown below)
[41] Wang, Chien-Yao; Chiang, Chin-Chin; Ding, Jian-Jiun; Wang, Jia-Ching. Dynamic Tracking Attention Model for Action Recognition. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 1617-1621.
[42] Lu, Minlong; Liao, Danping; Li, Ze-Nian. Learning Spatiotemporal Attention for Egocentric Action Recognition. 2019 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019: 4425-4434.
[43] Li, Xiaoqiang; Xie, Miao; Zhang, Yin; Ding, Guangtai; Tong, Weiqin. Dual Attention Convolutional Network for Action Recognition. IET Image Processing, 2020, 14(6): 1059-1065.
[44] Sun, Dengdi; Wu, Hanqing; Ding, Zhuanlian; Luo, Bin; Tang, Jin. Spatial-Temporal Attention for Action Recognition. Advances in Multimedia Information Processing, Pt I, 2018, 11164: 854-864.
[45] Li, Jiapeng; Wei, Ping; Zheng, Nanning. Nesting Spatiotemporal Attention Networks for Action Recognition. Neurocomputing, 2021, 459: 338-348.
[46] Zhang, Hong-Bo; Pan, Wei-Xiang; Du, Ji-Xiang; Lei, Qing; Chen, Yan; Liu, Jing-Hua. Adversarial Attention Networks for Early Action Recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
[47] Yang, Zhengyuan; Li, Yuncheng; Yang, Jianchao; Luo, Jiebo. Action Recognition with Visual Attention on Skeleton Images. 2018 24th International Conference on Pattern Recognition (ICPR), 2018: 3309-3314.
[48] Liu, Zhikang; Tian, Ye; Wang, Zilei. Improving Human Action Recognition by Temporal Attention. 2017 24th IEEE International Conference on Image Processing (ICIP), 2017: 870-874.
[49] Wang, Yu; Chen, Xiaoqing; Li, Jiaoqun; Lu, Zengxiang. Convolutional Block Attention Module-Multimodal Feature-Fusion Action Recognition: Enabling Miner Unsafe Action Recognition. Sensors, 2024, 24(14).
[50] Sang, Haifeng; Zhao, Ziyu; He, Dakuo. Two-Level Attention Model Based Video Action Recognition Network. IEEE Access, 2019, 7: 118388-118401.