Cascade multi-head attention networks for action recognition

Cited by: 21
Authors
Wang, Jiaze [1 ,2 ]
Peng, Xiaojiang [1 ,2 ]
Qiao, Yu [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, SIAT SenseTime Joint Lab, ShenZhen Key Lab Comp Vis & Pattern Recognit, Shenzhen, Peoples R China
[2] Shenzhen Inst Artificial Intelligence & Robot Soc, SIAT Branch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Cascade multi-head attention network; Feature aggregation; Visual analysis;
DOI
10.1016/j.cviu.2019.102898
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Long-term temporal information yields crucial cues for video action understanding. Previous research typically relies on sequential models such as recurrent networks, memory units, segmental models, or self-attention mechanisms to integrate local temporal features for long-term temporal modeling. Recurrent and memory networks record temporal patterns (or relations) in memory units, which have proved difficult for capturing long-term information in machine translation. Self-attention mechanisms directly aggregate all local information with attention weights, which is more straightforward and efficient than the former. However, the attention weights from self-attention ignore the relations between local and global information, which may lead to unreliable attention. To this end, we propose a new attention network architecture, termed the Cascade multi-head ATtention Network (CATNet), which constructs video representations with two levels of attention, namely multi-head local self-attention and relation-based global attention. Starting from the segment features generated by backbone networks, CATNet first learns multiple attention weights for each segment to capture the importance of local features in a self-attention manner. With the local attention weights, CATNet integrates local features into several global representations, and then learns a second-level attention over the global information in a relational manner. Extensive experiments on Kinetics, HMDB51, and UCF101 show that CATNet boosts the baseline network by a large margin. With only RGB information, we achieve 75.8%, 75.2%, and 96.0% on these three datasets, respectively, which is comparable or superior to the state of the art.
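The two-level scheme the abstract describes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random projection matrices, the number of heads, and the element-wise "relation cue" (relating each head's global vector to the mean over heads) are illustrative assumptions standing in for learned parameters and for whatever relation function CATNet actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascade_attention(segments, n_heads=4, seed=0):
    """Two-level (cascade) attention pooling, as a sketch.

    segments: (T, D) array of per-segment features from a backbone.
    Level 1: each head scores every segment (local self-attention)
             and pools the T segments into one global vector per head.
    Level 2: the head-level global vectors are themselves scored via a
             relation cue (global attention) and pooled into a single
             video-level representation.
    """
    rng = np.random.default_rng(seed)
    T, D = segments.shape
    # Hypothetical "learned" projections, random here for illustration.
    W_local = rng.standard_normal((n_heads, D))   # one scorer per head
    w_global = rng.standard_normal(D)             # scorer over heads

    # Level 1: per-head attention weights over the T segments.
    local_scores = segments @ W_local.T           # (T, n_heads)
    local_attn = softmax(local_scores, axis=0)    # normalize over time
    globals_ = local_attn.T @ segments            # (n_heads, D)

    # Level 2: relate each head's global vector to their mean
    # (assumed relation cue), then attend over the heads.
    relation = globals_ * globals_.mean(axis=0)   # (n_heads, D)
    global_attn = softmax(relation @ w_global)    # (n_heads,)
    return global_attn @ globals_                 # (D,) video feature

video = cascade_attention(
    np.random.default_rng(1).standard_normal((8, 16)))
print(video.shape)  # (16,)
```

The point of the cascade is that the second level weighs the heads against shared global context, rather than letting each head's local weights stand alone.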
Pages: 7
Related papers
50 records in total
  • [1] Self Multi-Head Attention for Speaker Recognition
    India, Miquel
    Safari, Pooyan
    Hernando, Javier
    INTERSPEECH 2019, 2019, : 4305 - 4309
  • [2] Combining Multi-Head Attention and Sparse Multi-Head Attention Networks for Session-Based Recommendation
    Zhao, Zhiwei
    Wang, Xiaoye
    Xiao, Yingyuan
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [3] Multi-head attention fusion networks for multi-modal speech emotion recognition
    Zhang, Junfeng
    Xing, Lining
    Tan, Zhen
    Wang, Hongsen
    Wang, Kesheng
    COMPUTERS & INDUSTRIAL ENGINEERING, 2022, 168
  • [4] Improving Multi-head Attention with Capsule Networks
    Gu, Shuhao
    Feng, Yang
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING (NLPCC 2019), PT I, 2019, 11838 : 314 - 326
  • [5] Multi-head attention-based two-stream EfficientNet for action recognition
    Zhou, Aihua
    Ma, Yujun
    Ji, Wanting
    Zong, Ming
    Yang, Pei
    Wu, Min
    Liu, Mingzhe
    MULTIMEDIA SYSTEMS, 2023, 29 (02) : 487 - 498
  • [6] Multi-head multi-order graph attention networks
    Ben, Jie
    Sun, Qiguo
    Liu, Keyu
    Yang, Xibei
    Zhang, Fengjun
    APPLIED INTELLIGENCE, 2024, 54 (17-18) : 8092 - 8107
  • [7] Acoustic Scene Analysis with Multi-head Attention Networks
    Wang, Weimin
    Wang, Weiran
    Sun, Ming
    Wang, Chao
    INTERSPEECH 2020, 2020, : 1191 - 1195
  • [8] Multi-head Attention Networks for Nonintrusive Load Monitoring
    Lin, Nan
    Zhou, Binggui
    Yang, Guanghua
    Ma, Shaodan
    2020 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (IEEE ICSPCC 2020), 2020,
  • [9] A fiber recognition framework based on multi-head attention mechanism
    Xu, Luoli
    Li, Fenying
    Chang, Shan
    TEXTILE RESEARCH JOURNAL, 2024, : 2629 - 2640