Semantic-guided spatio-temporal attention for few-shot action recognition

Cited: 1
|
Authors
Wang, Jianyu [1 ]
Liu, Baolin [1 ]
Institution
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Few-shot action recognition; Semantic-guided attention mechanism; Multimodal learning; Sequence matching; Networks;
DOI
10.1007/s10489-024-05294-4
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, some works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite significant progress, these methods still fall short of outstanding performance because examples are insufficient and additional supplementary information is scarce. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention, so that the rich spatio-temporal context in videos is captured more accurately when visual content alone is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to each frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relation matching between support and query sequences. In this way, SGSTA can fully utilize the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
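The semantic-guided spatial attention step described in the abstract can be sketched as dot-product attention in which the label's text embedding acts as the query over each frame's spatial feature map. The following NumPy sketch is illustrative only: function names, shapes, and the scaled dot-product scoring are assumptions, not the paper's actual implementation, and it presumes visual features have already been aligned to the text embedding space.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_guided_spatial_attention(frame_feats, text_emb):
    """Weight spatial locations of each frame by relevance to the label.

    frame_feats: (T, HW, D) per-frame spatial features, assumed already
                 aligned to the text embedding space.
    text_emb:    (D,) text embedding of the class label.
    Returns:     (T, D) one semantically weighted feature per frame.
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product score of every spatial location against the label.
    scores = frame_feats @ text_emb / np.sqrt(d)        # (T, HW)
    attn = softmax(scores, axis=-1)                     # (T, HW)
    # Attention-weighted sum over spatial locations.
    return (attn[..., None] * frame_feats).sum(axis=1)  # (T, D)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 49, 64))  # 8 frames, 7x7 grid, 64-dim
label = rng.standard_normal(64)           # hypothetical label embedding
out = semantic_guided_spatial_attention(feats, label)
print(out.shape)  # (8, 64)
```

The per-frame outputs would then feed the temporal attention stage (a temporal Transformer in the paper) before support-query matching.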
Pages: 2458-2471 (14 pages)
Related papers
50 total
  • [21] Spatio-temporal Semantic Features for Human Action Recognition
    Liu, Jia
    Wang, Xiaonian
    Li, Tianyu
    Yang, Jie
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2012, 6 (10): 2632 - 2649
  • [22] Interpretable Spatio-temporal Attention for Video Action Recognition
    Meng, Lili
    Zhao, Bo
    Chang, Bo
    Huang, Gao
    Sun, Wei
    Tung, Frederich
    Sigal, Leonid
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1513 - 1522
  • [23] Spatio-Temporal Attention Networks for Action Recognition and Detection
    Li, Jun
    Liu, Xianglong
    Zhang, Wenxuan
    Zhang, Mingyuan
    Song, Jingkuan
    Sebe, Nicu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (11) : 2990 - 3001
  • [24] Hybrid Relation Guided Set Matching for Few-shot Action Recognition
    Wang, Xiang
    Zhang, Shiwei
    Qing, Zhiwu
    Tang, Mingqian
    Zuo, Zhengrong
    Gao, Changxin
    Jin, Rong
    Sang, Nong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19916 - 19925
  • [25] CLIP-guided Prototype Modulating for Few-shot Action Recognition
    Wang, Xiang
    Zhang, Shiwei
    Cen, Jun
    Gao, Changxin
    Zhang, Yingya
    Zhao, Deli
    Sang, Nong
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (06) : 1899 - 1912
  • [26] Semantic Prompt for Few-Shot Image Recognition
    Chen, Wentao
    Si, Chenyang
    Zhang, Zhang
    Wang, Liang
    Wang, Zilei
    Tan, Tieniu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23581 - 23591
  • [27] Knowledge-Guided Semantic Transfer Network for Few-Shot Image Recognition
    Li, Zechao
    Tang, Hao
    Peng, Zhimao
    Qi, Guo-Jun
    Tang, Jinhui
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, : 1 - 15
  • [28] Dynamic Temporal Shift Feature Enhancement for Few-Shot Action Recognition
    Li, Haibo
    Zhang, Bingbing
    Ma, Yuanchen
    Guo, Qiang
    Zhang, Jianxin
    Zhang, Qiang
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 471 - 484
  • [29] TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition
    Ben-Ari, Rami
    Nacson, Mor Shpigel
    Azulai, Ophir
    Barzelay, Udi
    Rotman, Daniel
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 2780 - 2788
  • [30] HyRSM++: Hybrid relation guided temporal set matching for few-shot action recognition
    Wang, Xiang
    Zhang, Shiwei
    Qing, Zhiwu
    Zuo, Zhengrong
    Gao, Changxin
    Jin, Rong
    Sang, Nong
    PATTERN RECOGNITION, 2024, 147