Semantic-guided spatio-temporal attention for few-shot action recognition

Times Cited: 1
Authors
Wang, Jianyu [1 ]
Liu, Baolin [1 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Few-shot action recognition; Semantic-guided attention mechanism; Multimodal learning; Sequence matching; Networks
DOI
10.1007/s10489-024-05294-4
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, several works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite significant progress, these methods still fall short of outstanding performance because labeled examples are insufficient and little supplementary information is available. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention, so that the rich spatio-temporal context in videos is captured more accurately when visual content alone is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to each frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA fully utilizes the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
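To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the stated flow: project both modalities into a shared space, apply label-guided spatial attention per frame, encode temporal context with a Transformer whose input includes the text embedding, and score support against query sequences. All names, dimensions, the prepended text token, and the frame-wise cosine matching head are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGSTASketch(nn.Module):
    """Illustrative sketch of the pipeline described in the abstract:
    (1) visual-text alignment, (2) semantic-guided spatial attention,
    (3) semantic-guided temporal attention via a temporal Transformer,
    (4) support-query sequence matching. Dimensions are assumptions."""

    def __init__(self, vis_dim=512, txt_dim=512, d_model=512, n_heads=8):
        super().__init__()
        # (1) Alignment: project both modalities into a shared space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # (3) Temporal Transformer over the semantically enhanced frames.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)

    @staticmethod
    def spatial_attend(frame_maps, text):
        # frame_maps: (T, HW, D) per-frame spatial features; text: (D,).
        # (2) Weight spatial regions by similarity to the label embedding,
        # then pool each frame into a single semantically focused vector.
        scores = torch.einsum('thd,d->th', frame_maps, text)  # (T, HW)
        attn = scores.softmax(dim=-1)
        return torch.einsum('th,thd->td', attn, frame_maps)   # (T, D)

    def encode(self, frame_maps, text_emb):
        v = self.vis_proj(frame_maps)              # aligned visual features
        t = self.txt_proj(text_emb)                # aligned text embedding
        frames = self.spatial_attend(v, t)         # (T, D)
        # Prepend the text token so temporal attention is semantically
        # guided (one plausible realization; an assumption here).
        seq = torch.cat([t.unsqueeze(0), frames], dim=0).unsqueeze(0)
        return self.temporal(seq).squeeze(0)[1:]   # (T, D), text token dropped

    @staticmethod
    def match(query, support):
        # (4) Simple frame-wise cosine matching head (an assumption; the
        # paper's actual sequence-matching strategy may differ).
        q = F.normalize(query, dim=-1)
        s = F.normalize(support, dim=-1)
        return (q @ s.T).max(dim=-1).values.mean()  # scalar similarity

# Toy usage: 8 frames, 7x7 spatial grid, 512-d features, random tensors.
model = SGSTASketch()
maps_q, maps_s = torch.randn(8, 49, 512), torch.randn(8, 49, 512)
label_emb = torch.randn(512)
score = SGSTASketch.match(model.encode(maps_q, label_emb),
                          model.encode(maps_s, label_emb))
print(f"query-support similarity: {score.item():.3f}")
```

In this sketch the label embedding guides attention twice, once as the query for spatial pooling and once as an extra token in the temporal Transformer, which mirrors the abstract's two-stage spatial-then-temporal design.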
Pages: 2458-2471
Page count: 14