Semantic-guided spatio-temporal attention for few-shot action recognition

Cited by: 1
Authors
Wang, Jianyu [1 ]
Liu, Baolin [1 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Few-shot action recognition; Semantic-guided attention mechanism; Multimodal learning; Sequence matching; NETWORKS;
DOI
10.1007/s10489-024-05294-4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, several works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite significant progress, these methods still fall short of outstanding performance because examples are insufficient and additional supplementary information is scarce. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention, so that the rich spatio-temporal context in videos is captured more accurately when visual content alone is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features with text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to each frame's feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA can fully utilize the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
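The semantic-guided spatial attention described above can be sketched as single-head cross-attention in which the label's text embedding serves as the query and each frame's spatial positions serve as keys and values. This is an illustrative sketch only, not the authors' implementation; the function name, shapes, and the assumption that the text embedding has already been projected into the visual space by the visual-text alignment module are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_guided_spatial_attention(frame_feats, text_emb):
    """Illustrative sketch of semantic-guided spatial attention.

    frame_feats: (T, HW, D) spatial feature map per frame
    text_emb:    (D,) label text embedding, assumed already aligned
                 to the visual space (hypothetical interface)
    returns:     (T, D) semantically attended per-frame features
    """
    T, HW, D = frame_feats.shape
    # Relevance of each spatial region to the label semantics (scaled dot product).
    scores = frame_feats @ text_emb / np.sqrt(D)          # (T, HW)
    # Attention weights over spatial positions within each frame.
    weights = softmax(scores, axis=-1)                    # (T, HW)
    # Weighted pooling of spatial features, emphasizing relevant regions.
    return (weights[..., None] * frame_feats).sum(axis=1)  # (T, D)
```

The temporal step would then feed these (T, D) frame vectors to a temporal Transformer, and the matched support/query sequences would be compared with a sequence-matching metric; those stages are omitted here.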
Pages: 2458-2471 (14 pages)
Related papers (50 total)
  • [31] Spatio-Temporal Graph Few-Shot Learning with Cross-City Knowledge Transfer
    Lu, Bin
    Gan, Xiaoying
    Zhang, Weinan
    Yao, Huaxiu
    Fu, Luoyi
    Wang, Xinbing
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 1162 - 1172
  • [32] Unified Spatio-Temporal Attention Networks for Action Recognition in Videos
    Li, Dong
    Yao, Ting
    Duan, Ling-Yu
    Mei, Tao
    Rui, Yong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (02) : 416 - 428
  • [33] STCA: an action recognition network with spatio-temporal convolution and attention
    Tian, Qiuhong
    Miao, Weilun
    Zhang, Lizao
    Yang, Ziyu
    Yu, Yang
    Zhao, Yanying
    Yao, Lan
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2025, 14 (01)
  • [34] Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching
    Xing, Jiazheng
    Wang, Mengmeng
    Ruan, Yudi
    Chen, Bofan
    Guo, Yaowei
    Mu, Boyu
    Dai, Guang
    Wang, Jingdong
    Liu, Yong
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 1740 - 1750
  • [35] A statistical framework for few-shot action recognition
    Mark Haddad
    Vahid K. Ghassab
    Fatma Najar
    Nizar Bouguila
    Multimedia Tools and Applications, 2021, 80 : 24303 - 24318
  • [36] Lightweight Semantic-Guided Neural Networks Based on Single Head Attention for Action Recognition
    Kim, Seon-Bin
    Jung, Chanhyuk
    Kim, Byeong-Il
    Ko, Byoung Chul
    SENSORS, 2022, 22 (23)
  • [38] Few-Shot Temporal Sentence Grounding via Memory-Guided Semantic Learning
    Liu, Daizong
    Zhou, Pan
    Xu, Zichuan
    Wang, Haozhao
    Li, Ruixuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (05) : 2491 - 2505
  • [39] Few-shot semantic segmentation for industrial defect recognition
    Shi, Xiangwen
    Zhang, Shaobing
    Cheng, Miao
    He, Lian
    Tang, Xianghong
    Cui, Zhe
    COMPUTERS IN INDUSTRY, 2023, 148
  • [40] MULTI-SCALE TEMPORAL FEATURE FUSION FOR FEW-SHOT ACTION RECOGNITION
    Lee, Jun-Tae
    Yun, Sungrack
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1785 - 1789