Hybrid embedding for multimodal few-frame action recognition

Cited by: 0
Authors
Shafizadegan, Fatemeh [1 ]
Naghsh-Nilchi, Ahmad Reza [1 ]
Shabaninia, Elham [2 ]
Affiliations
[1] Univ Isfahan, Fac Comp Engn, Dept Artificial Intelligence Engn, Esfahan, Iran
[2] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman, Iran
Keywords
Action recognition; Vision transformer; Few-frame; Hybrid embedding
DOI
10.1007/s00530-025-01676-x
CLC classification
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
In recent years, action recognition has witnessed significant advancements. However, most existing approaches depend heavily on the availability of large amounts of video data, which can be computationally expensive and time-consuming to process, especially in real-time applications with limited computational resources. Using too few frames instead may lead to the loss of crucial information. Selecting a few frames in a way that preserves essential information therefore poses a challenge. To address this issue, this paper proposes a novel video clip embedding technique called Hybrid Embedding. This technique combines the advantages of uniform frame sampling and tubelet embedding to enhance recognition with few frames. By employing a transformer-based architecture, the approach captures both spatial and temporal information from limited video frames. Furthermore, a keyframe extraction method is introduced to select more informative and diverse frames, which is crucial when only a few frames are available. In addition, the region of interest (ROI) in each RGB frame is cropped using skeletal data to enhance spatial attention. The study also explores the impact of the number of frames, different modalities, various transformer models, and the effect of pretraining on few-frame human action recognition. Experimental results demonstrate the effectiveness of the proposed embedding technique in few-frame action recognition. These findings contribute to addressing the challenge of action recognition with limited frames and shed light on the potential of transformers in this domain.
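The two token streams the abstract combines can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the patch size, tubelet depth, projection matrices, and the concatenation-based fusion are all assumptions made for the sketch.

```python
import numpy as np

def uniform_frame_sample(video, k):
    """Pick k evenly spaced frames from a (T, H, W, C) clip."""
    idx = np.linspace(0, video.shape[0] - 1, k).round().astype(int)
    return video[idx]

def tubelet_tokens(video, t=2, p=16):
    """Flatten non-overlapping (t, p, p) spatio-temporal tubelets into tokens."""
    T, H, W, C = video.shape
    video = video[: T // t * t, : H // p * p, : W // p * p]
    T, H, W, _ = video.shape
    return (video.reshape(T // t, t, H // p, p, W // p, p, C)
                 .transpose(0, 2, 4, 1, 3, 5, 6)
                 .reshape(-1, t * p * p * C))

def frame_patch_tokens(frames, p=16):
    """Flatten (p, p) spatial patches of each sampled frame into tokens."""
    T, H, W, C = frames.shape
    frames = frames[:, : H // p * p, : W // p * p]
    T, H, W, _ = frames.shape
    return (frames.reshape(T, H // p, p, W // p, p, C)
                  .transpose(0, 1, 3, 2, 4, 5)
                  .reshape(-1, p * p * C))

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 64, 64, 3))  # toy 16-frame RGB clip
frames = uniform_frame_sample(video, k=4)

# Hypothetical fusion: project each token stream to a shared width d and
# concatenate along the token axis (the paper's exact fusion scheme is not
# specified in the abstract).
d = 128
W_frame = rng.standard_normal((frame_patch_tokens(frames).shape[1], d))
W_tube = rng.standard_normal((tubelet_tokens(video).shape[1], d))
tokens = np.concatenate([frame_patch_tokens(frames) @ W_frame,
                         tubelet_tokens(video) @ W_tube])
# tokens: 64 frame-patch tokens + 128 tubelet tokens, each of width d
```

Frame-patch tokens preserve fine spatial detail of the sampled keyframes, while tubelet tokens encode short-range motion; feeding both to a transformer is one plausible reading of "hybrid embedding".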
Pages: 20
Related papers
50 records total
  • [1] Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos
    Zhang, Zhaoyang
    Kuang, Zhanghui
    Luo, Ping
    Feng, Litong
    Zhang, Wei
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 257 - 264
  • [2] Regressive Gaussian Process Latent Variable Model for Few-Frame Human Motion Prediction
    Jin, Xin
    Guo, Jia
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2023, E106D (10) : 1621 - 1626
  • [3] Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition
    Wanyan, Yuyang
    Yang, Xiaoshan
    Chen, Chaofan
    Xu, Changsheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6492 - 6502
  • [4] TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition
    Ben-Ari, Rami
    Nacson, Mor Shpigel
    Azulai, Ophir
    Barzelay, Udi
    Rotman, Daniel
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 2780 - 2788
  • [5] Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition
    Ni, Xinzhe
    Liu, Yong
    Wen, Hao
    Ji, Yatai
    Xiao, Jing
    Yang, Yujiu
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 1 - 10
  • [6] Multimodal Preserving Embedding for Face Recognition
    Wang, Ying
    Pan, Chunhong
    Wang, Haitao
    2008 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2008), VOLS 1 AND 2, 2008, : 546 - +
  • [7] FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition
    Yu, Bin
    Hou, Yonghong
    Guo, Zihui
    Gao, Zhiyi
    Li, Yueyang
    IMAGE AND VISION COMPUTING, 2024, 149
  • [8] Hybrid attentive prototypical network for few-shot action recognition
    Ruan, Zanxi
    Wei, Yingmei
    Guo, Yanming
    Xie, Yuxiang
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (06) : 8249 - 8272
  • [9] Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
    Hatano, Masashi
    Hachiuma, Ryo
    Fujii, Ryo
    Saito, Hideo
    COMPUTER VISION - ECCV 2024, PT XXXIII, 2025, 15091 : 182 - 199
  • [10] A Novel Action Transformer Network for Hybrid Multimodal Sign Language Recognition
    Javaid, Sameena
    Rizvi, Safdar
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (01): : 523 - 537