Hybrid embedding for multimodal few-frame action recognition

Cited by: 0
Authors
Shafizadegan, Fatemeh [1 ]
Naghsh-Nilchi, Ahmad Reza [1 ]
Shabaninia, Elham [2 ]
Affiliations
[1] Univ Isfahan, Fac Comp Engn, Dept Artificial Intelligence Engn, Esfahan, Iran
[2] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman, Iran
Keywords
Action recognition; Vision transformer; Few-frame; Hybrid embedding;
DOI
10.1007/s00530-025-01676-x
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
In recent years, action recognition has witnessed significant advancements. However, most existing approaches depend heavily on large amounts of video data, which can be computationally expensive and time-consuming to process, especially in real-time applications with limited computational resources. Conversely, using too few frames may lead to the loss of crucial information. Selecting a few frames in a way that preserves essential information therefore poses a challenge. To address this issue, this paper proposes a novel video clip embedding technique called Hybrid Embedding, which combines the advantages of uniform frame sampling and tubelet embedding to enhance recognition with few frames. By employing a transformer-based architecture, the approach captures both spatial and temporal information from limited video frames. Furthermore, a keyframe extraction method is introduced to select more informative and diverse frames, which is crucial when only a few frames are available. In addition, the region of interest (ROI) in each RGB frame is cropped using skeletal data to enhance spatial attention. The study also explores the impact of the number of frames, different modalities, various transformer models, and the effect of pretraining on few-frame human action recognition. Experimental results demonstrate the effectiveness of the proposed embedding technique in few-frame action recognition. These findings contribute to addressing the challenge of action recognition with limited frames and shed light on the potential of transformers in this domain.
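The abstract describes combining uniform frame sampling with tubelet embedding but gives no implementation details. As an illustrative sketch only (not the authors' code), the two embedding schemes being combined can be expressed roughly as follows, assuming a `(T, H, W, C)` video array and ViViT-style non-overlapping tubelets; the tubelet depth `t` and patch size `p` are hypothetical parameters:

```python
import numpy as np

def uniform_frame_sampling(video, num_frames):
    # video: (T, H, W, C). Pick num_frames evenly spaced frames;
    # each selected frame would later be split into 2-D patch tokens.
    T = video.shape[0]
    idx = np.linspace(0, T - 1, num_frames).round().astype(int)
    return video[idx]

def tubelet_embedding(video, t=2, p=16):
    # Partition (T, H, W, C) into non-overlapping t x p x p tubelets
    # and flatten each into one token (ViViT-style spatio-temporal patches).
    T, H, W, C = video.shape
    video = video[: T - T % t, : H - H % p, : W - W % p]  # drop remainders
    T, H, W, C = video.shape
    tokens = (
        video.reshape(T // t, t, H // p, p, W // p, p, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group each tubelet together
             .reshape(-1, t * p * p * C)       # one flat vector per tubelet
    )
    return tokens

video = np.random.rand(16, 64, 64, 3)
frames = uniform_frame_sampling(video, 4)      # (4, 64, 64, 3)
tokens = tubelet_embedding(video, t=2, p=16)   # (8*4*4, 2*16*16*3) = (128, 1536)
```

Uniform sampling keeps per-frame spatial detail, while tubelets fuse short temporal spans into each token; the paper's Hybrid Embedding draws on both, though the exact fusion is described only at this conceptual level in the abstract.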
Pages: 20
Related papers
50 records in total
  • [31] Dense Dilated Network for Few Shot Action Recognition
    Xu, Baohan
    Ye, Hao
    Zheng, Yingbin
    Wang, Heng
    Luwang, Tianyu
    Jiang, Yu-Gang
    ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, : 379 - 387
  • [32] MULTIMODAL EMBEDDING FUSION FOR ROBUST SPEAKER ROLE RECOGNITION IN VIDEO BROADCAST
    Rouvier, Mickael
    Delecraz, Sebastien
    Favre, Benoit
    Bendris, Meriem
    Bechet, Frederic
    2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 383 - 389
  • [33] Temporal Hallucinating for Action Recognition with Few Still Images
    Wang, Yali
    Zhou, Lei
    Qiao, Yu
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 5314 - 5322
  • [34] A statistical framework for few-shot action recognition
    Haddad, Mark
    Ghassab, Vahid K.
    Najar, Fatma
    Bouguila, Nizar
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (16) : 24303 - 24318
  • [35] Multimodal human action recognition based on spatio-temporal action representation recognition model
    Wu, Qianhan
    Huang, Qian
    Li, Xing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (11) : 16409 - 16430
  • [38] Action density based frame sampling for human action recognition in videos
    Lin, Jie
    Mu, Zekun
    Zhao, Tianqing
    Zhang, Hanlin
    Yang, Xinyu
    Zhao, Peng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 90
  • [39] Landmark-based multimodal human action recognition
    Asteriadis, Stylianos
    Daras, Petros
    MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 : 4505 - 4521
  • [40] Multimodal integration for meeting group action segmentation and recognition
    Al-Hames, M
    Dielmann, A
    Gatica-Perez, D
    Reiter, S
    Renals, S
    Rigoll, G
    Zhang, D
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, 2005, 3869 : 52 - 63