Hybrid embedding for multimodal few-frame action recognition

Cited by: 0
Authors
Shafizadegan, Fatemeh [1 ]
Naghsh-Nilchi, Ahmad Reza [1 ]
Shabaninia, Elham [2 ]
Affiliations
[1] Univ Isfahan, Fac Comp Engn, Dept Artificial Intelligence Engn, Esfahan, Iran
[2] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman, Iran
Keywords
Action recognition; Vision transformer; Few-frame; Hybrid embedding
DOI
10.1007/s00530-025-01676-x
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline classification code
0812
Abstract
In recent years, action recognition has witnessed significant advances. However, most existing approaches depend heavily on large amounts of video data, which is computationally expensive and time-consuming to process, especially in real-time applications with limited computational resources. Using too few frames, on the other hand, may lead to the loss of crucial information. Selecting a few frames in a way that preserves essential information therefore poses a challenge. To address this issue, this paper proposes a novel video clip embedding technique called Hybrid Embedding, which combines the advantages of uniform frame sampling and tubelet embedding to improve recognition from few frames. By employing a transformer-based architecture, the approach captures both spatial and temporal information from a limited number of video frames. Furthermore, a keyframe extraction method is introduced to select more informative and diverse frames, which is crucial when only a few frames are available. In addition, the region of interest (ROI) in each RGB frame is cropped using skeletal data to enhance spatial attention. The study also explores the impact of the number of frames, different modalities, various transformer models, and pretraining on few-frame human action recognition. Experimental results demonstrate the effectiveness of the proposed embedding technique for few-frame action recognition. These findings contribute to addressing the challenge of action recognition with limited frames and shed light on the potential of transformers in this domain.
Pages: 20
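
To make the abstract's description concrete, the sketch below shows one way a hybrid video-clip embedding could combine ViT-style patch tokens from uniformly sampled frames with ViViT-style tubelet tokens and hand the concatenated sequence to a transformer encoder. This is a minimal illustration only: the module name HybridEmbedding, the tensor shapes, and fusion by simple token concatenation are assumptions, not details taken from the paper.

```python
# Minimal sketch of a hybrid clip embedding (assumed design, not the authors' code):
# 2D patch tokens from each uniformly sampled frame + 3D tubelet tokens over the clip.
import torch
import torch.nn as nn


class HybridEmbedding(nn.Module):
    def __init__(self, in_ch=3, embed_dim=768, patch=16, tubelet_t=2):
        super().__init__()
        # ViT-style 2D patch embedding applied to each sampled frame.
        self.frame_embed = nn.Conv2d(in_ch, embed_dim,
                                     kernel_size=patch, stride=patch)
        # ViViT-style 3D tubelet embedding spanning tubelet_t consecutive frames.
        self.tubelet_embed = nn.Conv3d(in_ch, embed_dim,
                                       kernel_size=(tubelet_t, patch, patch),
                                       stride=(tubelet_t, patch, patch))

    def forward(self, clip):
        # clip: (B, C, T, H, W) few-frame RGB clip; T assumed divisible by tubelet_t.
        b, c, t, h, w = clip.shape

        # Uniform frame sampling branch: embed every frame independently.
        frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frame_tok = self.frame_embed(frames)            # (B*T, D, H/p, W/p)
        frame_tok = frame_tok.flatten(2).transpose(1, 2)
        frame_tok = frame_tok.reshape(b, -1, frame_tok.shape[-1])

        # Tubelet branch: embed spatio-temporal tubes over the whole clip.
        tube_tok = self.tubelet_embed(clip)             # (B, D, T/t, H/p, W/p)
        tube_tok = tube_tok.flatten(2).transpose(1, 2)

        # Concatenate both token streams; a transformer encoder would consume this.
        return torch.cat([frame_tok, tube_tok], dim=1)


if __name__ == "__main__":
    tokens = HybridEmbedding()(torch.randn(2, 3, 8, 224, 224))
    print(tokens.shape)  # torch.Size([2, 2352, 768])
```

In this sketch the 2D branch preserves per-frame spatial detail while the 3D branch captures short-range motion within each tubelet, which is the kind of complementarity the abstract attributes to combining uniform frame sampling with tubelet embedding.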