Multimodal action recognition: a comprehensive survey on temporal modeling

Cited by: 1
Authors
Shabaninia, Elham [1 ,2 ]
Nezamabadi-pour, Hossein [2 ]
Shafizadegan, Fatemeh [3 ]
Affiliations
[1] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman 7631818356, Iran
[2] Shahid Bahonar Univ Kerman, Dept Elect Engn, Kerman 76169133, Iran
[3] Univ Isfahan, Dept Comp Engn, Esfahan 8174673441, Iran
Funding
U.S. National Science Foundation
关键词
Temporal modeling; Action recognition; Deep learning; Transformer; NEURAL-NETWORKS; ATTENTION; LSTM; VISION; FUSION; CLASSIFICATION;
DOI
10.1007/s11042-023-17345-y
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
In action recognition that relies on visual information, activities are recognized through spatio-temporal features drawn from different modalities. Temporal modeling has been a long-standing challenge in this field: deep learning-based approaches have relied on a limited set of techniques, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs), to model motion information. However, the success of transformers in modeling long-range dependencies in natural language processing has recently drawn attention in other domains, including speech, image, and video, because transformers can rely entirely on self-attention without sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the volume of research on this topic in the last few years is impressive. This paper reviews recent progress in deep learning methods for modeling temporal variations in multimodal human action recognition. Specifically, it focuses on methods that use transformers for temporal modeling, highlighting their key features and the modalities they employ, and identifies opportunities and challenges for future research.
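As a minimal illustration of the mechanism the abstract refers to (a generic sketch, not code from the survey): scaled dot-product self-attention applied along the temporal axis lets each frame feature attend to every other frame in the clip, so long-range temporal dependencies are captured without recurrence or convolution. The function name and shapes below are assumptions for the example.

```python
import numpy as np

def temporal_self_attention(x, w_q, w_k, w_v):
    """x: (T, d) per-frame features; w_*: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # queries, keys, values
    scores = q @ k.T / np.sqrt(x.shape[1])        # (T, T) pairwise frame scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over the time axis
    return attn @ v                               # (T, d) temporally attended features

rng = np.random.default_rng(0)
T, d = 8, 16                                      # 8 frames, 16-dim features
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = temporal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

In contrast to an RNN, every frame pair interacts in a single step here, which is why transformer-based methods scale to the long-range temporal dependencies the survey discusses.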
Pages: 59439 - 59489
Page count: 51
Related papers
50 records in total
  • [11] A comprehensive multimodal eye recognition
    Zhou, Zhi
    Du, Eliza Y.
    Thomas, N. Luke
    Delp, Edward J.
    SIGNAL IMAGE AND VIDEO PROCESSING, 2013, 7 (04) : 619 - 631
  • [13] Graph-Based Methods for Multimodal Indoor Activity Recognition: A Comprehensive Survey
    Javadi, Saeedeh
    Riboni, Daniele
    Borzi, Luigi
    Zolfaghari, Samaneh
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2025,
  • [14] Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition
    Liu, Qiang
    Chen, Enqing
    Gao, Lei
    Liang, Chengwu
    Liu, Hao
    SENSORS, 2020, 20 (17) : 1 - 17
  • [15] Temporal Modeling on Multi-Temporal-Scale Spatiotemporal Atoms for Action Recognition
    Yao, Guangle
    Lei, Tao
    Liu, Xianyuan
    Jiang, Ping
    APPLIED SCIENCES-BASEL, 2018, 8 (10):
  • [16] Long-Short Temporal Modeling for Efficient Action Recognition
    Wu, Liyu
    Zou, Yuexian
    Zhang, Can
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2435 - 2439
  • [18] Hierarchical Spatio-Temporal Context Modeling for Action Recognition
    Sun, Ju
    Wu, Xiao
    Yan, Shuicheng
    Cheong, Loong-Fah
    Chua, Tat-Seng
    Li, Jintao
    CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009, : 2004 - +
  • [19] A Comprehensive Survey of Vision-Based Human Action Recognition Methods
    Zhang, Hong-Bo
    Zhang, Yi-Xiang
    Zhong, Bineng
    Lei, Qing
    Yang, Lijie
    Du, Ji-Xiang
    Chen, Duan-Sheng
    SENSORS, 2019, 19 (05)
  • [20] Graph Convolutional Neural Network for Human Action Recognition: A Comprehensive Survey
    Ahmad T.
    Jin L.
    Zhang X.
    Lai S.
    Tang G.
    Lin L.
    IEEE Transactions on Artificial Intelligence, 2021, 2 (02): : 128 - 145