Multimodal action recognition: a comprehensive survey on temporal modeling

Cited by: 1
Authors
Shabaninia, Elham [1, 2]
Nezamabadi-pour, Hossein [2]
Shafizadegan, Fatemeh [3]
Affiliations
[1] Graduate University of Advanced Technology, Faculty of Sciences and Modern Technologies, Department of Applied Mathematics, Kerman 7631818356, Iran
[2] Shahid Bahonar University of Kerman, Department of Electrical Engineering, Kerman 76169133, Iran
[3] University of Isfahan, Department of Computer Engineering, Esfahan 8174673441, Iran
Funding
National Science Foundation (USA)
Keywords
Temporal modeling; Action recognition; Deep learning; Transformer; Neural networks; Attention; LSTM; Vision; Fusion; Classification
DOI
10.1007/s11042-023-17345-y
CLC classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In action recognition that relies on visual information, activities are recognized through spatio-temporal features drawn from different modalities. Temporal modeling has been a long-standing challenge in this field: only a limited number of techniques, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs), are used in deep learning-based approaches to model motion information. However, the success of transformers in modeling long-range dependencies in natural language processing has recently caught the attention of other domains, including speech, image, and video, because transformers can rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the volume of research published on this topic in the last few years is impressive. This paper reviews recent progress in deep learning methods for modeling temporal variations in multimodal human action recognition. Specifically, it focuses on methods that use transformers for temporal modeling, highlighting their key features and the modalities they employ, while also identifying opportunities and challenges for future research.
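The contrast drawn in the abstract, recurrent temporal modeling versus self-attention over the whole frame sequence, can be illustrated with a minimal sketch. The PyTorch modules, class names, and dimensions below are hypothetical and chosen only for illustration; they do not reproduce any specific method covered by the survey. Both heads assume per-frame features have already been extracted by some backbone and differ only in how they aggregate those features over time.

```python
# Illustrative sketch (not from the surveyed papers): two common ways of modelling
# temporal structure over pre-extracted per-frame features in action recognition.
import torch
import torch.nn as nn


class RNNTemporalHead(nn.Module):
    """Recurrent temporal modelling: frames are processed sequentially."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=60):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)  # last hidden state summarises the clip
        return self.fc(h_n[-1])               # (batch, num_classes)


class TransformerTemporalHead(nn.Module):
    """Self-attention temporal modelling: every frame attends to every other frame."""
    def __init__(self, feat_dim=512, num_classes=60, num_frames=16, heads=8, layers=2):
        super().__init__()
        # learned temporal position embeddings, one per frame
        self.pos = nn.Parameter(torch.zeros(1, num_frames, feat_dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):               # (batch, time, feat_dim)
        x = self.encoder(frame_feats + self.pos)  # long-range dependencies via self-attention
        return self.fc(x.mean(dim=1))             # average-pool over time, then classify


if __name__ == "__main__":
    clip = torch.randn(4, 16, 512)                # 4 clips, 16 frames, 512-d per-frame features
    print(RNNTemporalHead()(clip).shape)          # torch.Size([4, 60])
    print(TransformerTemporalHead()(clip).shape)  # torch.Size([4, 60])
```

The design difference is the one the abstract highlights: the LSTM head propagates information step by step along the sequence, while the transformer head lets all frames interact directly through self-attention, with no sequence-aligned recurrence or convolution.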
Pages: 59439 - 59489
Number of pages: 51
Related papers (50 in total)
  • [1] A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data
    Ye, Jun
    Hu, Hao
    Qi, Guo-Jun
    Hua, Kien A.
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2017, 13 (02): 1 - 22
  • [2] Multimodal human action recognition based on spatio-temporal action representation recognition model
    Wu, Qianhan
    Huang, Qian
    Li, Xing
    Multimedia Tools and Applications, 2023, 82 (11): 16409 - 16430
  • [3] A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector
    Das Dawn, Debapratim
    Shaikh, Soharab Hossain
    The Visual Computer, 2016, 32 (03): 289 - 306
  • [4] From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
    Shaikh, Muhammad Bilal
    Chai, Douglas
    Islam, Syed Muhammad Shamsul
    Akhtar, Naveed
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2024, 20 (08)
  • [5] Dilated Multi-Temporal Modeling for Action Recognition
    Zhang, Tao
    Wu, Yifan
    Li, Xiaoqiang
    Applied Sciences-Basel, 2023, 13 (12)
  • [6] Cluster-guided temporal modeling for action recognition
    Kim, Jeong-Hun
    Hao, Fei
    Leung, Carson Kai-Sang
    Nasridinov, Aziz
    International Journal of Multimedia Information Retrieval, 2023, 12 (02)
  • [7] Feature Encodings and Poolings for Action and Event Recognition: A Comprehensive Survey
    Liu, Changyu
    Zhang, Qian
    Lu, Bin
    Li, Cong
    Information, 2017, 8 (04)