Multimodal action recognition: a comprehensive survey on temporal modeling

Cited by: 1
Authors
Shabaninia, Elham [1 ,2 ]
Nezamabadi-pour, Hossein [2 ]
Shafizadegan, Fatemeh [3 ]
Affiliations
[1] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman 7631818356, Iran
[2] Shahid Bahonar Univ Kerman, Dept Elect Engn, Kerman 76169133, Iran
[3] Univ Isfahan, Dept Comp Engn, Esfahan 8174673441, Iran
Funding
U.S. National Science Foundation;
Keywords
Temporal modeling; Action recognition; Deep learning; Transformer; Neural networks; Attention; LSTM; Vision; Fusion; Classification;
DOI
10.1007/s11042-023-17345-y
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In action recognition that relies on visual information, activities are recognized through spatio-temporal features drawn from different modalities. Temporal modeling has long been a central challenge in this field. Only a limited set of techniques, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs), has been used in deep learning-based approaches to model motion information. However, the success of transformers in modeling long-range dependencies in natural language processing has recently attracted attention in other domains, including speech, image, and video, since transformers can rely entirely on self-attention without sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research on this topic in the last few years is impressive. This paper reviews recent progress in deep learning methods for modeling temporal variations in multimodal human action recognition. Specifically, it focuses on methods that use transformers for temporal modeling, highlighting their key features and the modalities they employ, while also identifying opportunities and challenges for future research.
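As a rough illustration of the transformer-based temporal modeling the abstract refers to (this is not code from the survey), the sketch below applies self-attention over pre-extracted per-frame features instead of recurrence or temporal convolutions. The module name, feature dimensions, frame count, and class count are illustrative assumptions.

```python
# Minimal sketch (assumption, not the survey's method): temporal modeling of
# per-frame video features with self-attention rather than RNNs or 3D filters.
import torch
import torch.nn as nn

class TemporalSelfAttentionHead(nn.Module):
    """Classify a clip from pre-extracted per-frame features.

    Input:  (batch, num_frames, feat_dim) features from any 2D frame backbone.
    Output: (batch, num_classes) action logits.
    """
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2,
                 num_frames=16, num_classes=400):
        super().__init__()
        # Learnable temporal position embeddings make attention order-aware.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Stacked self-attention layers model long-range frame dependencies
        # without sequence-aligned recurrence or temporal convolutions.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):
        x = frame_feats + self.pos_embed       # add temporal positions
        x = self.encoder(x)                    # temporal self-attention
        return self.classifier(x.mean(dim=1))  # average over time, classify

# Usage: a batch of 4 clips, each with 16 frames of 512-d features.
model = TemporalSelfAttentionHead()
logits = model(torch.randn(4, 16, 512))
print(logits.shape)  # torch.Size([4, 400])
```

Multimodal variants surveyed in the paper typically feed features from several modalities (e.g., RGB, depth, skeleton) through such attention blocks and fuse them; the single-stream version above only shows the temporal-attention idea.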
Pages: 59439-59489
Page count: 51
Related Papers
50 items in total
  • [41] A BERT-Based Joint Channel-Temporal Modeling for Action Recognition
    Yang, Man
    Gan, Lipeng
    Cao, Runze
    Li, Xiaochao
    IEEE SENSORS JOURNAL, 2023, 23 (19) : 23765 - 23779
  • [42] Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition
    Chen, Meng
    Gong, Liyu
    Wang, Tianjiang
    Liu, Fang
    Feng, Qi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (17) : 10335 - 10355
  • [44] Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
    Xiang, Wangmeng
    Li, Chao
    Wang, Biao
    Wei, Xihan
    Hua, Xian-Sheng
    Zhang, Lei
    COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 627 - 644
  • [45] Video Action Recognition with Spatio-temporal Graph Embedding and Spline Modeling
    Yuan, Yin
    Zheng, Haomian
    Li, Zhu
    Zhang, David
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 2422 - 2425
  • [46] Efficient 2D Temporal Modeling Network for Video Action Recognition
    Li, Zhilei
    Li, Jun
    Shi, Zhiping
    Jiang, Na
    Zhang, Yongkang
    Computer Engineering and Applications, 2024, 59 (03) : 127 - 134
  • [47] Spatio-temporal Relation Modeling for Few-shot Action Recognition
    Thatipelli, Anirudh
    Narayan, Sanath
    Khan, Salman
    Anwer, Rao Muhammad
    Khan, Fahad Shahbaz
    Ghanem, Bernard
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19926 - 19935
  • [48] A Survey on Temporal Action Localization
    Xia, Huifen
    Zhan, Yongzhao
    IEEE ACCESS, 2020, 8 : 70477 - 70487
  • [49] A Comprehensive Survey of RGB-Based and Skeleton-Based Human Action Recognition
    Wang, Cailing
    Yan, Jingjing
    IEEE ACCESS, 2023, 11 : 53880 - 53898
  • [50] The Recognition of the Importance of Comprehensive Modeling
    Liu, Qiang
    9TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED INDUSTRIAL DESIGN & CONCEPTUAL DESIGN, VOLS 1 AND 2, 2008, : 1015 - 1017