Multimodal action recognition: a comprehensive survey on temporal modeling

Cited by: 1
Authors
Shabaninia, Elham [1 ,2 ]
Nezamabadi-pour, Hossein [2 ]
Shafizadegan, Fatemeh [3 ]
Affiliations
[1] Grad Univ Adv Technol, Fac Sci & Modern Technol, Dept Appl Math, Kerman 7631818356, Iran
[2] Shahid Bahonar Univ Kerman, Dept Elect Engn, Kerman 76169133, Iran
[3] Univ Isfahan, Dept Comp Engn, Esfahan 8174673441, Iran
Funding
U.S. National Science Foundation;
Keywords
Temporal modeling; Action recognition; Deep learning; Transformer; Neural networks; Attention; LSTM; Vision; Fusion; Classification;
DOI
10.1007/s11042-023-17345-y
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In action recognition that relies on visual information, activities are recognized through spatio-temporal features drawn from different modalities. Temporal modeling has long been a central challenge in this field. Only a limited set of techniques, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs), has been used in deep learning-based approaches to model motion information. However, the success of transformers in modeling long-range dependencies in natural language processing has recently attracted attention in other domains, including speech, image, and video, since transformers can rely entirely on self-attention without sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research on this topic in the last few years is impressive. This paper reviews recent progress in deep learning methods for modeling temporal variations in multimodal human action recognition. Specifically, it focuses on methods that use transformers for temporal modeling, highlighting their key features and the modalities they employ, while also identifying opportunities and challenges for future research.
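As a rough illustration of the transformer-based temporal modeling the abstract refers to (this is not code from the survey), the sketch below applies self-attention over pre-extracted per-frame features instead of recurrence or temporal convolutions. The module name, feature dimensions, frame count, and class count are illustrative assumptions.

```python
# Minimal sketch (assumption, not the survey's method): temporal modeling of
# per-frame video features with self-attention rather than RNNs or 3D filters.
import torch
import torch.nn as nn

class TemporalSelfAttentionHead(nn.Module):
    """Classify a clip from pre-extracted per-frame features.

    Input:  (batch, num_frames, feat_dim) features from any 2D frame backbone.
    Output: (batch, num_classes) action logits.
    """
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2,
                 num_frames=16, num_classes=400):
        super().__init__()
        # Learnable temporal position embeddings make attention order-aware.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Stacked self-attention layers model long-range frame dependencies
        # without sequence-aligned recurrence or temporal convolutions.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):
        x = frame_feats + self.pos_embed       # add temporal positions
        x = self.encoder(x)                    # temporal self-attention
        return self.classifier(x.mean(dim=1))  # average over time, classify

# Usage: a batch of 4 clips, each with 16 frames of 512-d features.
model = TemporalSelfAttentionHead()
logits = model(torch.randn(4, 16, 512))
print(logits.shape)  # torch.Size([4, 400])
```

Multimodal variants surveyed in the paper typically feed features from several modalities (e.g., RGB, depth, skeleton) through such attention blocks and fuse them; the single-stream version above only shows the temporal-attention idea.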
Pages: 59439-59489
Page count: 51
Related Papers
50 items in total
  • [41] A BERT-Based Joint Channel-Temporal Modeling for Action Recognition
    Yang, Man
    Gan, Lipeng
    Cao, Runze
    Li, Xiaochao
    IEEE SENSORS JOURNAL, 2023, 23 (19) : 23765 - 23779
  • [42] Modeling spatio-temporal layout with Lie Algebrized Gaussians for action recognition
    Chen, Meng
    Gong, Liyu
    Wang, Tianjiang
    Liu, Fang
    Feng, Qi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (17) : 10335 - 10355
  • [44] Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
    Xiang, Wangmeng
    Li, Chao
    Wang, Biao
    Wei, Xihan
    Hua, Xian-Sheng
    Zhang, Lei
    COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 627 - 644
  • [45] Video Action Recognition with Spatio-temporal Graph Embedding and Spline Modeling
    Yuan, Yin
    Zheng, Haomian
    Li, Zhu
    Zhang, David
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 2422 - 2425
  • [46] Efficient 2D Temporal Modeling Network for Video Action Recognition
    Li, Zhilei
    Li, Jun
    Shi, Zhiping
    Jiang, Na
    Zhang, Yongkang
    Computer Engineering and Applications, 2024, 59 (03) : 127 - 134
  • [47] Spatio-temporal Relation Modeling for Few-shot Action Recognition
    Thatipelli, Anirudh
    Narayan, Sanath
    Khan, Salman
    Anwer, Rao Muhammad
    Khan, Fahad Shahbaz
    Ghanem, Bernard
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19926 - 19935
  • [48] A Survey on Temporal Action Localization
    Xia, Huifen
    Zhan, Yongzhao
    IEEE ACCESS, 2020, 8 : 70477 - 70487
  • [49] A Comprehensive Survey of RGB-Based and Skeleton-Based Human Action Recognition
    Wang, Cailing
    Yan, Jingjing
    IEEE ACCESS, 2023, 11 : 53880 - 53898
  • [50] The Recognition of the Importance of Comprehensive Modeling
    Liu, Qiang
    9TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED INDUSTRIAL DESIGN & CONCEPTUAL DESIGN, VOLS 1 AND 2, 2008, : 1015 - 1017