From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Cited by: 2
Authors
Shaikh, Muhammad Bilal [1, 2]
Chai, Douglas [2]
Islam, Syed Muhammad Shamsul [3]
Akhtar, Naveed [4]
Affiliations
[1] Edith Cowan Univ, Sch Engn, Joondalup, WA, Australia
[2] Molycop, Balcatta, WA, Australia
[3] Edith Cowan Univ, Sch Sci, Joondalup, WA, Australia
[4] Univ Melbourne, Melbourne, Vic, Australia
Keywords
Multimodal; action recognition; fusion; deep learning; neural networks; RGB-D; streams
DOI
10.1145/3664815
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the past decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of 'fusing' the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an overview of multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.
Pages: 24
Related papers
50 in total
  • [1] Human Action Recognition with Transformers
    Mazzeo, Pier Luigi
    Spagnolo, Paolo
    Fasano, Matteo
    Distante, Cosimo
    IMAGE ANALYSIS AND PROCESSING, ICIAP 2022, PT III, 2022, 13233 : 230 - 241
  • [2] Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition
    Wu, Hanbo
    Ma, Xin
    Li, Yibin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 1250 - 1261
  • [3] A Survey on Human Action Recognition from Videos
    Dhamsania, Chandni J.
    Ratanpara, Tushar V.
    PROCEEDINGS OF 2016 ONLINE INTERNATIONAL CONFERENCE ON GREEN ENGINEERING AND TECHNOLOGIES (IC-GET), 2016,
  • [4] Scaling up The Training of Deep CNNs for Human Action Recognition
    Rajeswar, M. Sai
    Sankar, A. Ravi
    Balasubramanian, Vineeth N.
    Sudheer, C. D.
    2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, 2015, : 1172 - 1177
  • [5] REALISTIC HUMAN ACTION RECOGNITION: WHEN CNNS MEET LDS
    Zhang, Lei
    Feng, Yangyang
    Xiang, Xuezhi
    Zhen, Xiantong
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 1622 - 1626
  • [6] Multimodal Learning With Transformers: A Survey
    Xu, Peng
    Zhu, Xiatian
    Clifton, David A.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 12113 - 12132
  • [7] Human Action Recognition: A Survey
    Fu, Meixia
    Chen, Na
    Huang, Zhongjie
    Ni, Kaili
    Liu, Yuhao
    Sun, Songlin
    Ma, Xiaomei
    SIGNAL AND INFORMATION PROCESSING, NETWORKING AND COMPUTERS (ICSINC), 2019, 550 : 69 - 77
  • [8] Multimodal action recognition: a comprehensive survey on temporal modeling
    Shabaninia, Elham
    Nezamabadi-pour, Hossein
    Shafizadegan, Fatemeh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (20) : 59439 - 59489
  • [9] Multiresolution and Multimodal Speech Recognition with Transformers
    Paraskevopoulos, Georgios
    Parthasarathy, Srinivas
    Khare, Aparna
    Sundaram, Shiva
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2381 - 2387
  • [10] 3D CNNs on Distance Matrices for Human Action Recognition
    Hernandez Ruiz, Alejandro
    Porzi, Lorenzo
    Bulo, Samuel Rota
    Moreno-Noguer, Francesc
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1087 - 1095