Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Cited by: 49
Authors
Moutik, Oumaima [1]
Sekkat, Hiba [1]
Tigani, Smail [1]
Chehri, Abdellah [2]
Saadane, Rachid [3]
Tchakoucht, Taha Ait [1]
Paul, Anand [4]
Affiliations
[1] Euro Mediterranean Univ, Euromed Res Ctr, Engn Unit, Fes 30030, Morocco
[2] Royal Mil Coll Canada, Dept Math & Comp Sci, Kingston, ON K7K 7B4, Canada
[3] Hassania Sch Publ Works, SIRC LaGeS, Casablanca 8108, Morocco
[4] Kyungpook Natl Univ, Sch Comp Sci & Engn, Daegu 41566, South Korea
Keywords
convolutional neural networks; vision transformers; recurrent neural networks; conversational systems; action recognition; natural language understanding; COMPUTER VISION; ATTENTION
DOI
10.3390/s23020734
Chinese Library Classification
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the last decades. Convolutional neural networks (CNNs) are a significant component of this topic and have played a crucial role in the renown of deep learning. Inspired by the human vision system, CNNs have been applied to visual data exploitation and have solved various challenges in computer vision tasks and video/image analysis, including action recognition (AR). Recently, however, following the success of the transformer in natural language processing (NLP), transformers began to set new trends in vision tasks, which has created a discussion around whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this trending topic in detail, studying CNNs and Transformers for action recognition separately and presenting a comparative study of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, it discusses whether CNNs or Vision Transformers will win the race.
Pages: 21
Related papers
50 in total
  • [1] CMT: Convolutional Neural Networks Meet Vision Transformers
    Guo, Jianyuan
    Han, Kai
    Wu, Han
    Tang, Yehui
    Chen, Xinghao
    Wang, Yunhe
    Xu, Chang
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12165 - 12175
  • [2] Visualization Comparison of Vision Transformers and Convolutional Neural Networks
    Shi, Rui
    Li, Tianxing
    Zhang, Liguo
    Yamaguchi, Yasushi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2327 - 2339
  • [3] Deepfake detection using convolutional vision transformers and convolutional neural networks
    Soudy, Ahmed Hatem
    Sayed, Omnia
    Tag-Elser, Hala
    Ragab, Rewaa
    Mohsen, Sohaila
    Mostafa, Tarek
    Abohany, Amr A.
    Slim, Salwa O.
    Neural Computing and Applications, 2024, 36 (31) : 19759 - 19775
  • [4] Adversarial Robustness of Vision Transformers Versus Convolutional Neural Networks
    Ali, Kazim
    Bhatti, Muhammad Shahid
    Saeed, Atif
    Athar, Atifa
    Al Ghamdi, Mohammed A.
    Almotiri, Sultan H.
    Akram, Samina
    IEEE ACCESS, 2024, 12 : 105281 - 105293
  • [5] Do Vision Transformers See Like Convolutional Neural Networks?
    Raghu, Maithra
    Unterthiner, Thomas
    Kornblith, Simon
    Zhang, Chiyuan
    Dosovitskiy, Alexey
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
    Rao, Yongming
    Liu, Zuyan
    Zhao, Wenliang
    Zhou, Jie
    Lu, Jiwen
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (09) : 10883 - 10897
  • [7] Understanding and improving adversarial transferability of vision transformers and convolutional neural networks
    Chen, Zhiyu
    Xu, Chi
    Lv, Huanhuan
    Liu, Shangdong
    Ji, Yimu
    INFORMATION SCIENCES, 2023, 648
  • [8] Evaluating Convolutional Neural Networks and Vision Transformers for Baby Cry Sound Analysis
    Younis, Samir A.
    Sobhy, Dalia
    Tawfik, Noha S.
    FUTURE INTERNET, 2024, 16 (07)
  • [9] Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review
    Mauricio, Jose
    Domingues, Ines
    Bernardino, Jorge
    APPLIED SCIENCES-BASEL, 2023, 13 (09)
  • [10] Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
    Lu, Zhiying
    Xie, Hongtao
    Liu, Chuanbin
    Zhang, Yongdong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022