MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Cited by: 1
Authors
Huo, Hua [1]
Li, Bingjie [1]
Affiliations
[1] Henan Univ Sci & Technol, Informat Engn Coll, Luoyang 471000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
action recognition; multi-granularity multi-scale fusion; vision transformer; efficiency
DOI
10.3390/electronics13050948
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition. However, high computational cost and large memory consumption remain issues that cannot be ignored, and current research focuses on finding effective methods to improve model performance and overcome these limits. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed to reduce the computational cost and memory usage of action recognition in videos. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, which reduces redundant tokens. Third, a coarse-fine granularity fusion layer shortens the sequence of tokens with lower information content. These two mechanisms are combined to optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, thereby lowering computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets in the action recognition domain. The experimental results show that the method achieves state-of-the-art performance in terms of accuracy and efficiency.
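The token-reduction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `reduce_tokens` and `attention_pairs`, the per-token importance `scores`, and the score-weighted averaging used for fusion are all assumptions made for illustration. The sketch keeps the highest-scoring tokens at fine granularity, fuses the remaining low-information tokens into one coarse token, and shows why a shorter sequence lowers the cost of self-attention, which grows quadratically with the number of tokens.

```python
from typing import List

Token = List[float]


def reduce_tokens(tokens: List[Token], scores: List[float], keep: int) -> List[Token]:
    """Keep the `keep` highest-scoring tokens; fuse the rest into one coarse token.

    Hypothetical helper: the scoring and fusion rules are assumptions,
    not the method from the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    kept_idx = sorted(order[:keep])  # preserve the original token order
    rest_idx = order[keep:]

    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept

    # Assumed fusion rule: score-weighted average of the low-scoring tokens.
    total = sum(scores[i] for i in rest_idx)
    if total > 0:
        weights = [scores[i] / total for i in rest_idx]
    else:
        weights = [1.0 / len(rest_idx)] * len(rest_idx)
    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest_idx))
        for d in range(dim)
    ]
    return kept + [fused]


def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs evaluated by one self-attention layer over n_tokens tokens."""
    return n_tokens * n_tokens


# Four 2-dimensional tokens with made-up importance scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.5, 0.125, 0.375, 1.0]
reduced = reduce_tokens(tokens, scores, keep=2)
print(len(tokens), "->", len(reduced), "tokens")
print(attention_pairs(len(tokens)), "->", attention_pairs(len(reduced)), "attention pairs")
```

In this toy case the sequence shrinks from four tokens to three, so the number of query-key pairs per attention layer drops from 16 to 9. In a real model the scores would come from the network itself (for example, from attention weights), which this sketch does not attempt to model.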
Pages: 16