Convolutional transformer network for fine-grained action recognition

Cited by: 2
Authors
Ma, Yujun [1]
Wang, Ruili [1]
Zong, Ming [2]
Ji, Wanting [3]
Wang, Yi [4]
Ye, Baoliu [5]
Affiliations
[1] Massey Univ, Sch Math & Computat Sci, Auckland, New Zealand
[2] Shanghai Inst Technol, Sch Comp Sci & Informat Engn, Shanghai, Peoples R China
[3] Liaoning Univ, Sch Informat, Shenyang, Peoples R China
[4] Dalian Univ Technol, DUT RU Int Sch Informat Sci Engn, Dalian, Peoples R China
[5] Nanjing Univ, Dept Comp Sci & Technol, Nanjing, Peoples R China
Keywords
Fine-grained action recognition; Transformer; 3D convolutions; Spatial-temporal features; CNN
DOI
10.1016/j.neucom.2023.127027
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Fine-grained action recognition is a critical problem in video processing: it aims to distinguish similar actions that differ only in subtle interactions between humans and objects. Inspired by its remarkable performance in natural language processing, the Transformer has been applied to fine-grained action recognition. However, the Transformer needs abundant training data and extra supervision to achieve results comparable with those of convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNNs (e.g., weight sharing, locality, and capturing low-level features in videos) with the benefits of the Transformer (e.g., dynamic attention and learning long-range dependencies). We make two modifications to the original Transformer: (i) a video-to-tokens module that extracts tokens from spatial-temporal features computed by 3D convolutions, instead of embedding tokens directly from raw input video clips; (ii) a depth-wise convolutional mapping that completely replaces the linear mapping in the multi-head self-attention layer and applies a depth-wise separable convolution to the embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long token sequences encountered in videos. Experimental results demonstrate that the proposed CTN achieves state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with only a small increase in computation.
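The two modifications named in the abstract can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not the authors' implementation: the clip size, kernel sizes, strides, padding, and channel width are all assumed values chosen for illustration, and the helper names are hypothetical. It uses the standard convolution output-size formula to show how a 3D convolution turns a clip into a grid of tokens, and it counts weights to compare a plain linear mapping with a depth-wise separable convolution (a depth-wise filter per channel followed by a 1x1x1 point-wise layer).

```python
# Illustrative sketch only: every size below is an assumption, not a value from the paper.

def conv_out(size, kernel, stride, padding):
    """Output length along one axis for a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def video_to_token_grid(frames, height, width, kernel, stride, padding):
    """Token grid (T', H', W') produced by one 3D convolution over a clip."""
    dims = (frames, height, width)
    return tuple(
        conv_out(s, k, st, p)
        for s, k, st, p in zip(dims, kernel, stride, padding)
    )


def linear_projection_params(dim):
    """Weights in one dim -> dim linear mapping (bias omitted)."""
    return dim * dim


def depthwise_separable_params(dim, kernel):
    """One k_t*k_h*k_w filter per channel, then a 1x1x1 point-wise layer (bias omitted)."""
    kt, kh, kw = kernel
    return dim * kt * kh * kw + dim * dim


def standard_conv_params(dim, kernel):
    """Weights in a full dim -> dim 3D convolution (bias omitted), for comparison."""
    kt, kh, kw = kernel
    return dim * dim * kt * kh * kw


if __name__ == "__main__":
    # Assumed clip: 16 frames of 224x224 pixels; assumed stem kernel, stride, padding.
    grid = video_to_token_grid(16, 224, 224, (3, 7, 7), (2, 4, 4), (1, 3, 3))
    tokens = grid[0] * grid[1] * grid[2]
    print("token grid:", grid, "tokens:", tokens)  # (8, 56, 56), 25088

    dim, k = 64, (3, 3, 3)  # assumed channel width and depth-wise kernel
    print("linear mapping weights:      ", linear_projection_params(dim))       # 4096
    print("depth-wise separable weights:", depthwise_separable_params(dim, k))  # 5824
    print("standard 3D conv weights:    ", standard_conv_params(dim, k))        # 110592
```

Under these assumed sizes, the depth-wise separable mapping adds only the per-channel filters on top of the linear mapping's weights, far fewer than a full 3D convolution would need. This is consistent in spirit with the abstract's "small computational increase", although the actual figures depend on the paper's configuration, which this record does not give.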
Pages: 12