Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Cited by: 14
Authors
Wang, Mengmeng [1 ]
Xing, Jiazheng [1 ]
Su, Jing [2 ]
Chen, Jun [1 ]
Liu, Yong [1 ]
Affiliations
[1] Zhejiang Univ, Coll Control Sci & Engn, Lab Adv Percept Robot & Intelligent Learning, Hangzhou 310027, Zhejiang, Peoples R China
[2] Fudan Univ, Dept Opt Sci & Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; frequency illustration; motion features; spatiotemporal features; twins training framework; REPRESENTATION;
DOI
10.1109/TPAMI.2022.3173658
CLC classification number
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent methods for action recognition typically apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and rely on optical flow to capture motion features. Although these methods achieve state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN, without any 3D convolution or optical flow computation. In particular, we first design a channel-wise spatiotemporal module to represent spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. In addition, we provide a distinctive interpretation of the two modules from the frequency domain, viewing them as advanced, learnable versions of frequency components. Second, we combine these two modules and an identity mapping path into one unified block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network, dubbed the STM network, that introduces very limited extra computational cost and parameters. Third, we propose a novel Twins Training framework for action recognition that incorporates a correlation loss to optimize inter-class and intra-class correlations and a siamese structure to make full use of the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something V1 & V2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51), where it achieves favorable results against state-of-the-art methods.
Pages: 3347 - 3362
Number of pages: 16
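As a rough illustration of the channel-wise spatiotemporal module, the channel-wise motion module, and the unified residual-style block described in the abstract, the following PyTorch sketch shows one plausible realization. All class names, tensor layouts, and hyperparameters (a depthwise 3-tap temporal convolution, a 3x3 depthwise spatial convolution, and frame-difference motion encoding) are assumptions made for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class ChannelwiseSpatioTemporalModule(nn.Module):
    """Depthwise 1D convolution along the time axis, applied per channel.

    Input is assumed to be stacked per-frame features of shape
    (N * T, C, H, W), where T is the number of frames per clip.
    """

    def __init__(self, channels: int, num_frames: int, kernel_size: int = 3):
        super().__init__()
        self.t = num_frames
        self.temporal_conv = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.t
        # (N*T, C, H, W) -> (N*H*W, C, T) so the convolution mixes time only.
        y = x.view(n, self.t, c, h, w).permute(0, 3, 4, 2, 1).reshape(-1, c, self.t)
        y = self.temporal_conv(y)
        # Back to the original (N*T, C, H, W) layout.
        return y.view(n, h, w, c, self.t).permute(0, 4, 3, 1, 2).reshape(nt, c, h, w)


class ChannelwiseMotionModule(nn.Module):
    """Feature-level motion: difference between each frame and a depthwise
    3x3-transformed version of the next frame (zeros appended for the last frame)."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.t = num_frames
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1,
                                      groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.t
        x5 = x.view(n, self.t, c, h, w)
        nxt = self.spatial_conv(x5[:, 1:].reshape(-1, c, h, w)).view(n, self.t - 1, c, h, w)
        motion = nxt - x5[:, :-1]                                  # frame-to-frame feature difference
        motion = torch.cat([motion, torch.zeros_like(x5[:, :1])], dim=1)
        return motion.reshape(nt, c, h, w)


class STMBlock(nn.Module):
    """Residual-style block: identity path + spatiotemporal path + motion path."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.cstm = ChannelwiseSpatioTemporalModule(channels, num_frames)
        self.cmm = ChannelwiseMotionModule(channels, num_frames)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.cstm(x) + self.cmm(x)

For example, with two clips of eight 64-channel 28x28 feature maps, x = torch.randn(2 * 8, 64, 28, 28), STMBlock(64, 8)(x) returns a tensor of the same shape, so such a block can stand in for a 2D residual block without changing the backbone's tensor layout.

The Twins Training framework is described only at a high level in the abstract (a siamese structure plus a correlation loss over inter-class and intra-class correlations). The sketch below is one hedged reading of such a loss: it pulls same-class prediction vectors from the two branches toward high correlation and pushes different-class pairs toward non-positive correlation. The function name and its exact form are hypothetical and may differ from the loss used in the paper.

import torch
import torch.nn.functional as F


def correlation_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Toy correlation objective for two siamese branches (hypothetical form).

    logits_a, logits_b: (N, K) class predictions from the two branches;
    labels: (N,) ground-truth class indices for the N clips.
    Same-class prediction pairs are pushed toward correlation 1 (intra-class),
    different-class pairs toward non-positive correlation (inter-class).
    """
    def zscore(logits: torch.Tensor) -> torch.Tensor:
        p = F.softmax(logits, dim=1)
        return (p - p.mean(dim=1, keepdim=True)) / (p.std(dim=1, keepdim=True) + 1e-6)

    a, b = zscore(logits_a), zscore(logits_b)
    corr = (a @ b.t()) / a.shape[1]                       # (N, N) Pearson-like correlations
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    intra = (1.0 - corr) * same                           # same class: correlation -> 1
    inter = corr.clamp(min=0.0) * (1.0 - same)            # different class: correlation -> <= 0
    return (intra.sum() + inter.sum()) / (labels.numel() ** 2)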