STM: SpatioTemporal and Motion Encoding for Action Recognition

Cited by: 312
Authors
Jiang, Boyuan [1, 3]
Wang, MengMeng [2]
Gan, Weihao [2]
Wu, Wei [2]
Yan, Junjie [2]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] SenseTime Grp Ltd, Hong Kong, Peoples R China
[3] SenseTime, Hong Kong, Peoples R China
DOI
10.1109/ICCV.2019.00209
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
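To make the two kinds of features in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the internals of CSTM or CMM, so the code assumes that "channel-wise spatiotemporal" can be illustrated by a per-channel 1D convolution along time, and that "motion" can be illustrated by the difference between the features of adjacent frames. The function names, the `[T][C]` layout (one feature value per frame and channel, spatial dimensions omitted), and the summation with a residual connection are all hypothetical choices made for illustration.

```python
from typing import List

# Assumed layout: frames[t][c] is the feature of channel c at frame t.
Frames = List[List[float]]


def channelwise_temporal_conv(x: Frames, kernels: List[List[float]]) -> Frames:
    """Apply a separate size-3 temporal kernel to each channel (zero padding)."""
    num_frames, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        for t in range(num_frames):
            for i, weight in enumerate(kernels[c]):
                s = t + i - 1  # kernel taps cover frames t-1, t, t+1
                if 0 <= s < num_frames:
                    out[t][c] += weight * x[s][c]
    return out


def adjacent_frame_motion(x: Frames) -> Frames:
    """Feature-level motion as x[t+1] - x[t]; the last frame is padded with zeros."""
    num_frames, num_channels = len(x), len(x[0])
    motion = [
        [x[t + 1][c] - x[t][c] for c in range(num_channels)]
        for t in range(num_frames - 1)
    ]
    motion.append([0.0] * num_channels)
    return motion


def stm_like_block(x: Frames, kernels: List[List[float]]) -> Frames:
    """Sum both feature types and add the input back (assumed residual fusion)."""
    st = channelwise_temporal_conv(x, kernels)
    mo = adjacent_frame_motion(x)
    return [
        [x[t][c] + st[t][c] + mo[t][c] for c in range(len(x[0]))]
        for t in range(len(x))
    ]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [2.0, 0.0], [4.0, 1.0]]   # 3 frames, 2 channels
    kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]    # one kernel per channel
    print(stm_like_block(frames, kernels))
```

Both operations work on ordinary 2D-style feature maps indexed by frame, which is the sense in which such a design can avoid a 3D CNN stream and a separate optical-flow stream.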
Pages: 2000-2009
Page count: 10
Related papers
50 in total
  • [1] A spatiotemporal and motion information extraction network for action recognition
    Wang, Wei
    Wang, Xianmin
    Zhou, Mingliang
    Wei, Xuekai
    Li, Jing
    Ren, Xiaojun
    Zong, Xuemei
    WIRELESS NETWORKS, 2024, 30 (06) : 5389 - 5405
  • [2] Local motion feature extraction and spatiotemporal attention mechanism for action recognition
    Song, Xiaogang
    Zhang, Dongdong
    Liang, Li
    He, Min
    Hei, Xinhong
    VISUAL COMPUTER, 2023, 40 (11) : 7747 - 7759
  • [3] Criminal action recognition using spatiotemporal human motion acceleration descriptor
    Mir, Abinta Mehmood
    Yousaf, Muhammad Haroon
    Dawood, Hassan
    JOURNAL OF ELECTRONIC IMAGING, 2018, 27 (06)
  • [4] Auxiliary criterion conversion via spatiotemporal semantic encoding and feature entropy for action recognition
    Xiaoyan Meng
    Guoliang Zhang
    Songmin Jia
    Xiuzhi Li
    Xiangyin Zhang
    The Visual Computer, 2021, 37 : 1673 - 1690
  • [5] Auxiliary criterion conversion via spatiotemporal semantic encoding and feature entropy for action recognition
    Meng, Xiaoyan
    Zhang, Guoliang
    Jia, Songmin
    Li, Xiuzhi
    Zhang, Xiangyin
    VISUAL COMPUTER, 2021, 37 (07) : 1673 - 1690
  • [6] Natural Action Recognition Using Invariant 3D Motion Encoding
    Hadfield, Simon
    Lebeda, Karel
    Bowden, Richard
    COMPUTER VISION - ECCV 2014, PT II, 2014, 8690 : 758 - 771
  • [7] Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition
    Wang, Mengmeng
    Xing, Jiazheng
    Su, Jing
    Chen, Jun
    Liu, Yong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (03) : 3347 - 3362
  • [8] Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition
    Planamente, Mirco
    Bottino, Andrea
    Caputo, Barbara
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 8751 - 8758
  • [9] Exploiting Spatiotemporal Features for Action Recognition
    Bin Muslim, Usairam
    Khan, Muhammad Hassan
    Farid, Muhammad Shahid
    PROCEEDINGS OF 2021 INTERNATIONAL BHURBAN CONFERENCE ON APPLIED SCIENCES AND TECHNOLOGIES (IBCAST), 2021, : 613 - 619
  • [10] Spatiotemporal saliency for human action recognition
    Oikonomopoulos, A
    Patras, I
    Pantic, M
    2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, 2005, : 430 - 433