Collaborative Spatiotemporal Feature Learning for Video Action Recognition

Cited by: 121
Authors
Li, Chao [1 ]
Zhong, Qiaoyong [1 ]
Xie, Di [1 ]
Pu, Shiliang [1 ]
Affiliation
[1] Hikvision Research Institute, Hangzhou, People's Republic of China
DOI
10.1109/CVPR.2019.00806
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). In this paper, we propose a novel neural operation which encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters. In particular, we perform 2D convolution along three orthogonal views of volumetric video data, which learns spatial appearance and temporal motion cues respectively. By sharing the convolution kernels of different views, spatial and temporal features are collaboratively learned and thus benefit from each other. The complementary features are subsequently fused by a weighted summation whose coefficients are learned end-to-end. Our approach achieves state-of-the-art performance on large-scale benchmarks and won 1st place in the Moments in Time Challenge 2018. Moreover, based on the learned coefficients of different views, we are able to quantify the contributions of spatial and temporal features. This analysis sheds light on the interpretability of the model and may also guide the future design of algorithms for video recognition.
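The operation described in the abstract can be sketched roughly as follows. This is a minimal single-channel illustration in plain Python, not the authors' implementation. The function names `conv2d_same` and `cost_sketch`, the zero padding, the odd square kernel, and the softmax normalization of the fusion coefficients are all assumptions made for the example. The abstract only states that one 2D kernel is shared across three orthogonal views and that the results are fused by a learned weighted sum.

```python
import math


def conv2d_same(plane, kernel):
    """Zero-padded 'same' 2D cross-correlation of one plane with a k x k kernel.

    Assumes k is odd so the output has the same size as the input plane.
    """
    rows, cols = len(plane), len(plane[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - r, j + dj - r
                    if 0 <= ii < rows and 0 <= jj < cols:
                        acc += kernel[di][dj] * plane[ii][jj]
            out[i][j] = acc
    return out


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cost_sketch(video, kernel, raw_coeffs):
    """Apply one shared 2D kernel to three views of video[t][h][w] and fuse.

    raw_coeffs holds three unnormalized fusion weights (hypothetical stand-ins
    for the learned coefficients); softmax normalization is an assumption.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])

    # H-W view (spatial appearance): one plane per time step, indexed [t][h][w].
    hw = [conv2d_same(video[t], kernel) for t in range(T)]

    # T-W view (temporal motion): one plane per image row, indexed [h][t][w].
    tw = [conv2d_same([video[t][h] for t in range(T)], kernel)
          for h in range(H)]

    # T-H view (temporal motion): one plane per image column, indexed [w][t][h].
    th = [conv2d_same([[video[t][h][w] for h in range(H)] for t in range(T)],
                      kernel)
          for w in range(W)]

    a_hw, a_tw, a_th = softmax(raw_coeffs)

    # Weighted summation of the three view responses at every voxel.
    return [[[a_hw * hw[t][h][w] + a_tw * tw[h][t][w] + a_th * th[w][t][h]
              for w in range(W)]
             for h in range(H)]
            for t in range(T)]
```

Because the same `kernel` is passed to all three views, any update to it would affect the spatial and temporal responses together, which is the weight-sharing constraint in miniature. After training, the normalized coefficients returned by `softmax(raw_coeffs)` could be read as the relative contribution of each view, in the spirit of the analysis the abstract mentions.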
Pages: 7864-7873
Number of pages: 10
Related papers
50 in total
  • [1] Recurrent Spatiotemporal Feature Learning for Action Recognition
    Chen, Ze
    Lu, Hongtao
    [J]. ICRAI 2018: PROCEEDINGS OF 2018 4TH INTERNATIONAL CONFERENCE ON ROBOTICS AND ARTIFICIAL INTELLIGENCE, 2018, : 12 - 17
  • [2] Spatiotemporal Saliency Representation Learning for Video Action Recognition
    Kong, Yongqiang
    Wang, Yunhong
    Li, Annan
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1515 - 1528
  • [3] Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition
    Ahsan, Unaiza
    Madhok, Rishi
    Essa, Irfan
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 179 - 189
  • [4] Collaborative multimodal feature learning for RGB-D action recognition
    Kong, Jun
    Liu, Tianshan
    Jiang, Min
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2019, 59 : 537 - 549
  • [5] Metric-Based Attention Feature Learning for Video Action Recognition
    Kim, Dae Ha
    Anvarov, Fazliddin
    Lee, Jun Min
    Song, Byung Cheol
    [J]. IEEE ACCESS, 2021, 9 : 39218 - 39228
  • [6] Spatiotemporal feature enhancement network for action recognition
    Huang, Guancheng
    Wang, Xiuhui
    Li, Xuesheng
    Wang, Yaru
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (19) : 57187 - 57197
  • [7] Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition
    Wu, Hanbo
    Ma, Xin
    Li, Yibin
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 1250 - 1261
  • [8] Spatiotemporal Residual Networks for Video Action Recognition
    Feichtenhofer, Christoph
    Pinz, Axel
    Wildes, Richard P.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
  • [9] Spatiotemporal Pyramid Network for Video Action Recognition
    Wang, Yunbo
    Long, Mingsheng
    Wang, Jianmin
    Yu, Philip S.
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2097 - 2106
  • [10] Spatiotemporal Fusion Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    Zhang, Junxuan
    [J]. NEURAL PROCESSING LETTERS, 2019, 50 : 1877 - 1890