STEP: Spatio-Temporal Progressive Learning for Video Action Detection

Cited by: 83
Authors
Yang, Xitong [1, 4]
Yang, Xiaodong [2]
Liu, Ming-Yu [2]
Xiao, Fanyi [3, 4]
Davis, Larry [1]
Kautz, Jan [2]
Affiliations
[1] Univ Maryland, College Pk, MD 20742 USA
[2] NVIDIA, Santa Clara, CA USA
[3] Univ Calif Davis, Davis, CA 95616 USA
[4] NVIDIA Res, Santa Clara, CA USA
DOI
10.1109/CVPR.2019.00035
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos. Starting from a handful of coarse-scale proposal cuboids, our approach progressively refines the proposals towards actions over a few steps. In this way, high-quality proposals (i.e., proposals that adhere to action movements) can be gradually obtained at later steps by leveraging the regression outputs from previous steps. At each step, we adaptively extend the proposals in time to incorporate more related temporal context. Compared to prior work that performs action detection in one run, our progressive learning framework is able to naturally handle the spatial displacement within action tubes and therefore provides a more effective way for spatio-temporal modeling. We extensively evaluate our approach on UCF101 and AVA, and demonstrate superior detection results. Remarkably, we achieve an mAP of 75.0% and 18.6% on the two datasets with 3 progressive steps, using only 11 and 34 initial proposals, respectively.
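The refine-then-extend loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Cuboid` type, the `regress` callback, the fixed `pad` used for temporal extension (the abstract describes the extension as adaptive), and the toy regressor are all hypothetical names and simplifications introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Cuboid:
    """Hypothetical proposal: one box held over a frame range (inclusive)."""
    box: Box
    t_start: int
    t_end: int


def extend_in_time(c: Cuboid, pad: int, num_frames: int) -> Cuboid:
    # Simplification: a fixed symmetric padding, clipped to the video length.
    # The abstract says the extension is adaptive; that logic is not shown here.
    return Cuboid(c.box, max(0, c.t_start - pad), min(num_frames - 1, c.t_end + pad))


def progressive_refine(
    proposals: List[Cuboid],
    regress: Callable[[Cuboid], Box],
    steps: int,
    pad: int,
    num_frames: int,
) -> List[Cuboid]:
    """Each step feeds the previous step's regressed boxes back in as proposals."""
    current = list(proposals)
    for _ in range(steps):
        # Refine spatially using the (assumed) regressor output.
        current = [Cuboid(regress(c), c.t_start, c.t_end) for c in current]
        # Then widen the temporal window to take in more context.
        current = [extend_in_time(c, pad, num_frames) for c in current]
    return current


def make_toy_regressor(target: Box) -> Callable[[Cuboid], Box]:
    """Stand-in for a learned regressor: moves each box halfway to a fixed target."""
    def regress(c: Cuboid) -> Box:
        return tuple(b + 0.5 * (t - b) for b, t in zip(c.box, target))
    return regress
```

With a coarse starting box and three steps, the toy regressor moves the proposal steadily closer to the target while the temporal window grows, which mirrors the qualitative behaviour the abstract describes without reproducing the actual model.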
Pages: 264-272
Number of pages: 9