TubeR: Tubelet Transformer for Video Action Detection

Cited by: 32
Authors
Zhao, Jiaojiao [1]
Zhang, Yanyi [2]
Li, Xinyu [3]
Chen, Hao [3]
Shuai, Bing [3]
Xu, Mingze [3]
Liu, Chunhui [3]
Kundu, Kaustav [3]
Xiong, Yuanjun [3]
Modolo, Davide [3]
Marsic, Ivan [2]
Snoek, Cees G. M. [1]
Tighe, Joseph [3]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] AWS AI Labs, Palo Alto, CA USA
DOI
10.1109/CVPR52688.2022.01323
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21. Code will be available on GluonCV (https://cv.gluon.ai/).
Pages: 13588-13597
Number of pages: 10
Related papers
50 in total
  • [21] TRCDNet: A Transformer Network for Video Cloud Detection
    Luo, Chen
    Feng, Shanshan
    Quan, Yingling
    Ye, Yunming
    Li, Xutao
    Xu, Yong
    Zhang, Baoquan
    Chen, Zhihao
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [22] Deepfake Video Detection with Spatiotemporal Dropout Transformer
    Zhang, Daichi
    Lin, Fanzhao
    Hua, Yingying
    Wang, Pengju
    Zeng, Dan
    Ge, Shiming
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5833 - 5841
  • [23] Body-part Tubelet Transformer for Human-Related Crime Classification
    Joseph, Ajay Mathew
    Ullah, Fath U. Min
    Talavera, Estefania
    2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE, AVSS 2024, 2024,
  • [24] TWO-PATHWAY TRANSFORMER NETWORK FOR VIDEO ACTION RECOGNITION
    Jiang, Bo
    Yu, Jiahong
    Zhou, Lei
    Wu, Kailin
    Yang, Yang
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1089 - 1093
  • [25] SVFormer: Semi-supervised Video Transformer for Action Recognition
    Xing, Zhen
    Dai, Qi
    Hu, Han
    Chen, Jingjing
    Wu, Zuxuan
    Jiang, Yu-Gang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18816 - 18826
  • [26] FSformer: Fast-Slow Transformer for video action recognition
    Li, Shibao
    Wang, Zhaoyu
    Liu, Yixuan
    Zhang, Yunwu
    Zhu, Jinze
    Cui, Xuerong
    Liu, Jianhang
    IMAGE AND VISION COMPUTING, 2023, 137
  • [27] Holistic Interaction Transformer Network for Action Detection
    Faure, Gueter Josmy
    Chen, Min-Hung
    Lai, Shang-Hong
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 3329 - 3339
  • [28] SCOTCH and SODA: A Transformer Video Shadow Detection Framework
    Liu, Lihao
    Prost, Jean
    Zhu, Lei
    Papadakis, Nicolas
    Lio, Pietro
    Schonlieb, Carola-Bibiane
    Aviles-Rivero, Angelica I.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10449 - 10458
  • [29] Learning Video Localization on Segment-Level Video Copy Detection with Transformer
    Zhang, Chi
    Liu, Jie
    Zhang, Shuwu
    Zeng, Zhi
    Huang, Ying
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 439 - 450
  • [30] Video Sparse Transformer With Attention-Guided Memory for Video Object Detection
    Fujitake, Masato
    Sugimoto, Akihiro
    IEEE ACCESS, 2022, 10 : 65886 - 65900