TubeR: Tubelet Transformer for Video Action Detection

Cited by: 32
Authors
Zhao, Jiaojiao [1]
Zhang, Yanyi [2]
Li, Xinyu [3]
Chen, Hao [3]
Shuai, Bing [3]
Xu, Mingze [3]
Liu, Chunhui [3]
Kundu, Kaustav [3]
Xiong, Yuanjun [3]
Modolo, Davide [3]
Marsic, Ivan [2]
Snoek, Cees G. M. [1]
Tighe, Joseph [3]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] AWS AI Labs, Palo Alto, CA USA
DOI
10.1109/CVPR52688.2022.01323
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21. Code will be available on GluonCV (https://cv.gluon.ai/).
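The abstract describes tubelets of variable length whose temporal extent is set by an action switch. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The `Tubelet` class, the `switch_probs` field, the `trim_tubelet` function, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tubelet:
    """A sequence of per-frame boxes for one actor, plus one action label."""
    boxes: List[Box]           # one box per frame in the clip
    switch_probs: List[float]  # assumed per-frame "action is on" scores in [0, 1]
    label: str


def trim_tubelet(
    tubelet: Tubelet, threshold: float = 0.5
) -> Optional[Tuple[int, int, List[Box]]]:
    """Return (start, end, boxes) spanning the first to the last frame whose
    switch score reaches the threshold, or None if no frame does."""
    active = [i for i, p in enumerate(tubelet.switch_probs) if p >= threshold]
    if not active:
        return None
    start, end = active[0], active[-1]
    return start, end, tubelet.boxes[start:end + 1]


example = Tubelet(
    boxes=[(10.0 + i, 20.0, 50.0 + i, 90.0) for i in range(5)],
    switch_probs=[0.1, 0.7, 0.9, 0.6, 0.2],
    label="stand",
)
trimmed = trim_tubelet(example)
```

Here the trimmed tubelet covers frames 1 through 3, so its length depends on the switch scores rather than on the clip length. How TubeR actually parameterizes and trains the switch is described in the paper, not in this record.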
Pages: 13588-13597
Page count: 10
Related papers
50 in total
  • [1] TQRFormer: Tubelet query recollection transformer for action detection
    Wang, Xiangyang
    Yang, Kun
    Ding, Qiang
    Wang, Rui
    Sun, Jinhua
    IMAGE AND VISION COMPUTING, 2024, 147
  • [2] Enhanced Action Tubelet Detector for Spatio-Temporal Video Action Detection
    Wu, Yutang
    Wang, Hanli
    Wang, Shuheng
    Li, Qinyu
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 2388 - 2392
  • [3] Detection and tracking based tubelet generation for video object detection
    Wang, Bin
    Tang, Sheng
    Xiao, Jun-Bin
    Yan, Quan-Feng
    Zhang, Yong-Dong
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2019, 58 : 102 - 111
  • [4] Social Fabric: Tubelet Compositions for Video Relation Detection
    Chen, Shuo
    Shi, Zenglin
    Mettes, Pascal
    Snoek, Cees G. M.
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13465 - 13474
  • [5] Discriminative action tubelet detector for weakly-supervised action detection
    Lee, Jiyoung
    Kim, Seungryong
    Kim, Sunok
    Sohn, Kwanghoon
    PATTERN RECOGNITION, 2024, 155
  • [6] Recurrent Tubelet Proposal and Recognition Networks for Action Detection
    Li, Dong
    Qiu, Zhaofan
    Dai, Qi
    Yao, Ting
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 306 - 322
  • [7] Three-Stream Action Tubelet Detector for Spatiotemporal Action Detection in Videos
    Wu, Yutang
    Wang, Hanli
    Li, Qinyu
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2018, PT II, 2018, 11165 : 296 - 306
  • [8] Video Action Transformer Network
    Girdhar, Rohit
    Carreira, Joao
    Doersch, Carl
    Zisserman, Andrew
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 244 - 253
  • [9] Generic Tubelet Proposals for Action Localization
    He, Jiawei
    Deng, Zhiwei
    Ibrahim, Mostafa S.
    Mori, Greg
    2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 343 - 351
  • [10] Recurring the Transformer for Video Action Recognition
    Yang, Jiewen
    Dong, Xingbo
    Liu, Liujun
    Zhang, Chao
    Shen, Jiajun
    Yu, Dahai
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14043 - 14053