TubeR: Tubelet Transformer for Video Action Detection

Cited by: 32
Authors
Zhao, Jiaojiao [1 ]
Zhang, Yanyi [2 ]
Li, Xinyu [3 ]
Chen, Hao [3 ]
Shuai, Bing [3 ]
Xu, Mingze [3 ]
Liu, Chunhui [3 ]
Kundu, Kaustav [3 ]
Xiong, Yuanjun [3 ]
Modolo, Davide [3 ]
Marsic, Ivan [2 ]
Snoek, Cees G. M. [1 ]
Tighe, Joseph [3 ]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] AWS AI Labs, Palo Alto, CA USA
DOI
10.1109/CVPR52688.2022.01323
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose TubeR: a simple solution for spatio-temporal video action detection. Unlike existing methods that depend on either an offline actor detector or hand-designed actor-positional hypotheses such as proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively increases model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that utilizes short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21. Code will be available on GluonCV (https://cv.gluon.ai/).
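To make the tubelet-query mechanism described in the abstract concrete, below is a minimal PyTorch sketch of one decoder step: a learned set of queries spanning the clip's frames runs self-attention, cross-attends to flattened video features, and is decoded into per-frame boxes plus a per-tubelet classification. Everything here (class name, shapes, defaults, the single decoder layer, the mean-pooled classifier) is an illustrative assumption, not the authors' implementation; for that, see the GluonCV release referenced in the abstract.

import torch
import torch.nn as nn

# Minimal sketch of the tubelet-query idea, assuming a DETR-style decoder.
# All names, shapes and defaults (TubeletDecoderSketch, num_queries=15, ...)
# are illustrative assumptions, not taken from the released TubeR code.
class TubeletDecoderSketch(nn.Module):
    def __init__(self, embed_dim=256, num_queries=15, clip_len=8,
                 num_classes=80, num_heads=8):
        super().__init__()
        # One learned query per (tubelet, frame): a query sequence spans the
        # clip, so it can follow an actor whose box moves over time.
        self.queries = nn.Parameter(torch.randn(num_queries, clip_len, embed_dim))
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.box_head = nn.Linear(embed_dim, 4)                # per-frame (cx, cy, w, h)
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)  # +1 "no action" class

    def forward(self, video_feats):
        # video_feats: (B, T*H*W, embed_dim) flattened backbone features.
        B = video_feats.size(0)
        nq, t, d = self.queries.shape
        q = self.queries.flatten(0, 1).unsqueeze(0).expand(B, -1, -1)  # (B, nq*t, d)
        q = self.self_attn(q, q, q)[0] + q                # queries exchange information
        q = self.cross_attn(q, video_feats, video_feats)[0] + q  # queries read the video
        q = q.view(B, nq, t, d)
        boxes = self.box_head(q).sigmoid()                # (B, nq, t, 4) box per frame
        logits = self.cls_head(q.mean(dim=2))             # (B, nq, classes+1) per tubelet
        return boxes, logits

# Usage with fake backbone features: 2 clips, 8 frames, 14x14 feature maps.
feats = torch.randn(2, 8 * 14 * 14, 256)
boxes, logits = TubeletDecoderSketch()(feats)
print(boxes.shape, logits.shape)  # torch.Size([2, 15, 8, 4]) torch.Size([2, 15, 81])

Because each query carries one embedding per frame, a single tubelet hypothesis can regress a different box at every time step, which is the key difference from frame-level proposals or anchors fixed in the spatio-temporal volume.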
Pages: 13588-13597
Number of pages: 10
Related papers
50 items in total
  • [31] Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
    Thoker, Fida Mohammad
    Doughty, Hazel
    Snoek, Cees G. M.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 13766-13777
  • [32] MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
    Chen, Jiawei
    Ho, Chiu Man
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022: 786-797
  • [33] Action-Centric Relation Transformer Network for Video Question Answering
    Zhang, Jipeng
    Shao, Jie
    Cao, Rui
    Gao, Lianli
    Xu, Xing
    Shen, Heng Tao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01): 63-74
  • [34] Temporal Shift Vision Transformer Adapter for Efficient Video Action Recognition
    Shi, Yaning
    Sun, Pu
    Gu, Bing
    Li, Longfei
    PROCEEDINGS OF 2024 4TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND INTELLIGENT COMPUTING, BIC 2024, 2024: 42-46
  • [35] A Multi-Modal Transformer network for action detection
    Korban, Matthew
    Youngs, Peter
    Acton, Scott T.
    PATTERN RECOGNITION, 2023, 142
  • [36] LGAFormer: transformer with local and global attention for action detection
    Zhang, Haiping
    Zhou, Fuxing
    Wang, Dongjing
    Zhang, Xinhao
    Yu, Dongjin
    Guan, Liming
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (12): 17952-17979
  • [37] End-to-End Temporal Action Detection With Transformer
    Liu, Xiaolong
    Wang, Qimeng
    Hu, Yao
    Tang, Xu
    Zhang, Shiwei
    Bai, Song
    Bai, Xiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 5427-5441
  • [38] Improved Deepfake Video Detection Using Convolutional Vision Transformer
    Deressa, Deressa Wodajo
    Lambert, Peter
    Van Wallendael, Glenn
    Atnafu, Solomon
    Mareen, Hannes
    2024 IEEE GAMING, ENTERTAINMENT, AND MEDIA CONFERENCE, GEM 2024, 2024: 492-497
  • [39] Memory-Token Transformer for Unsupervised Video Anomaly Detection
    Li, Youyu
    Song, Xiaoning
    Xu, Tianyang
    Feng, Zhenhua
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022: 3325-3332
  • [40] Video Relation Detection via Tracklet based Visual Transformer
    Gao, Kaifeng
    Chen, Long
    Huang, Yifeng
    Xiao, Jun
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 4833-4837