End-to-End Temporal Action Detection With Transformer

Cited by: 70

Authors
Liu, Xiaolong [1 ]
Wang, Qimeng [1 ]
Hu, Yao [2 ]
Tang, Xu [2 ]
Zhang, Shiwei [3 ]
Bai, Song [4 ]
Bai, Xiang [5 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Alibaba Grp, Beijing 100102, Peoples R China
[3] Alibaba Grp, Hangzhou 311121, Peoples R China
[4] ByteDance Inc, Singapore 048583, Singapore
[5] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
Keywords
Pipelines; Transformers; Proposals; Training; Feature extraction; Task analysis; Detectors; Transformer; temporal action detection; temporal action localization; action recognition
DOI
10.1109/TIP.2022.3195321
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
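The core idea in the abstract, temporal deformable attention, can be illustrated with a minimal sketch: each action query attends to a small set of K sampled snippets (reference point plus learned offsets, linearly interpolated) rather than to the full sequence. This is a NumPy mock-up for exposition only; the function name, shapes, and sampling details are assumptions, not the authors' released implementation:

```python
import numpy as np

def temporal_deformable_attention(snippet_feats, ref_points, offsets, attn_weights):
    """Sketch of 1D deformable attention over video snippet features.

    snippet_feats: (T, C) per-snippet features
    ref_points:    (Q,) reference location per query, normalized to [0, 1]
    offsets:       (Q, K) learned sampling offsets in normalized time
    attn_weights:  (Q, K) attention weights, normalized over K
    Returns (Q, C) context features, one per action query.
    """
    T, C = snippet_feats.shape
    Q, K = offsets.shape
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            # Sampling location in [0, T-1], clipped to the valid range.
            loc = np.clip((ref_points[q] + offsets[q, k]) * (T - 1), 0, T - 1)
            lo, hi = int(np.floor(loc)), int(np.ceil(loc))
            w_hi = loc - lo  # linear interpolation between neighboring snippets
            feat = (1 - w_hi) * snippet_feats[lo] + w_hi * snippet_feats[hi]
            out[q] += attn_weights[q, k] * feat
    return out

# Tiny demo with random inputs (all sizes are illustrative).
rng = np.random.default_rng(0)
T, C, Q, K = 16, 8, 4, 4
feats = rng.normal(size=(T, C))
refs = rng.uniform(size=Q)
offs = rng.normal(scale=0.1, size=(Q, K))
w = rng.uniform(size=(Q, K))
w /= w.sum(axis=1, keepdims=True)  # normalize weights over the K samples
ctx = temporal_deformable_attention(feats, refs, offs, w)
print(ctx.shape)  # (4, 8)
```

Because each query touches only K locations, the cost is O(Q·K) instead of O(Q·T), which is the locality/efficiency property the abstract attributes to this module.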
Pages: 5427-5441 (15 pages)
Related Papers (50 in total)
  • [1] An Empirical Study of End-to-End Temporal Action Detection
    Liu, Xiaolong
    Bai, Song
    Bai, Xiang
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19978 - 19987
  • [2] SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection
    Kingra, Staffy
    Aggarwal, Naveen
    Kaur, Nirmal
    [J]. FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 51
  • [3] An End-to-End Spatial-Temporal Transformer Model for Surgical Action Triplet Recognition
    Zou, Xiaoyang
    Yu, Derong
    Tao, Rong
    Zheng, Guoyan
    [J]. 12TH ASIAN-PACIFIC CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING, VOL 2, APCMBE 2023, 2024, 104 : 114 - 120
  • [4] End-to-End Temporal Action Detection Using Bag of Discriminant Snippets
    Murtaza, Fiza
    Yousaf, Muhammad Haroon
    Velastin, Sergio A.
    Qian, Yu
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2019, 26 (02) : 272 - 276
  • [5] DITA: DETR with improved queries for end-to-end temporal action detection
    Lu, Chongkai
    Mak, Man-Wai
    [J]. NEUROCOMPUTING, 2024, 596
  • [6] End-to-end lane detection with convolution and transformer
    Ge, Zekun
    Ma, Chao
    Fu, Zhumu
    Song, Shuzhong
    Si, Pengju
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 29607 - 29627
  • [8] Spatial–temporal transformer for end-to-end sign language recognition
    Cui, Zhenchao
    Zhang, Wenbo
    Li, Zhaoxin
    Wang, Zhaoqi
    [J]. COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 : 4645 - 4656
  • [9] SRDD: a lightweight end-to-end object detection with transformer
    Zhu, Yuan
    Xia, Qingyuan
    Jin, Wen
    [J]. CONNECTION SCIENCE, 2022, 34 (01) : 2448 - 2465
  • [10] Transformer Based End-to-End Mispronunciation Detection and Diagnosis
    Wu, Minglin
    Li, Kun
    Leung, Wai-Kim
    Meng, Helen
    [J]. INTERSPEECH 2021, 2021, : 3954 - 3958