End-to-End Temporal Action Detection With Transformer

Cited by: 70

Authors
Liu, Xiaolong [1 ]
Wang, Qimeng [1 ]
Hu, Yao [2 ]
Tang, Xu [2 ]
Zhang, Shiwei [3 ]
Bai, Song [4 ]
Bai, Xiang [5 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Peoples R China
[2] Alibaba Grp, Beijing 100102, Peoples R China
[3] Alibaba Grp, Hangzhou 311121, Peoples R China
[4] ByteDance Inc, Singapore 048583, Singapore
[5] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Wuhan 430074, Peoples R China
Keywords
Pipelines; Transformers; Proposals; Training; Feature extraction; Task analysis; Detectors; Transformer; temporal action detection; temporal action localization; action recognition
DOI
10.1109/TIP.2022.3195321
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
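The core idea in the abstract, temporal deformable attention, can be illustrated with a minimal sketch: each action query attends to a small set of K sampled snippets (reference point plus learned offsets, linearly interpolated) rather than to the full sequence. This is a NumPy mock-up for exposition only; the function name, shapes, and sampling details are assumptions, not the authors' released implementation:

```python
import numpy as np

def temporal_deformable_attention(snippet_feats, ref_points, offsets, attn_weights):
    """Sketch of 1D deformable attention over video snippet features.

    snippet_feats: (T, C) per-snippet features
    ref_points:    (Q,) reference location per query, normalized to [0, 1]
    offsets:       (Q, K) learned sampling offsets in normalized time
    attn_weights:  (Q, K) attention weights, normalized over K
    Returns (Q, C) context features, one per action query.
    """
    T, C = snippet_feats.shape
    Q, K = offsets.shape
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            # Sampling location in [0, T-1], clipped to the valid range.
            loc = np.clip((ref_points[q] + offsets[q, k]) * (T - 1), 0, T - 1)
            lo, hi = int(np.floor(loc)), int(np.ceil(loc))
            w_hi = loc - lo  # linear interpolation between neighboring snippets
            feat = (1 - w_hi) * snippet_feats[lo] + w_hi * snippet_feats[hi]
            out[q] += attn_weights[q, k] * feat
    return out

# Tiny demo with random inputs (all sizes are illustrative).
rng = np.random.default_rng(0)
T, C, Q, K = 16, 8, 4, 4
feats = rng.normal(size=(T, C))
refs = rng.uniform(size=Q)
offs = rng.normal(scale=0.1, size=(Q, K))
w = rng.uniform(size=(Q, K))
w /= w.sum(axis=1, keepdims=True)  # normalize weights over the K samples
ctx = temporal_deformable_attention(feats, refs, offs, w)
print(ctx.shape)  # (4, 8)
```

Because each query touches only K locations, the cost is O(Q·K) instead of O(Q·T), which is the locality/efficiency property the abstract attributes to this module.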
Pages: 5427-5441 (15 pages)
Related Papers (50 in total)
  • [1] An Empirical Study of End-to-End Temporal Action Detection
    Liu, Xiaolong
    Bai, Song
    Bai, Xiang
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19978 - 19987
  • [2] SFormer: An end-to-end spatio-temporal transformer architecture for deepfake detection
    Kingra, Staffy
    Aggarwal, Naveen
    Kaur, Nirmal
    [J]. FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 51
  • [3] An End-to-End Spatial-Temporal Transformer Model for Surgical Action Triplet Recognition
    Zou, Xiaoyang
    Yu, Derong
    Tao, Rong
    Zheng, Guoyan
    [J]. 12TH ASIAN-PACIFIC CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING, VOL 2, APCMBE 2023, 2024, 104 : 114 - 120
  • [4] End-to-End Temporal Action Detection Using Bag of Discriminant Snippets
    Murtaza, Fiza
    Yousaf, Muhammad Haroon
    Velastin, Sergio A.
    Qian, Yu
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2019, 26 (02) : 272 - 276
  • [5] DITA: DETR with improved queries for end-to-end temporal action detection
    Lu, Chongkai
    Mak, Man-Wai
    [J]. NEUROCOMPUTING, 2024, 596
  • [6] End-to-end lane detection with convolution and transformer
    Ge, Zekun
    Ma, Chao
    Fu, Zhumu
    Song, Shuzhong
    Si, Pengju
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 29607 - 29627
  • [8] Spatial–temporal transformer for end-to-end sign language recognition
    Cui, Zhenchao
    Zhang, Wenbo
    Li, Zhaoxin
    Wang, Zhaoqi
    [J]. COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 : 4645 - 4656
  • [9] SRDD: a lightweight end-to-end object detection with transformer
    Zhu, Yuan
    Xia, Qingyuan
    Jin, Wen
    [J]. CONNECTION SCIENCE, 2022, 34 (01) : 2448 - 2465
  • [10] Transformer Based End-to-End Mispronunciation Detection and Diagnosis
    Wu, Minglin
    Li, Kun
    Leung, Wai-Kim
    Meng, Helen
    [J]. INTERSPEECH 2021, 2021, : 3954 - 3958