TQRFormer: Tubelet query recollection transformer for action detection

Cited by: 1

Authors:

Wang, Xiangyang [1]

Yang, Kun [1]

Ding, Qiang [2]

Wang, Rui [1]

Sun, Jinhua [2]

Affiliations:

[1] Shanghai Univ, Sch Commun & Informat Engn, Shanghai, Peoples R China

[2] Fudan Univ, Natl Childrens Med Ctr, Dept Psychol Med, Childrens Hosp, Shanghai 201102, Peoples R China

Source:

IMAGE AND VISION COMPUTING | 2024 / Vol. 147

Funding:

Natural Science Foundation of Shanghai; National Natural Science Foundation of China;

Keywords:

Spatio-temporal action detection; Transformer; Query recollection; Matching strategy; Long-term context;

DOI:

10.1016/j.imavis.2024.105059

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Spatial and temporal action detection aims to precisely locate actions while predicting their respective categories. The existing solution, TubeR (Zhao et al., 2022), is designed to directly detect action tubes in videos by recognizing and localizing actions using a unified representation. However, a potential challenge arises during the decoding stage, leading to a gradual decrease in the model's performance in action detection, specifically in terms of the confidence associated with detected actions. In this paper, we propose TQRFormer: Tubelet Query Recollection Transformer, enabling the subsequent decoder to obtain information from the previous stage. Specifically, we designed Query Recollection Attention to correct errors and output the synthesized results, effectively breaking the limitations of sequential decoding. During the training stage, TubeR (Zhao et al., 2022) generates a limited number of positive sample queries through a one-to-one matching strategy, potentially impacting the effectiveness of training with positive samples. To enhance the quantity of positive samples, we propose a stage matching approach that combines both one-to-many matching and one-to-one matching without additional queries. This approach serves to boost the overall number of positive samples for improved training outcomes. We also propose a more elegant classification head that contains information on the start and end frames of the small tubes, eliminating the necessity for a separate action switch. The performance of TQRFormer is superior to previous state-of-the-art technologies on public action detection datasets, including AVA, UCF101-24, JHMDB-21 and MultiSports. The code will be available at https://github.com/ykyk000/TQRFormer.


Pages: 11

50 items in total

[1] TubeR: Tubelet Transformer for Video Action Detection
Zhao, Jiaojiao
Zhang, Yanyi
Li, Xinyu
Chen, Hao
Shuai, Bing
Xu, Mingze
Liu, Chunhui
Kundu, Kaustav
Xiong, Yuanjun
Modolo, Davide
Marsic, Ivan
Snoek, Cees G. M.
Tighe, Joseph
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13588 - 13597
[2] Discriminative action tubelet detector for weakly-supervised action detection
Lee, Jiyoung
Kim, Seungryong
Kim, Sunok
Sohn, Kwanghoon
PATTERN RECOGNITION, 2024, 155
[3] Recurrent Tubelet Proposal and Recognition Networks for Action Detection
Li, Dong
Qiu, Zhaofan
Dai, Qi
Yao, Ting
Mei, Tao
COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 306 - 322
[4] Online Action Detection by Long Short-term Transformer with Query Exemplars-transformer
Zhang, Honglei
Guo, Yijing
Dui, Xiaofu
2024 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN 2024, 2024,
[5] Three-Stream Action Tubelet Detector for Spatiotemporal Action Detection in Videos
Wu, Yutang
Wang, Hanli
Li, Qinyu
ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2018, PT II, 2018, 11165 : 296 - 306
[6] ENHANCED ACTION TUBELET DETECTOR FOR SPATIO-TEMPORAL VIDEO ACTION DETECTION
Wu, Yutang
Wang, Hanli
Wang, Shuheng
Li, Qinyu
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 2388 - 2392
[7] Enhanced Training of Query-Based Object Detection via Selective Query Recollection
Chen, Fangyi
Zhang, Han
Hu, Kai
Huang, Yu-Kai
Zhu, Chenchen
Savvides, Marios
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23756 - 23765
[8] Generic Tubelet Proposals for Action Localization
He, Jiawei
Deng, Zhiwei
Ibrahim, Mostafa S.
Mori, Greg
2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 343 - 351
[9] Action Tubelet Detector for Spatio-Temporal Action Localization
Kalogeiton, Vicky
Weinzaepfel, Philippe
Ferrari, Vittorio
Schmid, Cordelia
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 4415 - 4423
[10] Object Detection in Videos with Tubelet Proposal Networks
Kang, Kai
Li, Hongsheng
Xiao, Tong
Ouyang, Wanli
Yan, Junjie
Liu, Xihui
Wang, Xiaogang
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 889 - 897
