TQRFormer: Tubelet query recollection transformer for action detection

Cited by: 1
Authors
Wang, Xiangyang [1 ]
Yang, Kun [1 ]
Ding, Qiang [2 ]
Wang, Rui [1 ]
Sun, Jinhua [2 ]
Affiliations
[1] Shanghai Univ, Sch Commun & Informat Engn, Shanghai, Peoples R China
[2] Fudan Univ, Natl Childrens Med Ctr, Dept Psychol Med, Childrens Hosp, Shanghai 201102, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Spatio-temporal action detection; Transformer; Query recollection; Matching strategy; Long-term context;
DOI
10.1016/j.imavis.2024.105059
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spatial and temporal action detection aims to precisely locate actions while predicting their respective categories. The existing solution, TubeR (Zhao et al., 2022), is designed to directly detect action tubes in videos by recognizing and localizing actions using a unified representation. However, a potential challenge arises during the decoding stage, leading to a gradual decrease in the model's action-detection performance, specifically in the confidence associated with detected actions. In this paper, we propose TQRFormer: Tubelet Query Recollection Transformer, which enables each subsequent decoder to obtain information from the previous stage. Specifically, we design Query Recollection Attention to correct errors and output synthesized results, effectively breaking the limitations of sequential decoding. During training, TubeR (Zhao et al., 2022) generates a limited number of positive-sample queries through a one-to-one matching strategy, potentially limiting the effectiveness of training with positive samples. To increase the number of positive samples, we propose a stage matching approach that combines one-to-many matching and one-to-one matching without additional queries, boosting the overall number of positive samples for improved training outcomes. We also propose a more elegant classification head that encodes the start- and end-frame information of the tubelets, eliminating the need for a separate action switch. TQRFormer outperforms previous state-of-the-art methods on public action detection datasets, including AVA, UCF101-24, JHMDB-21 and MultiSports. The code will be available at https://github.com/ykyk000/TQRFormer.
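The stage matching idea from the abstract (one-to-many matching in earlier stages to enlarge the positive-sample pool, one-to-one matching for the final assignment) can be illustrated with a toy sketch. This is not the paper's code: the greedy matcher, the function names, and the cost matrix are assumptions for illustration only.

```python
# Illustrative sketch of stage matching: rows of `cost` are ground-truth
# tubelets, columns are decoder queries, entries are matching costs.

def one_to_many_match(cost, k=3):
    """Each ground truth claims its k cheapest queries, so more queries
    serve as positive samples during training."""
    return {gt: sorted(range(len(row)), key=lambda q: row[q])[:k]
            for gt, row in enumerate(cost)}

def one_to_one_match(cost):
    """Greedy one-to-one matching: each ground truth (in row order)
    claims its cheapest still-unassigned query."""
    assigned, matches = set(), {}
    for gt, row in enumerate(cost):
        best = min((q for q in range(len(row)) if q not in assigned),
                   key=lambda q: row[q])
        assigned.add(best)
        matches[gt] = [best]
    return matches

# Toy cost matrix: 2 ground-truth tubelets, 4 queries.
cost = [[0.9, 0.1, 0.4, 0.8],
        [0.2, 0.7, 0.3, 0.6]]

print(one_to_many_match(cost, k=2))  # {0: [1, 2], 1: [0, 2]}
print(one_to_one_match(cost))        # {0: [1], 1: [0]}
```

Under one-to-many matching, queries 1 and 2 both supervise ground truth 0, whereas one-to-one matching keeps a single positive query per ground truth; the paper reports combining the two without adding extra queries.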
Pages: 11
Related papers
50 items in total
  • [31] Fast Action Detection with One Query Example Based on Hough Voting
    Pei, Lishen
    Ye, Mao
    PATTERN RECOGNITION, 2012, 321 : 129-136
  • [32] Event Tubelet Compressor: Generating Compact Representations for Event-Based Action Recognition
    Xie, Bochen
    Deng, Yongjian
    Shao, Zhanpeng
    Liu, Hai
    Xu, Qingsong
    Li, Youfu
    2022 7TH INTERNATIONAL CONFERENCE ON CONTROL, ROBOTICS AND CYBERNETICS, CRC, 2022, : 12-16
  • [33] An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
    Weng, Yuetian
    Pan, Zizheng
    Han, Mingfei
    Chang, Xiaojun
    Zhuang, Bohan
    COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 358-375
  • [34] Long Short-Term Transformer for Online Action Detection
    Xu, Mingze
    Xiong, Yuanjun
    Chen, Hao
    Li, Xinyu
    Xia, Wei
    Tu, Zhuowen
    Soatto, Stefano
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [35] Visual Explanation With Action Query Transformer in Deep Reinforcement Learning and Visual Feedback via Augmented Reality
    Itaya, Hidenori
    Yin, Wantao
    Hirakawa, Tsubasa
    Yamashita, Takayoshi
    Fujiyoshi, Hironobu
    Sugiura, Komei
    IEEE ACCESS, 2025, 13 : 56338-56354
  • [36] DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion
    Guo, Junjie
    Gao, Chenqiang
    Liu, Fangcen
    Meng, Deyu
    Gao, Xinbo
    COMPUTER VISION - ECCV 2024, PT XXVII, 2025, 15085 : 464-481
  • [37] AnchorPoint: Query Design for Transformer-Based 3D Object Detection and Tracking
    Liu, Hao
    Ma, Yanni
    Wang, Hanyun
    Zhang, Chaobo
    Guo, Yulan
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2023, 24 (10) : 10988-11000
  • [38] Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos
    Cores, Daniel
    Brea, Victor M.
    Mucientes, Manuel
    APPLIED INTELLIGENCE, 2023, 53 (01) : 1205-1217
  • [39] PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
    Tan, Jing
    Zhao, Xiaotong
    Shi, Xintian
    Kang, Bin
    Wang, Limin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [40] A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection
    Korban, Matthew
    Youngs, Peter
    Acton, Scott T.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (09) : 6055-6069