TQRFormer: Tubelet query recollection transformer for action detection

Cited by: 1
Authors
Wang, Xiangyang [1 ]
Yang, Kun [1 ]
Ding, Qiang [2 ]
Wang, Rui [1 ]
Sun, Jinhua [2 ]
Affiliations
[1] Shanghai Univ, Sch Commun & Informat Engn, Shanghai, Peoples R China
[2] Fudan Univ, Natl Childrens Med Ctr, Dept Psychol Med, Childrens Hosp, Shanghai 201102, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Spatio-temporal action detection; Transformer; Query recollection; Matching strategy; Long-term context;
DOI
10.1016/j.imavis.2024.105059
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spatio-temporal action detection aims to precisely localize actions while predicting their categories. The existing solution, TubeR (Zhao et al., 2022), directly detects action tubes in videos, recognizing and localizing actions with a unified representation. However, a challenge arises during the decoding stage: the model's action detection performance, specifically the confidence associated with detected actions, gradually decreases. In this paper, we propose TQRFormer: Tubelet Query Recollection Transformer, which enables each subsequent decoder stage to obtain information from the previous one. Specifically, we design Query Recollection Attention to correct errors and output synthesized results, effectively breaking the limitations of sequential decoding. During training, TubeR (Zhao et al., 2022) generates a limited number of positive sample queries through its one-to-one matching strategy, potentially limiting the effectiveness of training with positive samples. To increase the number of positive samples, we propose a stage matching approach that combines one-to-many matching and one-to-one matching without additional queries, boosting the overall number of positives for improved training outcomes. We also propose a more elegant classification head that encodes the start and end frames of the small tubes, eliminating the need for a separate action switch. TQRFormer outperforms previous state-of-the-art methods on public action detection datasets, including AVA, UCF101-24, JHMDB-21 and MultiSports. The code will be available at https://github.com/ykyk000/TQRFormer.
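The stage matching idea in the abstract — one-to-many matching in earlier decoder stages to enlarge the pool of positive queries, one-to-one matching for the final assignment — can be illustrated with a minimal sketch. This is purely illustrative and assumes a simple cost matrix between queries and ground-truth action tubes; the function names, the greedy approximation of one-to-one matching, and the choice of k are hypothetical, not the authors' implementation.

```python
def one_to_one(cost):
    """Greedy approximation of one-to-one (Hungarian-style) matching.

    cost[q][g] is the matching cost of query q against ground truth g.
    Each query and each ground truth is used at most once, so the number
    of positive queries is bounded by the number of ground-truth tubes.
    """
    pairs, used_q, used_g = [], set(), set()
    flat = sorted((c, q, g) for q, row in enumerate(cost) for g, c in enumerate(row))
    for c, q, g in flat:
        if q not in used_q and g not in used_g:
            pairs.append((q, g))
            used_q.add(q)
            used_g.add(g)
    return pairs

def one_to_many(cost, k=3):
    """Each ground truth collects its k cheapest queries as positives,
    multiplying the number of positive samples without adding queries."""
    pairs = []
    for g in range(len(cost[0])):
        ranked = sorted(range(len(cost)), key=lambda q: cost[q][g])
        pairs += [(q, g) for q in ranked[:k]]
    return pairs

# 4 queries, 2 ground-truth tubes: one-to-one yields 2 positives,
# one-to-many with k=3 yields 6 from the same query set.
cost = [[0.1, 0.9], [0.2, 0.3], [0.8, 0.4], [0.7, 0.6]]
print(len(one_to_one(cost)))        # 2
print(len(one_to_many(cost, k=3)))  # 6
```

In DETR-style detectors the one-to-many stage is typically used only as auxiliary supervision; the one-to-one assignment is what keeps inference free of duplicate predictions.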
Pages: 11
Related Papers
50 records total
  • [41] CFI-Former: Efficient lane detection by multi-granularity perceptual query attention transformer
    Gao, Rong
    Hu, Siqi
    Yan, Lingyu
    Zhang, Lefei
    Wu, Jia
    NEURAL NETWORKS, 2025, 187
  • [42] AQSFormer: Adaptive Query Selection Transformer for Real-Time Ship Detection from Visual Images
    Yang, Wei
    Jiang, Yueqiu
    Gao, Hongwei
    Bai, Xue
    Liu, Bo
    Xia, Caifeng
    ELECTRONICS, 2024, 13 (23)
  • [43] Simple Conditional Spatial Query Mask Deformable Detection Transformer: A Detection Approach for Multi-Style Strokes of Chinese Characters
    Zhou, Tian
    Xie, Wu
    Zhang, Huimin
    Fan, Yong
    SENSORS, 2024, 24 (03)
  • [44] Video-based Human-Object Interaction Detection from Tubelet Tokens
    Tu, Danyang
    Sun, Wei
    Min, Xiongkuo
    Zhai, Guangtao
    Shen, Wei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [45] Omni-TransPose: Fusion of OmniPose and Transformer Architecture for Improving Action Detection
    Phu, Khac-Anh
    Hoang, Van-Dung
    Le, Van-Tuong-Lan
    Tran, Quang-Khai
    RECENT CHALLENGES IN INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT II, ACIIDS 2024, 2024, 2145 : 59 - 70
  • [46] Sparse landmarks for facial action unit detection using vision transformer and perceiver
    Cakir, Duygu
    Yilmaz, Gorkem
    Arica, Nafiz
    INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2024, 27 (05) : 607 - 620
  • [47] Stargazer: A Transformer-based Driver Action Detection System for Intelligent Transportation
    Liang, Junwei
    Zhu, He
    Zhang, Enwei
    Zhang, Jun
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 3159 - 3166
  • [48] Progressive Multi-Scale Vision Transformer for Facial Action Unit Detection
    Wang, Chongwen
    Wang, Zicheng
    FRONTIERS IN NEUROROBOTICS, 2022, 15 (15)
  • [49] QueryFormer: A Tree Transformer Model for Query Plan Representation
    Zhao, Yue
    Cong, Gao
    Shi, Jiachen
    Miao, Chunyan
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 15 (08): 1658 - 1670
  • [50] Hierarchical Transformer-based Query by Multiple Documents
    Huang, Zhiqi
    Naseri, Shahrzad
    Bonab, Hamed
    Sarwar, Sheikh Muhammad
    Allan, James
    PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 105 - 115