A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization

Cited by: 3
Authors
Gao, Zan [1 ,2 ]
Cui, Xinglei [1 ]
Zhuo, Tao [1 ]
Cheng, Zhiyong [1 ]
Liu, An-An [3 ]
Wang, Meng [4 ]
Chen, Shengyong [2]
Affiliations
[1] Qilu Univ Technol, Shandong Artificial Intelligence Inst, Shandong Acad Sci, Jinan 250014, Peoples R China
[2] Tianjin Univ Technol, Key Lab Comp Vis & Syst, Minist Educ, Tianjin 300384, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[4] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Feature extraction; Proposals; Location awareness; Convolution; Task analysis; Frame-level self-attention (FSA); multiple temporal scales; refined feature pyramids (RFPs); spatial-temporal transformer (STT); temporal action localization (TAL); ACTION RECOGNITION; GRANULARITY;
DOI
10.1109/THMS.2023.3266037
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action localization plays an important role in video analysis; it aims to localize and classify actions in untrimmed videos. Previous methods often predict actions on a feature space of a single temporal scale. However, low-level (fine) temporal scales lack the semantics needed for action classification, while high-level (coarse) scales cannot provide rich detail about action boundaries. In addition, the long-range dependencies among video frames are often ignored. To address these issues, a novel multitemporal-scale spatial-temporal transformer (MSST) network is proposed for temporal action localization, which predicts actions on feature spaces of multiple temporal scales. Specifically, we first use refined feature pyramids of different scales to pass semantics from high-level scales down to low-level scales. Second, to model the full temporal extent of the video, we use a spatial-temporal transformer encoder to capture the long-range dependencies among video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module that refines the classification and boundaries of each action instance. Most importantly, these three modules are jointly explored in a unified framework, and MSST has an anchor-free, end-to-end architecture. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on THUMOS14, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
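To make the pipeline in the abstract concrete, the sketch below re-expresses its three modules (refined feature pyramids, the spatial-temporal transformer encoder, and frame-level self-attention) as a minimal PyTorch model. Every name, dimension, and layer choice here (RefinedFeaturePyramid, MSSTSketch, the 512-d features, two encoder layers, and so on) is an assumption made for illustration; this record does not include the authors' implementation.

```python
# Minimal, illustrative sketch of the MSST pipeline described in the abstract.
# All module names, shapes, and hyperparameters are assumptions, not the
# authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinedFeaturePyramid(nn.Module):
    """Top-down pyramid passing high-level semantics to low-level scales (assumed design)."""

    def __init__(self, dim, num_scales):
        super().__init__()
        # Strided convolutions build progressively coarser temporal scales.
        self.downs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_scales - 1)]
        )
        # Lateral 1x1 convolutions merge each scale with upsampled high-level semantics.
        self.laterals = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=1) for _ in range(num_scales)]
        )

    def forward(self, x):  # x: (B, C, T) frame-level features
        feats = [x]
        for down in self.downs:  # bottom-up: shorter and more semantic at each level
            feats.append(down(feats[-1]))
        refined = [self.laterals[-1](feats[-1])]
        for lvl in range(len(feats) - 2, -1, -1):  # top-down refinement pass
            up = F.interpolate(refined[0], size=feats[lvl].shape[-1])
            refined.insert(0, self.laterals[lvl](feats[lvl]) + up)
        return refined  # list of (B, C, T_l), finest scale first


class MSSTSketch(nn.Module):
    """Assumed end-to-end, anchor-free pipeline: pyramid -> STT encoder -> FSA -> heads."""

    def __init__(self, dim=512, num_scales=4, num_classes=20):
        super().__init__()
        self.pyramid = RefinedFeaturePyramid(dim, num_scales)
        # A standard transformer encoder stands in for the spatial-temporal
        # transformer (STT) that captures long-range frame dependencies.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.stt = nn.TransformerEncoder(layer, num_layers=2)
        # Frame-level self-attention (FSA) refines the coarse predictions.
        self.fsa = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for background
        self.reg_head = nn.Linear(dim, 2)  # anchor-free: distances to start/end

    def forward(self, x):  # x: (B, C, T)
        outputs = []
        for feat in self.pyramid(x):  # predict at every temporal scale
            tokens = self.stt(feat.transpose(1, 2))        # (B, T_l, C)
            refined, _ = self.fsa(tokens, tokens, tokens)  # frame-level refinement
            outputs.append((self.cls_head(refined), self.reg_head(refined)))
        return outputs  # per-scale (classification, boundary) predictions


# Example: a batch of 2 clips, 512-d per-frame features, 256 temporal steps.
preds = MSSTSketch()(torch.randn(2, 512, 256))
```

The two output heads reflect the anchor-free design the abstract mentions: each frame directly regresses its distances to an action's start and end rather than matching predefined anchors, and predictions are produced at every pyramid scale.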
Pages: 569-580
Page count: 12
Related Papers
50 records in total
  • [31] Multi-Branch Spatial-Temporal Network for Action Recognition
    Wang, Yingying
    Li, Wei
    Tao, Ran
    IEEE SIGNAL PROCESSING LETTERS, 2019, 26 (10) : 1556 - 1560
  • [32] Action Recognition Using a Spatial-Temporal Network for Wild Felines
    Feng, Liqi
    Zhao, Yaqin
    Sun, Yichao
    Zhao, Wenxuan
    Tang, Jiaxi
    ANIMALS, 2021, 11 (02) : 1 - 18
  • [33] Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos
    Du, Wenbin
    Wang, Yali
    Qiao, Yu
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (03) : 1347 - 1360
  • [34] TransTL: Spatial-Temporal Localization Transformer for Multi-Label Video Classification
    Wu, Hongjun
    Li, Mengzhu
    Liu, Yongcheng
    Liu, Hongzhe
    Xu, Cheng
    Li, Xuewei
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 1965 - 1969
  • [35] Concurrent Transformer for Spatial-Temporal Graph Modeling
    Xie, Yi
    Xiong, Yun
    Zhu, Yangyong
    Yu, Philip S.
    Jin, Cheng
    Wang, Qiang
    Li, Haihong
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2022, PT III, 2022, : 314 - 321
  • [36] Modelling a Framework to Obtain Violence Detection with Spatial-Temporal Action Localization
    Monteiro, Carlos
    Duraes, Dalila
    INFORMATION SYSTEMS AND TECHNOLOGIES, WORLDCIST 2022, VOL 1, 2022, 468 : 630 - 639
  • [37] ST-HViT: Spatial-Temporal Hierarchical Vision Transformer for Action Recognition
    Xia, Limin
    Fu, Weiye
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (1)
  • [38] STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
    Zhu, Xiaoyu
    Huang, Po-Yao
    Liang, Junwei
    de Melo, Celso M.
    Hauptmann, Alexander
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 1526 - 1536
  • [39] TranSkeleton: Hierarchical Spatial-Temporal Transformer for Skeleton-Based Action Recognition
    Liu, Haowei
    Liu, Yongcheng
    Chen, Yuxin
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4137 - 4148
  • [40] Focal and Global Spatial-Temporal Transformer for Skeleton-Based Action Recognition
    Gao, Zhimin
    Wang, Peitao
    Lv, Pei
    Jiang, Xiaoheng
    Liu, Qidong
    Wang, Pichao
    Xu, Mingliang
    Li, Wanqing
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 155 - 171