A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization

Cited by: 3
Authors
Gao, Zan [1 ,2 ]
Cui, Xinglei [1 ]
Zhuo, Tao [1 ]
Cheng, Zhiyong [1 ]
Liu, An-An [3 ]
Wang, Meng [4 ]
Chen, Shenyong [2 ]
Affiliations
[1] Qilu Univ Technol, Shandong Artificial Intelligence Inst, Shandong Acad Sci, Jinan 250014, Peoples R China
[2] Tianjin Univ Technol, Key Lab Comp Vis & Syst, Minist Educ, Tianjin 300384, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[4] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Semantics; Feature extraction; Proposals; Location awareness; Convolution; Task analysis; Frame-level self-attention (FSA); multiple temporal scales; refined feature pyramids (RFPs); spatial-temporal transformer (STT); temporal action localization (TAL); ACTION RECOGNITION; GRANULARITY;
DOI
10.1109/THMS.2023.3266037
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action localization, which aims to localize and classify actions in untrimmed videos, plays an important role in video analysis. Previous methods often predict actions on a feature space of a single temporal scale. However, the temporal features of a low-level scale lack sufficient semantics for action classification, while a high-level scale cannot provide rich details of the action boundaries. In addition, the long-range dependencies among video frames are often ignored. To address these issues, a novel multitemporal-scale spatial-temporal transformer (MSST) network is proposed for temporal action localization, which predicts actions on a feature space of multiple temporal scales. Specifically, we first use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. Second, to model temporal context over the entire video, we use a spatial-temporal transformer encoder to capture the long-range dependencies among video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module that refines the classification and boundaries of each action instance. Most importantly, these three modules are jointly explored in a unified framework, and MSST has an anchor-free, end-to-end architecture. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on THUMOS14, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
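The abstract describes a three-part pipeline: refined feature pyramids that pass semantics top-down across temporal scales, a spatial-temporal transformer encoder that captures long-range dependencies, and a frame-level self-attention module that refines coarse predictions. Below is a minimal PyTorch-style sketch of that flow, not the authors' implementation: the module names, layer sizes, number of scales, class count, and toy input shape are all assumptions made for illustration, and the spatial dimension is assumed to be pre-pooled into clip-level features, so the encoder here models only temporal relations.

# Minimal sketch of the pipeline outlined above, NOT the authors' code.
# All sizes, names, and the toy forward pass below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinedFeaturePyramid(nn.Module):
    """Build a temporal pyramid and pass semantics top-down (coarse -> fine scales)."""

    def __init__(self, dim=256, num_scales=3):
        super().__init__()
        # Strided temporal convolutions produce progressively coarser scales.
        self.downs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(num_scales - 1)]
        )
        self.smooth = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(num_scales)]
        )

    def forward(self, x):  # x: (B, C, T) clip-level features
        feats = [x]
        for down in self.downs:
            feats.append(down(feats[-1]))
        # Top-down refinement: upsample coarse semantics and fuse them into finer scales.
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(feats[i + 1], size=feats[i].shape[-1], mode="linear", align_corners=False)
            feats[i] = feats[i] + up
        return [smooth(f) for smooth, f in zip(self.smooth, feats)]  # ordered fine -> coarse


class MSSTSketch(nn.Module):
    """Pyramid -> transformer encoder (long-range context) -> coarse head -> frame-level refinement."""

    def __init__(self, dim=256, num_classes=20, num_scales=3):
        super().__init__()
        self.pyramid = RefinedFeaturePyramid(dim, num_scales)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.coarse_cls = nn.Conv1d(dim, num_classes + 1, kernel_size=1)   # +1 for background
        self.frame_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.refine_cls = nn.Conv1d(dim, num_classes + 1, kernel_size=1)
        self.boundary_reg = nn.Conv1d(dim, 2, kernel_size=1)               # start/end offsets (anchor-free)

    def forward(self, x):  # x: (B, C, T), e.g. pooled two-stream or I3D clip features
        outputs = []
        for feat in self.pyramid(x):
            seq = feat.transpose(1, 2)                       # (B, T_s, C)
            seq = self.temporal_encoder(seq)                 # long-range temporal dependencies
            coarse = self.coarse_cls(seq.transpose(1, 2))    # coarse per-frame class scores
            refined, _ = self.frame_attn(seq, seq, seq)      # frame-level self-attention refinement
            refined = refined.transpose(1, 2)                # (B, C, T_s)
            outputs.append({
                "coarse_scores": coarse,
                "refined_scores": self.refine_cls(refined),
                "boundaries": self.boundary_reg(refined),
            })
        return outputs  # one prediction set per temporal scale


if __name__ == "__main__":
    model = MSSTSketch()
    clip_feats = torch.randn(2, 256, 128)   # toy batch: 2 videos, 256-dim features, 128 time steps
    preds = model(clip_feats)
    print([p["refined_scores"].shape for p in preds])   # scales of length 128, 64, and 32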
Pages: 569-580
Number of pages: 12
Related Papers
50 records in total
  • [1] Spatial-temporal Graph Transformer Network for Spatial-temporal Forecasting
    Dao, Minh-Son
    Zetsu, Koji
    Hoang, Duy-Tang
    [J]. Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024, 2024, : 1276 - 1281
  • [2] Fast Spatial-Temporal Transformer Network
    Escher, Rafael Molossi
    de Bem, Rodrigo Andrade
    Jorge Drews Jr, Paulo Lilles
    [J]. 2021 34TH SIBGRAPI CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI 2021), 2021, : 65 - 72
  • [3] Spatial-temporal graph transformer network for skeleton-based temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Zhang, Zhao
    Liu, Peng
    Tang, Xianglong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (15) : 44273 - 44297
  • [5] Spatial-Temporal Transformer Network for Continuous Action Recognition in Industrial Assembly
    Huang, Jianfeng
    Liu, Xiang
    Hu, Huan
    Tang, Shanghua
    Li, Chenyang
    Zhao, Shaoan
    Lin, Yimin
    Wang, Kai
    Liu, Zhaoxiang
    Lian, Shiguo
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 114 - 130
  • [7] STAN: Spatial-Temporal Awareness Network for Temporal Action Detection
    Liu, Minghao
    Liu, Haiyi
    Zhao, Sirui
    Ma, Fei
    Li, Minglei
    Dai, Zonghong
    Wang, Hao
    Xu, Tong
    Chen, Enhong
    [J]. PROCEEDINGS OF THE 6TH INTERNATIONAL WORKSHOP ON MULTIMEDIA CONTENT ANALYSIS IN SPORTS, MMSPORTS 2023, 2023, : 161 - 165
  • [8] Multi-Scale Spatial-Temporal Transformer: A Novel Framework for Spatial-Temporal Edge Data Prediction
    Ming, Junhao
    Zhang, Dongmei
    Han, Wei
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (17):
  • [9] Graph Spatial-Temporal Transformer Network for Traffic Prediction
    Zhao, Zhenzhen
    Shen, Guojiang
    Wang, Lei
    Kong, Xiangjie
    [J]. BIG DATA RESEARCH, 2024, 36
  • [10] Hierarchy Spatial-Temporal Transformer for Action Recognition in Short Videos
    Cai, Guoyong
    Cai, Yumeng
    [J]. FUZZY SYSTEMS AND DATA MINING VI, 2020, 331 : 760 - 774