An Adaptive Dual Selective Transformer for Temporal Action Localization

Cited by: 1
Authors
Li, Qiang [1 ]
Zu, Guang [1 ]
Xu, Hui [1 ]
Kong, Jun [1 ,2 ]
Zhang, Yanni [1 ]
Wang, Jianzhong [1 ]
Affiliations
[1] Northeast Normal Univ, Sch Informat Sci & Technol, Changchun 130117, Peoples R China
[2] Northeast Normal Univ, Key Lab Appl Stat MOE, Changchun 130117, Peoples R China
Keywords
Temporal action localization; action recognition; vision transformers; video understanding
DOI
10.1109/TMM.2024.3367599
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Temporal action localization (TAL), which aims to identify and localize actions in long untrimmed videos, is a challenging task in video understanding. Recent studies have shown that the Transformer and its variants are effective at improving TAL performance. The success of the Transformer can be attributed to its use of multi-head self-attention (MHSA) as a token mixer to capture long-term temporal dependencies within the video sequence. However, in the existing Transformer architecture, the features produced by the multiple token mixing (i.e., self-attention) heads are treated equally, which neglects the distinct characteristics of different heads and hampers the exploitation of discriminative information. To this end, we present a new method, the adaptive dual selective Transformer (ADSFormer), for TAL. The key component of ADSFormer is the dual selective multi-head token mixer (DSMHTM), which integrates the feature representations from different token mixing heads by adaptively selecting important features across both the head and channel dimensions. Moreover, we incorporate ADSFormer into a pyramid structure so that the resulting multi-scale features can be effectively combined to improve TAL performance. Benefiting from the DSMHTM and the pyramid feature combination, ADSFormer outperforms several state-of-the-art methods on four challenging benchmark datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, and ActivityNet-1.3.
Pages: 7398-7412
Page count: 15
Related Papers
50 records in total
  • [1] Temporal Deformable Transformer for Action Localization
    Wang, Haoying
    Wei, Ping
    Liu, Meiqin
    Zheng, Nanning
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VI, 2023, 14259 : 563 - 575
  • [3] Dual relation network for temporal action localization
    Xia, Kun
    Wang, Le
    Zhou, Sanping
    Hua, Gang
    Tang, Wei
    [J]. PATTERN RECOGNITION, 2022, 129
  • [4] Cross Time-Frequency Transformer for Temporal Action Localization
    Yang, Jin
    Wei, Ping
    Zheng, Nanning
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (06) : 4625 - 4638
  • [5] Multi-granularity transformer fusion for temporal action localization
    Zhang, Min
    Hu, Haiyang
    Li, Zhongjin
    [J]. SOFT COMPUTING, 2024, 28 (20) : 12377 - 12388
  • [6] TALLFormer: Temporal Action Localization with a Long-Memory Transformer
    Cheng, Feng
    Bertasius, Gedas
    [J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 503 - 521
  • [7] A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization
    Gao, Zan
    Cui, Xinglei
    Zhuo, Tao
    Cheng, Zhiyong
    Liu, An-An
    Wang, Meng
    Chen, Shenyong
    [J]. IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, 2023, 53 (03) : 569 - 580
  • [8] Gated Multi-Scale Transformer for Temporal Action Localization
    Yang, Jin
    Wei, Ping
    Ren, Ziyang
    Zheng, Nanning
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 5705 - 5717
  • [9] W-ART: ACTION RELATION TRANSFORMER FOR WEAKLY-SUPERVISED TEMPORAL ACTION LOCALIZATION
    Li, Mengzhu
    Wu, Hongjun
    Liu, Yongcheng
    Liu, Hongzhe
    Xu, Cheng
    Li, Xuewei
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 2195 - 2199
  • [10] Actionness-Guided Transformer for Anchor-Free Temporal Action Localization
    Zhao, Peisen
    Xie, Lingxi
    Zhang, Ya
    Tian, Qi
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 194 - 198