Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition

Cited by: 2
Authors
Sun, Baoli [1]
Ye, Xinchen [1]
Wang, Zhihui [1]
Li, Haojie [2]
Wang, Zhiyong [3]
Affiliations
[1] Dalian Univ Technol, Int Sch Informat Sci & Engn, Dalian, Peoples R China
[2] Shandong Univ Sci & Technol, Coll Comp Sci & Engn, Qingdao, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Fine-grained; action recognition; token localization and interaction; vision transformer;
DOI
10.1145/3581783.3612206
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision transformers have achieved impressive performance in video action recognition thanks to their strong capability for modeling long-range dependencies among spatio-temporal tokens. For fine-grained actions, however, the subtle and discriminative differences lie mainly in the actor regions, so directly applying vision transformers without removing irrelevant tokens compromises recognition performance and incurs high computational cost. In this paper, we propose a coarse-to-fine action token localization and interaction network, namely C2F-ALIN, that dynamically localizes the most informative tokens at a coarse granularity and then partitions these localized tokens into a fine granularity for sufficient fine-grained spatio-temporal interaction. Specifically, in the coarse stage, we devise a discriminative token localization module that accurately identifies informative tokens and discards irrelevant ones; each localized token corresponds to a large spatial region, which effectively preserves the continuity of action regions. In the fine stage, we further partition only the localized tokens obtained in the coarse stage into a finer granularity and then characterize fine-grained token interactions in two ways: (1) using vanilla transformers to learn compact dependencies among all discriminative tokens; and (2) proposing a global contextual interaction module that enables each fine-grained token to communicate with all the spatio-temporal tokens and to embed the global context. As a result, our coarse-to-fine strategy is able to identify more relevant tokens and integrate global context for high recognition accuracy while maintaining high efficiency. Comprehensive experimental results on four widely used action recognition benchmarks, namely FineGym, Diving48, Kinetics, and Something-Something, clearly demonstrate the advantages of our proposed method over other state-of-the-art methods.
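The coarse-to-fine idea in the abstract can be illustrated with a small sketch. It is not the C2F-ALIN implementation: the abstract does not say how tokens are scored or selected, so the per-token scores below are made-up inputs, the top-k selection rule is an assumption, and the function names `select_coarse_tokens` and `partition_to_fine` and the 2x2 split factor are hypothetical. The sketch only shows why partitioning the kept coarse tokens, rather than the whole grid, reduces the number of fine tokens that later layers must process.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # (frame, row, col) on a token grid


def select_coarse_tokens(scores: List[List[List[float]]], keep: int) -> List[Coord]:
    """Return coordinates of the `keep` highest-scoring coarse tokens.

    Assumption: a plain top-k rule stands in for the paper's
    discriminative token localization module.
    """
    flat = [
        (score, (t, r, c))
        for t, frame in enumerate(scores)
        for r, row in enumerate(frame)
        for c, score in enumerate(row)
    ]
    # Highest score first; ties are broken by coordinate for determinism.
    flat.sort(key=lambda item: (-item[0], item[1]))
    return sorted(coord for _, coord in flat[:keep])


def partition_to_fine(coarse: List[Coord], factor: int = 2) -> List[Coord]:
    """Split each kept coarse token into factor x factor fine tokens."""
    fine: List[Coord] = []
    for t, r, c in coarse:
        for dr in range(factor):
            for dc in range(factor):
                fine.append((t, r * factor + dr, c * factor + dc))
    return fine


# Hypothetical scores for 2 frames on a 2x2 coarse grid.
scores = [
    [[0.1, 0.9], [0.2, 0.3]],
    [[0.8, 0.1], [0.05, 0.4]],
]
kept = select_coarse_tokens(scores, keep=2)
fine = partition_to_fine(kept, factor=2)
total_fine = 2 * 4 * 4  # fine tokens if the whole grid were partitioned
print(kept)
print(len(fine), "of", total_fine, "fine tokens retained")
```

In this toy setting only the fine tokens under the two kept coarse cells go on to the interaction stage; the global contextual interaction described in the abstract would then let those tokens attend to the full token set, which this sketch does not model.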
Pages: 5070-5078
Number of pages: 9