Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition

Cited by: 2
Authors
Sun, Baoli [1]
Ye, Xinchen [1]
Wang, Zhihui [1]
Li, Haojie [2]
Wang, Zhiyong [3]
Affiliations
[1] Dalian Univ Technol, Int Sch Informat Sci & Engn, Dalian, Peoples R China
[2] Shandong Univ Sci & Technol, Coll Comp Sci & Engn, Qingdao, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
Funding
Australian Research Council; National Natural Science Foundation of China
Keywords
Fine-grained; action recognition; token localization and interaction; vision transformer;
DOI
10.1145/3581783.3612206
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision transformers have achieved impressive performance in video action recognition owing to their strong capability for modeling long-range dependencies among spatio-temporal tokens. However, for fine-grained actions, the subtle discriminative differences lie mainly in the actor regions, so directly applying vision transformers without removing irrelevant tokens compromises recognition performance and incurs high computational cost. In this paper, we propose a coarse-to-fine action token localization and interaction network, namely C2F-ALIN, that dynamically localizes the most informative tokens at a coarse granularity and then partitions these localized tokens to a fine granularity for sufficient fine-grained spatio-temporal interaction. Specifically, in the coarse stage, we devise a discriminative token localization module to accurately identify informative tokens and discard irrelevant ones; each localized token corresponds to a large spatial region, which effectively preserves the continuity of action regions. In the fine stage, we further partition only the localized tokens obtained in the coarse stage into a finer granularity and then characterize fine-grained token interactions in two ways: (1) using vanilla transformers to learn compact dependencies among all discriminative tokens; and (2) proposing a global contextual interaction module that enables each fine-grained token to communicate with all the spatio-temporal tokens and to embed the global context. As a result, our coarse-to-fine strategy identifies more relevant tokens and integrates global context for high recognition accuracy while maintaining high efficiency. Comprehensive experimental results on four widely used action recognition benchmarks, FineGym, Diving48, Kinetics and Something-Something, clearly demonstrate the advantages of our proposed method over other state-of-the-art methods.
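The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' C2F-ALIN implementation: the per-token scores below are made-up numbers standing in for the output of a learned localization module, and the function names `select_coarse_tokens` and `partition_to_fine`, the `keep` budget, and the 2x2 partition `factor` are all hypothetical choices made for illustration.

```python
def select_coarse_tokens(scores, keep):
    """Return sorted (row, col) positions of the `keep` highest-scoring coarse tokens."""
    positions = [(r, c) for r, row in enumerate(scores) for c in range(len(row))]
    ranked = sorted(positions, key=lambda p: scores[p[0]][p[1]], reverse=True)
    return sorted(ranked[:keep])


def partition_to_fine(coarse_positions, factor):
    """Split each kept coarse cell into factor x factor cells on the fine grid."""
    fine = []
    for r, c in coarse_positions:
        for dr in range(factor):
            for dc in range(factor):
                fine.append((r * factor + dr, c * factor + dc))
    return fine


# Toy 3x3 coarse grid for one frame (assumed scores; higher = more informative).
scores = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.8],
    [0.1, 0.7, 0.2],
]

kept = select_coarse_tokens(scores, keep=3)      # coarse stage: localize
fine_tokens = partition_to_fine(kept, factor=2)  # fine stage: partition kept cells only
dense_fine_count = (3 * 2) * (3 * 2)             # fine tokens if nothing were discarded

print(kept)                                       # [(1, 1), (1, 2), (2, 1)]
print(len(fine_tokens), "of", dense_fine_count)   # 12 of 36
```

In this toy setting only 12 of the 36 possible fine tokens would be passed on to the interaction stage, which is the source of the efficiency gain the abstract describes. Because selection happens on large coarse cells, each kept region stays spatially contiguous after partitioning.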
Pages: 5070-5078
Page count: 9
Related papers
50 in total
  • [21] Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition
    Hong, James
    Fisher, Matthew
    Gharbi, Michael
    Fatahalian, Kayvon
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 9234 - 9243
  • [22] Coarse-to-Fine Grained Classification
    Huo, Yuqi
    Lu, Yao
    Niu, Yulei
    Lu, Zhiwu
    Wen, Ji-Rong
    [J]. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 1033 - 1036
  • [23] FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
    Shao, Dian
    Zhao, Yue
    Dai, Bo
    Lin, Dahua
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 2613 - 2622
  • [24] Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion
    Lin, Weiyao
    Mi, Yang
    Wu, Jianxin
    Lu, Ke
    Xiong, Hongkai
    [J]. THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 7130 - 7137
  • [25] Fine-Grained Crowdsourcing for Fine-Grained Recognition
    Jia Deng
    Krause, Jonathan
    Li Fei-Fei
    [J]. 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, : 580 - 587
  • [26] Hand Detection and Tracking in Videos for Fine-Grained Action Recognition
    Do, Nga H.
    Yanai, Keiji
    [J]. COMPUTER VISION - ACCV 2014 WORKSHOPS, PT I, 2015, 9008 : 19 - 34
  • [27] Periodic-Aware Network for Fine-Grained Action Recognition
    Luo, Senzi
    Xiao, Jiayin
    Li, Dong
    Jian, Muwei
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VIII, 2024, 14432 : 105 - 117
  • [28] Pipelining Localized Semantic Features for Fine-Grained Action Recognition
    Zhou, Yang
    Ni, Bingbing
    Yan, Shuicheng
    Moulin, Pierre
    Tian, Qi
    [J]. COMPUTER VISION - ECCV 2014, PT IV, 2014, 8692 : 481 - 496
  • [29] Dual Temporal Transformers for Fine-Grained Dangerous Action Recognition
    Song, Wenfeng
    Jin, Xingliang
    Ding, Yang
    Gao, Yang
    Hou, Xia
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 415 - 419
  • [30] Action and Crime - A Fine-Grained Approach
    GOLDMAN, AI
    [J]. UNIVERSITY OF PENNSYLVANIA LAW REVIEW, 1994, 142 (05) : 1563 - 1586