Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Cited by: 7
Authors
Lee, Pilhyeon [1]
Kim, Taeoh [2]
Shim, Minho [2]
Wee, Dongyoon [2]
Byun, Hyeran [1]
Affiliations
[1] Yonsei Univ, Seoul, South Korea
[2] Naver Cloud, AI Tech, Seongnam, South Korea
Funding
National Research Foundation, Singapore;
Keywords
DOI
10.1109/CVPR52729.2023.00235
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Temporal action detection aims to predict the time intervals and the classes of action instances in the video. Despite the promising performance, existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while preserving RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features that is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
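The dual-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bias-free linear branches, the mean-squared-error distillation term, and the plain concatenation (standing in for the paper's local attentive fusion) are all assumptions made for illustration. The only point it shows is the asymmetry: the distillation loss touches the motion branch alone, so the RGB branch is left untouched by the flow teacher.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weights: Matrix) -> Vector:
    """Bias-free linear map; each row of `weights` produces one output value."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decomposed_forward(
    rgb_input: Vector,
    w_rgb: Matrix,           # hypothetical RGB-branch weights
    w_motion: Matrix,        # hypothetical motion-branch weights
    teacher_motion: Vector,  # features from an optical-flow teacher (training only)
) -> Tuple[Vector, float]:
    """Return (fused features, distillation loss) for one RGB feature vector."""
    # Both branches read the same RGB input; no optical flow is needed here.
    rgb_feat = linear(rgb_input, w_rgb)
    motion_feat = linear(rgb_input, w_motion)

    # Asymmetric objective (assumed form): only the motion branch is pulled
    # toward the teacher, so the RGB branch is not distorted by distillation.
    distill_loss = mse(motion_feat, teacher_motion)

    # Stand-in fusion: plain concatenation, NOT the paper's local attentive fusion.
    fused = rgb_feat + motion_feat
    return fused, distill_loss


fused, loss = decomposed_forward(
    rgb_input=[1.0, 2.0],
    w_rgb=[[1.0, 0.0], [0.0, 1.0]],
    w_motion=[[0.5, 0.5]],
    teacher_motion=[1.0],
)
```

In a real detector the fused features would feed a detection head, and the teacher would be used during training only, which is what removes optical flow from inference.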
Pages: 2373-2383
Number of pages: 11
Related papers
50 in total
  • [21] RGB-D salient object detection with asymmetric cross-modal fusion
    Yu M.
    Xing Z.-H.
    Liu Y.
    Kongzhi yu Juece/Control and Decision, 2023, 38 (09): : 2487 - 2495
  • [22] Cross-modal detection using various temporal and spatial configurations
    James A. Schirillo
    Attention, Perception, & Psychophysics, 2011, 73 : 237 - 246
  • [23] Object-centric Cross-modal Feature Distillation for Event-based Object Detection
    Li, Lei
    Linger, Alexander
    Millhausler, Mario
    Tsiminaki, Vagia
    Li, Yuanyou
    Dai, Dengxin
    2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2024), 2024, : 15440 - 15447
  • [24] Cross-modal detection using various temporal and spatial configurations
    Schirillo, James A.
    ATTENTION PERCEPTION & PSYCHOPHYSICS, 2011, 73 (01) : 237 - 246
  • [25] STXD: Structural and Temporal Cross-Modal Distillation for Multi-View 3D Object Detection
    Jang, Sujin
    Jo, Dae Ung
    Hwang, Sung Ju
    Lee, Dongwook
    Ji, Daehyun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [26] Cross-modal distillation for flood extent mapping
    Garg, Shubhika
    Feinstein, Ben
    Timnat, Shahar
    Batchu, Vishal
    Dror, Gideon
    Rosenthal, Adi Gerzi
    Gulshan, Varun
    ENVIRONMENTAL DATA SCIENCE, 2023, 2
  • [27] Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization
    Hong, Fa-Ting
    Feng, Jia-Chang
    Xu, Dan
    Shan, Ying
    Zheng, Wei-Shi
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1591 - 1599
  • [28] RGB-D Saliency Detection based on Cross-Modal and Multi-scale Feature Fusion
    Zhu, Xuxing
    Wu, Jin
    Zhu, Lei
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2022, : 6154 - 6160
  • [29] Transformer-Based Cross-Modal Integration Network for RGB-T Salient Object Detection
    Lv, Chengtao
    Zhou, Xiaofei
    Wan, Bin
    Wang, Shuai
    Sun, Yaoqi
    Zhang, Jiyong
    Yan, Chenggang
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2024, 70 (02) : 4741 - 4755
  • [30] RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness
    Sun F.-M.
    Hu X.-H.
    Wu J.-Y.
    Sun J.
    Wang F.-S.
    Ruan Jian Xue Bao/Journal of Software, 2024, 35 (04): : 1899 - 1913