Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Cited by: 7
Authors
Lee, Pilhyeon [1]
Kim, Taeoh [2]
Shim, Minho [2]
Wee, Dongyoon [2]
Byun, Hyeran [1]
Affiliations
[1] Yonsei Univ, Seoul, South Korea
[2] Naver Cloud, AI Tech, Seongnam, South Korea
Funding
National Research Foundation, Singapore
Keywords
DOI
10.1109/CVPR52729.2023.00235
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Temporal action detection aims to predict the time intervals and classes of action instances in a video. Despite their promising performance, existing two-stream models suffer from slow inference because they rely on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework that builds a strong RGB-based detector by transferring knowledge from the motion modality. Specifically, instead of distilling directly, we propose to learn RGB and motion representations separately and then combine them to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while keeping RGB information intact. In addition, we introduce a local attentive fusion to better exploit multimodal complementarity. It is designed to preserve the local discriminability of the features, which is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
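The two ideas named in the abstract, a motion branch trained to imitate flow features and a fusion step restricted to a local temporal window, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-squared-error objective, the window size, and the residual addition are all assumptions chosen for illustration, and the record above does not specify any of them.

```python
import math


def distill_loss(student, teacher):
    """Mean squared error between student motion features and teacher features.

    Assumption: the loss is applied to the motion branch only, which is one
    way to read "asymmetric training objectives" (the RGB branch is untouched).
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(student, teacher):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


def local_attentive_fusion(rgb, motion, window=1):
    """Fuse each RGB feature with motion features from a local temporal window.

    Hypothetical design: weights are a softmax over scaled dot products between
    rgb[t] and motion[t - window .. t + window], so every time step attends
    only to its temporal neighbours. The attended motion context is added to
    the RGB feature as a residual.
    """
    num_steps, dim = len(rgb), len(rgb[0])
    fused = []
    for t in range(num_steps):
        lo, hi = max(0, t - window), min(num_steps, t + window + 1)
        scores = [
            sum(r * m for r, m in zip(rgb[t], motion[k])) / math.sqrt(dim)
            for k in range(lo, hi)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        context = [
            sum(w * motion[lo + i][d] for i, w in enumerate(weights))
            for d in range(dim)
        ]
        fused.append([r + c for r, c in zip(rgb[t], context)])
    return fused
```

In this sketch the teacher features would come from an optical-flow model at training time only, so inference needs RGB input alone, which matches the motivation stated in the abstract.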
Pages: 2373-2383
Number of pages: 11