Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation

Cited: 0
Authors
Ghosh, Sayontan [1 ]
Aggarwal, Tanvi [1 ]
Hoai, Minh [1 ]
Balasubramanian, Niranjan [1 ]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treats this as a vision-modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art result.
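The record does not spell out the paper's exact objective, but the "simple distillation technique" it describes follows the standard knowledge-distillation pattern: a text-based teacher produces soft action distributions that guide the vision student. The sketch below is a generic, illustrative version of such a loss (temperature-softened KL divergence in pure Python); the function names and the choice of temperature are assumptions, not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of raw scores.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is conventional in knowledge distillation.
    p = softmax(teacher_logits, temperature)  # soft targets: text teacher
    q = softmax(student_logits, temperature)  # predictions: vision student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In a full training setup this term would typically be combined with the usual cross-entropy loss on ground-truth next-action labels, with a weighting coefficient balancing the two.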
Pages: 1882-1897
Page count: 16