Listen to Look: Action Recognition by Previewing Audio

Cited by: 132
Authors
Gao, Ruohan [1, 2]
Oh, Tae-Hyun [2, 3]
Grauman, Kristen [1, 2]
Torresani, Lorenzo [2]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Facebook AI Res, Austin, TX 78701 USA
[3] POSTECH, Dept EE, Pohang, South Korea
DOI
10.1109/CVPR42600.2020.01047
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an IMGAUD2VID framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on IMGAUD2VID, we further propose IMGAUD-SKIMMING, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state of the art in terms of both recognition accuracy and speed.
Pages: 10454-10464
Page count: 11