Listen to Look: Action Recognition by Previewing Audio

Cited by: 132

Authors:

Gao, Ruohan [1,2]

Oh, Tae-Hyun [2,3]

Grauman, Kristen [1,2]

Torresani, Lorenzo [2]

Affiliations:

[1] Univ Texas Austin, Austin, TX 78712 USA

[2] Facebook AI Res, Austin, TX 78701 USA

[3] POSTECH, Dept EE, Pohang, South Korea

Source:

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2020

Keywords:

DOI:

10.1109/CVPR42600.2020.01047

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an IMGAUD2VID framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on IMGAUD2VID, we further propose IMGAUD-SKIMMING, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state-of-the-art in terms of both recognition accuracy and speed.

Citation

Pages: 10454-10464

Page count: 11

