AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Cited by: 10

Authors:

Rouditchenko, Andrew [1]

Boggust, Angie [1]

Harwath, David [2]

Chen, Brian [3]

Joshi, Dhiraj [4]

Thomas, Samuel [4]

Audhkhasi, Kartik [5]

Kuehne, Hilde [4]

Panda, Rameswar [4]

Feris, Rogerio [4]

Kingsbury, Brian [4]

Picheny, Michael [6]

Torralba, Antonio [1]

Glass, James [1]

Affiliations:

[1] MIT CSAIL, Cambridge, MA 02139 USA

[2] UT Austin, Austin, TX USA

[3] Columbia Univ, New York, NY 10027 USA

[4] IBM Res AI, Yorktown Hts, NY USA

[5] Google, Mountain View, CA 94043 USA

[6] NYU, New York, NY 10003 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

audio-visual; multimodal learning; self-supervised learning; video retrieval; spoken captions

DOI:

10.21437/Interspeech.2021-1312

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104 ; 100213 ;

Abstract:

Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. Finally, we analyze AVLnet's learned representations, showing that our model utilizes speech and natural sounds to learn audio-visual concepts.

Citation

Pages: 1584-1588

Page count: 5

50 items in total

[1] Cascaded Multilingual Audio-Visual Learning from Videos
Rouditchenko, Andrew
Boggust, Angie
Harwath, David
Thomas, Samuel
Kuehne, Hilde
Chen, Brian
Panda, Rameswar
Feris, Rogerio
Kingsbury, Brian
Picheny, Michael
Glass, James
[J]. INTERSPEECH 2021, 2021, : 3006 - 3010
[2] Learning Representations from Audio-Visual Spatial Alignment
Morgado, Pedro
Li, Yi
Vasconcelos, Nuno
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
[3] Learning Contextually Fused Audio-Visual Representations for Audio-Visual Speech Recognition
Zhang, Zi-Qiang
Zhang, Jie
Zhang, Jian-Shu
Wu, Ming-Hui
Fang, Xin
Dai, Li-Rong
[J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
[4] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Monfort, Mathew
Jin, SouYoung
Liu, Alexander
Harwath, David
Feris, Rogerio
Glass, James
Oliva, Aude
[J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 14866 - 14876
[5] SpeechIndexer: A Flexible Software for Audio-Visual Language Learning
Glavitsch, Ulrike
Simon, Klaus
Szakos, Jozsef
[J]. ICEIC 2011/ IRE&PS 2011: INTERNATIONAL CONFERENCE ON EDUCATION, INFORMATICS, AND CYBERNETICS/ INTERNATIONAL SYMPOSIUM ON INTEGRATING RESEARCH, EDUCATION, AND PROBLEM SOLVING, 2011, : 79 - 82
[6] Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
Wang, Shanshan
Politis, Archontis
Mesaros, Annamaria
Virtanen, Tuomas
[J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1467 - 1479
[7] Audio-Visual Event Localization in Unconstrained Videos
Tian, Yapeng
Shi, Jing
Li, Bochen
Duan, Zhiyao
Xu, Chenliang
[J]. COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
[8] Support system for making audio-visual material for learning language
Tobe, Yuichi
Fujita, Shinichi
Hosaka, Toshiko
[J]. 2006 7TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY BASED HIGHER EDUCATION AND TRAINING, VOLS 1 AND 2, 2006, : 199 - 202
[9] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
Krishnamurthy, Sudha
[J]. ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
[10] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
Ma, Fei
Zhang, Wei
Li, Yang
Huang, Shao-Lun
Zhang, Lin
[J]. APPLIED SCIENCES-BASEL, 2020, 10 (20) : 1 - 23
