AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Cited by: 10
Authors
Rouditchenko, Andrew [1 ]
Boggust, Angie [1 ]
Harwath, David [2 ]
Chen, Brian [3 ]
Joshi, Dhiraj [4 ]
Thomas, Samuel [4 ]
Audhkhasi, Kartik [5 ]
Kuehne, Hilde [4 ]
Panda, Rameswar [4 ]
Feris, Rogerio [4 ]
Kingsbury, Brian [4 ]
Picheny, Michael [6 ]
Torralba, Antonio [1 ]
Glass, James [1 ]
Affiliations
[1] MIT CSAIL, Cambridge, MA 02139 USA
[2] UT Austin, Austin, TX USA
[3] Columbia Univ, New York, NY 10027 USA
[4] IBM Res AI, Yorktown Hts, NY USA
[5] Google, Mountain View, CA 94043 USA
[6] NYU, New York, NY 10003 USA
Source
Keywords
audio-visual; multimodal learning; self-supervised learning; video retrieval; spoken captions
DOI
10.21437/Interspeech.2021-1312
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. Finally, we analyze AVLnet's learned representations, showing that our model utilizes speech and natural sounds to learn audio-visual concepts.
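The abstract mentions retrieval in a shared audio-visual embedding space but does not state the similarity function or metric in this record. As an illustration only, the sketch below assumes cosine similarity and Recall@K, with audio clip i paired with video clip i. The function names and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity_matrix(audio_embs, video_embs):
    """sim[i][j] = cosine similarity between audio clip i and video clip j."""
    a = [l2_normalize(v) for v in audio_embs]
    b = [l2_normalize(v) for v in video_embs]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def recall_at_k(sim, k):
    """Fraction of audio queries whose paired video (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy 2-D embeddings (assumed values, for illustration only).
audio = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
video = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

sim = cosine_similarity_matrix(audio, video)
print("R@1:", recall_at_k(sim, 1))
print("R@2:", recall_at_k(sim, 2))
```

In this toy setup the third audio query sits closer to the first video than to its own pair, so it counts as a miss at K=1 and a hit at K=2.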
Pages: 1584-1588
Page count: 5
Related papers
50 in total
  • [1] Cascaded Multilingual Audio-Visual Learning from Videos
    Rouditchenko, Andrew
    Boggust, Angie
    Harwath, David
    Thomas, Samuel
    Kuehne, Hilde
    Chen, Brian
    Panda, Rameswar
    Feris, Rogerio
    Kingsbury, Brian
    Picheny, Michael
    Glass, James
    [J]. INTERSPEECH 2021, 2021: 3006-3010
  • [2] Learning Representations from Audio-Visual Spatial Alignment
    Morgado, Pedro
    Li, Yi
    Vasconcelos, Nuno
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [3] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022: 1346-1350
  • [4] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
    Monfort, Mathew
    Jin, SouYoung
    Liu, Alexander
    Harwath, David
    Feris, Rogerio
    Glass, James
    Oliva, Aude
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 14866-14876
  • [5] SpeechIndexer: A Flexible Software for Audio-Visual Language Learning
    Glavitsch, Ulrike
    Simon, Klaus
    Szakos, Jozsef
    [J]. ICEIC 2011/IRE&PS 2011: INTERNATIONAL CONFERENCE ON EDUCATION, INFORMATICS, AND CYBERNETICS/INTERNATIONAL SYMPOSIUM ON INTEGRATING RESEARCH, EDUCATION, AND PROBLEM SOLVING, 2011: 79-82
  • [6] Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
    Wang, Shanshan
    Politis, Archontis
    Mesaros, Annamaria
    Virtanen, Tuomas
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06): 1467-1479
  • [7] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    [J]. COMPUTER VISION - ECCV 2018, PT II, 2018, 11206: 252-268
  • [8] Support system for making audio-visual material for learning language
    Tobe, Yuichi
    Fujita, Shinichi
    Hosaka, Toshiko
    [J]. 2006 7TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY BASED HIGHER EDUCATION AND TRAINING, VOLS 1 AND 2, 2006: 199-202
  • [9] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
    Krishnamurthy, Sudha
    [J]. ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018: 124-138
  • [10] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
    Ma, Fei
    Zhang, Wei
    Li, Yang
    Huang, Shao-Lun
    Zhang, Lin
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (20): 1-23