AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Cited by: 10
Authors
Rouditchenko, Andrew [1 ]
Boggust, Angie [1 ]
Harwath, David [2 ]
Chen, Brian [3 ]
Joshi, Dhiraj [4 ]
Thomas, Samuel [4 ]
Audhkhasi, Kartik [5 ]
Kuehne, Hilde [4 ]
Panda, Rameswar [4 ]
Feris, Rogerio [4 ]
Kingsbury, Brian [4 ]
Picheny, Michael [6 ]
Torralba, Antonio [1 ]
Glass, James [1 ]
Affiliations
[1] MIT CSAIL, Cambridge, MA 02139 USA
[2] UT Austin, Austin, TX USA
[3] Columbia Univ, New York, NY 10027 USA
[4] IBM Res AI, Yorktown Hts, NY USA
[5] Google, Mountain View, CA 94043 USA
[6] NYU, New York, NY 10003 USA
Source
INTERSPEECH 2021
Keywords
audio-visual; multimodal learning; self-supervised learning; video retrieval; spoken captions
DOI
10.21437/Interspeech.2021-1312
CLC classification numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104 ; 100213 ;
Abstract
Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. Finally, we perform analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts.
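The shared audio-visual embedding space described above is typically trained with a contrastive objective that pulls each clip's audio and visual embeddings together while pushing apart embeddings from different clips. As an illustrative sketch only (AVLnet's actual objective and encoder architectures differ; all names and dimensions here are hypothetical), a generic symmetric batch-contrastive loss over paired clip embeddings looks like this:

```python
import numpy as np

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clip embeddings.

    Row i of audio_emb and row i of video_emb come from the same video
    clip; all other rows in the batch serve as negatives.
    """
    # L2-normalize each embedding so similarity is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix; the diagonal holds matched pairs
    sim = a @ v.T / temperature

    def nll_diag(logits):
        # Numerically stable log-softmax over each row, then the
        # negative log-probability of the matched (diagonal) entry
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the audio->video and video->audio retrieval directions
    return 0.5 * (nll_diag(sim) + nll_diag(sim.T))

# Toy usage: a batch of 8 clips embedded in a 256-dim shared space
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 256))
video = rng.standard_normal((8, 256))
loss = contrastive_loss(audio, video)
print(float(loss))
```

With random (untrained) embeddings the loss sits near log(batch size); training the two encoders to minimize it is what aligns speech and natural sounds with the co-occurring visual frames, enabling the retrieval evaluations reported above.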
Pages: 1584-1588
Number of pages: 5