Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Cited by: 16
Authors
Monfort, Mathew [1 ]
Jin, SouYoung [1 ]
Liu, Alexander [1 ]
Harwath, David [2 ]
Feris, Rogerio [3 ]
Glass, James [1 ]
Oliva, Aude [1 ]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] UT Austin, Austin, TX USA
[3] IBM Res, Yorktown Hts, NY USA
DOI
10.1109/CVPR46437.2021.01463
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who, and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g., actions, objects, scenes, sentiment) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale to the size of a large classification dataset. In order to utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets. http://moments.csail.mit.edu/spoken.html
Pages: 14866-14876 (11 pages)
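
The abstract names an Adaptive Mean Margin (AMM) objective for cross-modal contrastive learning but does not spell out its form. Below is a minimal illustrative sketch of one plausible reading: a symmetric retrieval loss whose margin on the positive pair adapts to the mean similarity of the in-batch negatives. The function name, the base_margin parameter, the detach, and the exact way the margin enters the softmax are assumptions made for illustration, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_mean_margin_loss(video_emb, caption_emb, base_margin=0.001):
    """Cross-modal contrastive loss with a margin adapted to the mean
    similarity of in-batch negatives (illustrative sketch only; not the
    paper's exact AMM formulation).

    video_emb, caption_emb: (B, D) L2-normalized embeddings, where row i
    of each tensor forms a matching video/caption pair.
    """
    sim = video_emb @ caption_emb.t()                  # (B, B) cosine similarities
    b = sim.size(0)
    pos_mask = torch.eye(b, dtype=torch.bool, device=sim.device)

    # Mean similarity over the off-diagonal (negative) pairs.
    neg_mean = sim.masked_fill(pos_mask, 0.0).sum() / (b * (b - 1))

    # Adaptive margin: a small fixed margin plus the batch's mean negative
    # similarity, detached so no gradient flows through the margin itself
    # (a design assumption in this sketch).
    margin = base_margin + neg_mean.detach()

    # Subtract the margin from the positive logits only, making the
    # positives work harder than the average negative.
    logits = sim - margin * pos_mask.float()
    targets = torch.arange(b, device=sim.device)

    # Symmetric InfoNCE-style loss over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In use, one would normalize both towers first, e.g. `loss = adaptive_mean_margin_loss(F.normalize(v, dim=-1), F.normalize(c, dim=-1))`. Tying the margin to the batch's mean negative similarity makes the penalty track how hard the current negatives are, which is one way to realize the "adaptive mean margin" idea the abstract describes.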