Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Cited by: 16

Authors:

Monfort, Mathew [1]

Jin, SouYoung [1]

Liu, Alexander [1]

Harwath, David [2]

Feris, Rogerio [3]

Glass, James [1]

Oliva, Aude [1]

Affiliations:

[1] MIT, Cambridge, MA 02139 USA

[2] UT Austin, Austin, TX USA

[3] IBM Res, Yorktown Hts, NY USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

DOI:

10.1109/CVPR46437.2021.01463

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale the size of a large classification dataset. In order to utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets. http://moments.csail.mit.edu/spoken.html

Pages: 14866-14876

Number of pages: 11

