Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Cited by: 16
Authors
Monfort, Mathew [1]
Jin, SouYoung [1]
Liu, Alexander [1]
Harwath, David [2]
Feris, Rogerio [3]
Glass, James [1]
Oliva, Aude [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] UT Austin, Austin, TX USA
[3] IBM Res, Yorktown Hts, NY USA
DOI
10.1109/CVPR46437.2021.01463
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people generate for videos of different dynamic events can greatly improve our understanding of the key information of interest in each video. These descriptions can be captured in captions that provide expanded attributes for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing us to gain new insight into what people find important or necessary to summarize specific events. Existing caption datasets for video understanding are either small in scale or restricted to a specific domain. To address this, we present the Spoken Moments (S-MiT) dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible while allowing us to scale the size of a large classification dataset. In order to utilize our proposed dataset, we present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and evaluate our models on video/caption retrieval on multiple datasets. We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets. http://moments.csail.mit.edu/spoken.html
Pages: 14866-14876
Page count: 11