ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

Cited by: 3
Authors
Lee, Sangho [1]
Chung, Jiwan [1]
Yu, Youngjae [1]
Kim, Gunhee [1]
Breuel, Thomas [2]
Chechik, Gal [2]
Song, Yale [3]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] NVIDIA Res, Shanghai, Peoples R China
[3] Microsoft Res, Redmond, WA USA
Keywords
INFORMATION; CLUSTERINGS
DOI
10.1109/ICCV48922.2021.01011
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised approaches rely on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization, where the objective is to maximize the mutual information between the audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performance compared to models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: we release ACAV100M, which contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
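The subset-optimization idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes each clip has already been assigned one discrete audio cluster label and one discrete visual cluster label (the abstract does not specify how mutual information is estimated), and it uses a naive greedy search that would not scale to millions of videos. The function names `mutual_information` and `greedy_select` are hypothetical.

```python
import numpy as np


def mutual_information(a, v):
    """Mutual information (in nats) between two discrete label arrays."""
    a = np.asarray(a)
    v = np.asarray(v)
    if a.size == 0:
        return 0.0
    # Map arbitrary labels to contiguous integer indices.
    _, a_idx = np.unique(a, return_inverse=True)
    _, v_idx = np.unique(v, return_inverse=True)
    # Build the joint distribution over (audio label, visual label).
    joint = np.zeros((a_idx.max() + 1, v_idx.max() + 1))
    np.add.at(joint, (a_idx, v_idx), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)  # audio marginal
    pv = joint.sum(axis=0, keepdims=True)  # visual marginal
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pv)[nz])).sum())


def greedy_select(audio_labels, visual_labels, k):
    """Greedily pick up to k clip indices that maximize audio-visual MI.

    Illustrative only (assumed setup): each step re-evaluates every
    remaining clip, so the cost is quadratic in the number of clips.
    """
    audio_labels = np.asarray(audio_labels)
    visual_labels = np.asarray(visual_labels)
    chosen = []
    remaining = set(range(len(audio_labels)))
    for _ in range(min(k, len(audio_labels))):
        best, best_mi = None, -1.0
        for i in sorted(remaining):
            idx = chosen + [i]
            mi = mutual_information(audio_labels[idx], visual_labels[idx])
            if mi > best_mi:
                best, best_mi = i, mi
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Clips 0-3 have matching audio/visual labels; clips 4-5 do not.
    audio = [0, 0, 1, 1, 0, 1]
    visual = [0, 0, 1, 1, 1, 0]
    print(sorted(greedy_select(audio, visual, 4)))
```

In this toy input, the clips whose audio and visual labels agree are the ones that keep the joint distribution concentrated, so a selection that maximizes mutual information favors them over the mismatched clips.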
Pages: 10254-10264
Page count: 11