ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

Cited by: 3

Authors:

Lee, Sangho [1]

Chung, Jiwan [1]

Yu, Youngjae [1]

Kim, Gunhee [1]

Breuel, Thomas [2]

Chechik, Gal [2]

Song, Yale [3]

Affiliations:

[1] Seoul Natl Univ, Seoul, South Korea

[2] NVIDIA Res, Shanghai, Peoples R China

[3] Microsoft Res, Redmond, WA USA

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Keywords:

INFORMATION; CLUSTERINGS;

DOI:

10.1109/ICCV48922.2021.01011

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised approaches rely on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization where the objective is to maximize the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performance compared to models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: We release ACAV100M, which contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
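
The idea of selecting a subset that maximizes mutual information between audio and visual channels can be illustrated with a small sketch. This is not the authors' algorithm; it is a toy example under loud assumptions: each video is assumed to have already been assigned a discrete audio cluster label and a discrete visual cluster label (a hypothetical preprocessing step), mutual information is computed from the empirical joint counts of those labels, and a naive greedy loop (a swapped-in, simplistic optimizer) picks videos one at a time. The function names `mutual_information` and `greedy_subset` are invented for illustration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete labelings.

    `pairs` is a list of (audio_label, visual_label) tuples, one per video.
    The labels are assumed to come from some prior clustering step.
    """
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                  # counts of (audio, visual) pairs
    p_audio = Counter(a for a, _ in pairs)  # marginal counts, audio labels
    p_visual = Counter(v for _, v in pairs) # marginal counts, visual labels
    mi = 0.0
    for (a, v), c in joint.items():
        # p(a, v) * log( p(a, v) / (p(a) * p(v)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (p_audio[a] * p_visual[v]))
    return mi


def greedy_subset(pairs, k):
    """Greedily choose up to k indices whose label pairs maximize subset MI.

    Naive illustration only: each step re-evaluates MI for every candidate,
    so this does not scale to large collections.
    """
    chosen = []
    remaining = list(range(len(pairs)))
    for _ in range(min(k, len(pairs))):
        best = max(
            remaining,
            key=lambda i: mutual_information([pairs[j] for j in chosen] + [pairs[i]]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical labels: videos 1 and 3 have mismatched audio/visual clusters.
    labels = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
    subset = greedy_subset(labels, 4)
    print(sorted(subset), mutual_information([labels[i] for i in subset]))
```

In this toy setting the greedy loop tends to keep videos whose audio and visual cluster labels agree with each other, which is the intuition behind using mutual information as a proxy for audio-visual correspondence.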

Citation

Pages: 10254-10264

Page count: 11
