VGGSOUND: A LARGE-SCALE AUDIO-VISUAL DATASET

Cited by: 0

Authors:

Chen, Honglie [1]

Xie, Weidi [1]

Vedaldi, Andrea [1]

Zisserman, Andrew [1]

Affiliations:

[1] Univ Oxford, VGG, Dept Engn Sci, Oxford, England

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

audio recognition; audio-visual correspondence; large-scale; dataset; convolutional neural network; EVENTS;

DOI:

10.1109/icassp40776.2020.9053174

Chinese Library Classification:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Our goal is to collect a large-scale audio-visual dataset with low label noise from videos 'in the wild' using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 200k videos for 300 audio classes. Third, we investigate various Convolutional Neural Network (CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions.

Pages: 721-725

Page count: 5

50 records in total

[1] Audio-visual large-scale video copy detection
Liu, Yang
Xu, Changsheng
Lu, Hanqing
INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS, 2011, 88 (18) : 3803 - 3816
[2] AVCAffe: A Large Scale Audio-Visual Dataset of Cognitive Load and Affect for Remote Work
Sarkar, Pritam
Posen, Aaron
Etemad, Ali
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 76 - 85
[3] A Large-scale Depth-based Multimodal Audio-Visual Corpus in Mandarin
Wang, Jianrong
Wang, Liyuan
Zhang, Ju
Yu, Mei
Yu, Ruiguo
Wei, Jianguo
IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 881 - 885
[4] TOWARDS A LARGE-SCALE AUDIO-VISUAL CORPUS FOR RESEARCH ON AMYOTROPHIC LATERAL SCLEROSIS
Anvar, Aria
Suendermann-Oeft, David
Pautler, David
Ramanarayanan, Vikram
Kumm, Jochen
Norel, Raquel
Fraenkel, Ernest
Navar, Indu
NEUROLOGY, 2021, 96 (15)
[5] AUDIO-VISUAL COUNSELING SCALE
GRIFFIN, GG
PERSONNEL AND GUIDANCE JOURNAL, 1968, 46 (07) : 690 - 693
[6] Multilingual Audio-Visual Smartphone Dataset and Evaluation
Mandalapu, Hareesh
Reddy, P. N. Aravinda
Ramachandra, Raghavendra
Rao, Krothapalli Sreenivasa
Mitra, Pabitra
Prasanna, S. R. Mahadeva
Busch, Christoph
IEEE ACCESS, 2021, 9 : 153240 - 153257
[7] Solos: A Dataset for Audio-Visual Music Analysis
Montesinos, Juan F.
Slizovskaia, Olga
Haro, Gloria
2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2020,
[8] The Audio-Visual Arabic Dataset for Natural Emotions
Abu Shaqra, Ftoon
Duwairi, Rehab
Al-Ayyoub, Mahmoud
2019 7TH INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD (FICLOUD 2019), 2019, : 324 - 329
[9] Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Geng, Tiantian
Wang, Teng
Duan, Jinming
Cong, Runmin
Zheng, Feng
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22942 - 22951
[10] Large-Scale Processing, Indexing and Search System for Czech Audio-Visual Cultural Heritage Archives
Nouza, Jan
Blavka, Karel
Zdansky, Jindrich
Cerva, Petr
Silovsky, Jan
Bohac, Marek
Chaloupka, Josef
Kucharova, Michaela
Seps, Ladislav
2012 IEEE 14TH INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2012, : 337 - 342
