VGGSOUND: A LARGE-SCALE AUDIO-VISUAL DATASET

Authors
Chen, Honglie [1]
Xie, Weidi [1]
Vedaldi, Andrea [1]
Zisserman, Andrew [1]
Affiliations
[1] Univ Oxford, VGG, Dept Engn Sci, Oxford, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
audio recognition; audio-visual correspondence; large-scale; dataset; convolutional neural network; EVENTS;
DOI
10.1109/icassp40776.2020.9053174
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos 'in the wild' using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 200k videos for 300 audio classes. Third, we investigate various Convolutional Neural Network (CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions.
Pages: 721-725
Number of pages: 5