Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment

Cited by: 1
Authors
Wang, Shanshan [1 ]
Politis, Archontis [1 ]
Mesaros, Annamaria [1 ]
Virtanen, Tuomas [1 ]
Affiliation
[1] Tampere Univ, Fac Informat Technol & Commun Sci, Tampere 33100, Finland
Funding
Academy of Finland
Keywords
Audio classification; audio-visual correspondence; audio-visual data; audio-visual spatial alignment; feature learning; self-supervised learning; LOCALIZATION;
DOI
10.1109/JSTSP.2022.3180592
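Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification codes
0808 ; 0809 ;
Abstract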
Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selecting visual objects using object detection and beamforming the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement on AVSA for the first-order Ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings significant performance increases for the human action recognition downstream task. A number of audio-only downstream tasks are devised for testing the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from Ambisonic and binaural audio.
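The abstract's idea of beamforming Ambisonics audio towards a detected object can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes first-order Ambisonics with SN3D-normalised channels W, X, Y, Z, an equirectangular 360° frame whose centre is azimuth 0 with azimuth increasing to the left, and a simple cardioid beam. The function names pixel_to_direction, encode_plane_wave, and steer_cardioid are hypothetical.

```python
import math


def pixel_to_direction(px, py, width, height):
    """Map an equirectangular pixel to (azimuth, elevation) in radians.

    Assumed convention: the image centre is azimuth 0 and elevation 0,
    azimuth grows towards the left edge, elevation grows upwards.
    """
    azimuth = (0.5 - px / width) * 2.0 * math.pi
    elevation = (0.5 - py / height) * math.pi
    return azimuth, elevation


def unit_vector(azimuth, elevation):
    """Cartesian unit vector for a direction given in radians."""
    return (
        math.cos(azimuth) * math.cos(elevation),
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
    )


def encode_plane_wave(signal, azimuth, elevation):
    """Encode a mono signal as a first-order plane wave (SN3D assumed)."""
    ux, uy, uz = unit_vector(azimuth, elevation)
    w = list(signal)
    x = [ux * s for s in signal]
    y = [uy * s for s in signal]
    z = [uz * s for s in signal]
    return w, x, y, z


def steer_cardioid(w, x, y, z, azimuth, elevation):
    """Point a first-order cardioid beam at (azimuth, elevation).

    The beam has unit gain in the steering direction and zero gain
    in the opposite direction.
    """
    ux, uy, uz = unit_vector(azimuth, elevation)
    return [
        0.5 * (wi + ux * xi + uy * yi + uz * zi)
        for wi, xi, yi, zi in zip(w, x, y, z)
    ]


# Usage: a detected object centred at pixel (480, 360) in a 1920x960 frame.
az, el = pixel_to_direction(480, 360, 1920, 960)
source = [0.0, 0.5, 1.0, -0.5]
w, x, y, z = encode_plane_wave(source, az, el)
towards = steer_cardioid(w, x, y, z, az, el)
away = steer_cardioid(w, x, y, z, az + math.pi, -el)
```

Under these assumptions, the beam steered at the object's direction passes the source unchanged, while the beam steered at the opposite direction cancels it. The paper itself may use a different beamformer, channel convention, or coordinate mapping.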
Pages: 1467-1479
Page count: 13