Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds

被引:2
|
作者
Sato, Tomoya [1 ]
Sugano, Yusuke [1 ]
Sato, Yoichi [1 ]
机构
[1] Univ Tokyo, Inst Ind Sci, Meguro 1538505, Japan
来源
IEEE ACCESS | 2022年 / 10卷
关键词
Computer vision; feature extraction; machine learning; self-supervised learning; audio-visual learning; cross-modal retrieval;
D O I
10.1109/ACCESS.2022.3204305
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods mainly focus on the semantic correspondence between videos and monaural sounds, and spatial information of sound sources has not been considered. However, sound locations are critical for understanding the sound environment. To this end, it is necessary to acquire cross-modal features that reflect the semantic and spatial relationship between videos and sounds. A video with stereo sound, which has become commonly used, provides the direction of arrival of each sound source in addition to the category information. This indicates its potential to acquire a desired cross-modal feature space. In this paper, we propose a novel self-supervised approach to learn a cross-modal feature representation that captures both the category and location of each sound source using stereo sound as input. For a set of unlabeled videos, the proposed method generates three kinds of audio-visual pairs: 1) perfectly matched pairs from the same video, 2) pairs from the same video but with the flipped stereo sound, and 3) pairs from a different video. The cross-modal feature encoder of the proposed method is trained on triplet loss to reflect the relationship between these three pairs (1 > 2 > 3). We apply this method to cross-modal image/audio retrieval. Compared with previous audio-visual pretext tasks, the proposed method shows significant improvement in both real and synthetic datasets.
引用
收藏
页码:94273 / 94284
页数:12
相关论文
共 50 条
  • [1] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [2] SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION
    Ding, Yifan
    Xu, Yong
    Zhang, Shi-Xiong
    Cong, Yahuan
    Wang, Liqiang
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4367 - 4371
  • [3] Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning
    Tellamekala, Mani Kumar
    Valstar, Michel
    Pound, Michael
    Giesbrecht, Timo
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9912 - 9919
  • [4] Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning
    Zhang, Jingran
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Lu, Xin
    Shen, Heng Tao
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 3351 - 3359
  • [5] SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
    Liu, Yang
    Tan, Ying
    Lan, Haoyuan
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1000 - 1004
  • [6] Learning Self-supervised Audio-Visual Representations for Sound Recommendations
    Krishnamurthy, Sudha
    [J]. ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II, 2021, 13018 : 124 - 138
  • [7] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
    Terbouche, Hacene
    Schoneveld, Liam
    Benson, Oisin
    Othmani, Alice
    [J]. IEEE ACCESS, 2022, 10 : 41622 - 41638
  • [8] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [9] SELF-SUPERVISED AUDIO-VISUAL CO-SEGMENTATION
    Rouditchenko, Andrew
    Zhao, Hang
    Gan, Chuang
    McDermott, Josh
    Torralba, Antonio
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2357 - 2361
  • [10] Robust Self-Supervised Audio-Visual Speech Recognition
    Shi, Bowen
    Hsu, Wei-Ning
    Mohamed, Abdelrahman
    [J]. INTERSPEECH 2022, 2022, : 2118 - 2122