Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis

Authors
Suzuki, Kei [1]
Itoyama, Katsutoshi [1, 2]
Nishida, Kenji [1]
Nakadai, Kazuhiro [1]
Affiliations
[1] Tokyo Inst Technol, Tokyo 1528552, Japan
[2] Honda Res Inst Japan Co Ltd, Saitama 3510188, Japan
DOI
10.1109/SII55687.2023.10039379
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper proposes a novel audio-visual (AV) class association method based on contrastive learning that can capture not only one-to-one but also one-to-many and many-to-many correspondences between audio and visual classes, as well as classes with no correspondence. The proposed method consists of two training stages. The first stage, "correspondence" training, uses self-supervised contrastive learning to learn one-to-one, one-to-many, and many-to-many correspondences under the criterion that corresponding AV pairs are pulled close together and non-corresponding pairs are pushed far apart. The second stage, "non-correspondence" training, learns non-correspondence relationships through contrastive learning on a dataset consisting of pairs of visual and audio classes that have no correspondence. To build such a dataset, we use the trends of change in the class embeddings to split the set of all classes into two subsets: classes with AV correspondence and classes without. The model trained with the proposed method was evaluated by the F1-score of the class embeddings and by an indoor experiment on mapping two sound sources. The F1-score was 74.7% after the first stage and improved by 1.86 points to 76.6% after the second stage, confirming that the proposed method is effective for mapping between general audio and visual classes, including one-to-many, many-to-many, and non-corresponding classes. The indoor experiment showed that our model could predict the correct correspondence even in a real environment.
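The abstract does not state which loss function is used, so the following is only a minimal sketch of the general idea behind the first stage: corresponding AV pairs are pulled together and non-corresponding pairs are pushed apart. It assumes a generic InfoNCE-style objective computed in the audio-to-visual direction with one positive per audio item; the function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper. The one-to-many and many-to-many cases described in the abstract would need more than one positive per item, which this sketch does not cover.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(audio, visual, temperature=0.1):
    """Mean InfoNCE-style loss; audio[i] and visual[i] are assumed to correspond."""
    total = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # negative log-probability of the matching visual embedding
        total += log_denom - logits[i]
    return total / len(audio)

# Toy 2-D embeddings (hypothetical values, not from the paper)
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # visual[i] is close to audio[i]
swapped = [[0.1, 0.9], [0.9, 0.1]]   # visual[i] is close to the other audio item

loss_aligned = info_nce(audio, aligned)
loss_swapped = info_nce(audio, swapped)
print(loss_aligned < loss_swapped)
```

Under these assumptions, the loss is small when each audio embedding is closest to its paired visual embedding and large when the pairing is swapped, which is the behavior the "close" and "far" criterion describes.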
Pages: 6