Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis

Cited by: 0

Authors:

Suzuki, Kei [1]

Itoyama, Katsutoshi [1,2]

Nishida, Kenji [1]

Nakadai, Kazuhiro [1]

Affiliations:

[1] Tokyo Inst Technol, Tokyo 1528552, Japan

[2] Honda Res Inst Japan Co Ltd, Saitama 3510188, Japan

Source:

2023 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION, SII | 2023

Keywords:

DOI:

10.1109/SII55687.2023.10039379

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203; 0835

Abstract:

This paper proposes a novel audio-visual (AV) class association method based on contrastive learning that can capture not only one-to-one but also one-to-many, many-to-many, and even absent correspondences between audio and visual classes. The proposed method consists of two training stages. In the first stage, "correspondence" training, self-supervised contrastive learning is used to learn one-to-one, one-to-many, and many-to-many correspondences under the criterion that corresponding AV pairs move close together and non-corresponding pairs move far apart. In the second stage, "non-correspondence" training, non-correspondence relationships are learned through contrastive learning on a dataset consisting of pairs of visual and audio classes that have no correspondence. To build such a dataset, we use the trends of change in the class embeddings to split the set of all classes into two subsets: those with AV correspondence and those without. The model trained with the proposed method was evaluated by the F1-score for the class embedding and by an indoor experiment on mapping two sound sources. The F1-score was 74.7% after the first stage and improved by 1.86 points to 76.6% after the second stage, confirming that the proposed method is effective for mapping between general audio and visual classes, including one-to-many, many-to-many, and non-corresponding classes. The indoor experiment showed that our model could predict the correct correspondence even in a real environment.
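
The criterion described in the abstract (corresponding AV pairs close, non-corresponding pairs far) can be illustrated with a generic margin-based pairwise contrastive loss. The abstract does not give the paper's actual loss function, so this is only a minimal sketch under stated assumptions: the function names, the Euclidean distance, the `margin` value, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_contrastive_loss(audio_emb, visual_emb, corresponds, margin=1.0):
    """Classical margin-based contrastive loss for one audio-visual pair.

    Assumption: this is a swapped-in textbook loss, not the paper's loss.
    - Corresponding pair: penalize the squared distance (pull together).
    - Non-corresponding pair: penalize only if closer than `margin`
      (push apart until at least `margin` away).
    """
    d = euclidean(audio_emb, visual_emb)
    if corresponds:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical toy class embeddings (illustration only).
dog_bark_audio = [1.0, 0.0]
dog_image = [0.9, 0.1]
wind_noise_audio = [0.0, 1.0]  # assumed to have no visual counterpart

# A corresponding pair that is already close yields a small loss.
loss_pos = pair_contrastive_loss(dog_bark_audio, dog_image, corresponds=True)

# A non-corresponding pair already farther apart than the margin yields zero.
loss_neg = pair_contrastive_loss(wind_noise_audio, dog_image, corresponds=False)

# A non-corresponding pair with identical embeddings is penalized by margin**2.
loss_collapsed = pair_contrastive_loss([0.0, 0.0], [0.0, 0.0], corresponds=False)
```

In this sketch, pairs from a "no correspondence" subset such as the one the second stage builds would always be passed with `corresponds=False`, so the loss only pushes their embeddings apart.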


Pages: 6
