Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis

Cited: 0
Authors
Suzuki, Kei [1 ]
Itoyama, Katsutoshi [1 ,2 ]
Nishida, Kenji [1 ]
Nakadai, Kazuhiro [1 ]
Affiliations
[1] Tokyo Inst Technol, Tokyo 1528552, Japan
[2] Honda Res Inst Japan Co Ltd, Saitama 3510188, Japan
Source
2023 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION, SII | 2023
Keywords
DOI
10.1109/SII55687.2023.10039379
CLC Number
TP39 [Computer Applications];
Discipline Codes
081203 ; 0835 ;
Abstract
This paper proposes a novel audio and visual class association method based on contrastive learning that can obtain not only one-to-one but also one-to-many, many-to-many, and even no correspondence between audio and visual classes. The proposed method consists of two training stages. In the first stage, for "correspondence" training, one-to-one, one-to-many, and many-to-many correspondences are learned with self-supervised contrastive learning under a criterion that pulls corresponding AV pairs close and pushes non-corresponding pairs apart. In the second stage, for "non-correspondence" training, those relationships are acquired through contrastive learning on a dataset consisting of pairs of visual and audio classes that have no correspondence. To build such a dataset, we exploit the trends of change in the class embeddings and split the set of all classes into two subsets: with AV correspondence and without AV correspondence. The trained model was evaluated by the F1-score on the class embeddings and by an indoor experiment on mapping two sound sources. The F1-score was 74.7% after the first stage and improved by 1.86 points to 76.6% after the second stage, confirming that the proposed method is effective for mapping between general audio and visual classes, including one-to-many, many-to-many, and non-corresponding classes. The indoor experiment showed that the model predicts correct correspondences even in a real environment.
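The first-stage criterion described above (corresponding AV pairs become close, non-corresponding pairs become far) can be illustrated with a standard symmetric InfoNCE-style contrastive loss. This is a minimal NumPy sketch of that generic loss, not the authors' exact formulation; the function name, temperature value, and the assumption that row i of the audio and visual embedding matrices form a corresponding pair are illustrative choices.

```python
import numpy as np

def info_nce_loss(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss.

    Row i of audio_emb and row i of visual_emb are assumed to be a
    corresponding audio-visual (AV) class pair; all other rows in the
    batch serve as negatives, so matching pairs are pulled close and
    non-matching pairs are pushed apart.
    """
    # L2-normalise so the dot product equals cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (N, N) pairwise similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (true pairs) as the positive class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-visual and visual-to-audio directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

As a sanity check, embeddings whose rows are aligned (near-duplicates of each other) should yield a lower loss than randomly paired embeddings, which is exactly the close/far criterion the first training stage enforces.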
Pages: 6