Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis

Cited: 0
Authors
Suzuki, Kei [1 ]
Itoyama, Katsutoshi [1 ,2 ]
Nishida, Kenji [1 ]
Nakadai, Kazuhiro [1 ]
Affiliations
[1] Tokyo Inst Technol, Tokyo 1528552, Japan
[2] Honda Res Inst Japan Co Ltd, Saitama 3510188, Japan
Source
2023 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION, SII | 2023
Keywords
DOI
10.1109/SII55687.2023.10039379
CLC Number
TP39 [Computer Applications];
Discipline Codes
081203 ; 0835 ;
Abstract
This paper proposes a novel audio and visual class association method based on contrastive learning that can obtain not only one-to-one but also one-to-many, many-to-many, and even no correspondence between audio and visual classes. The proposed method consists of two training stages. In the first stage, for "correspondence" training, one-to-one, one-to-many, and many-to-many correspondences are learned with self-supervised contrastive learning under a criterion that pulls corresponding AV pairs close and pushes non-corresponding pairs apart. In the second stage, for "non-correspondence" training, those relationships are acquired through contrastive learning on a dataset consisting of pairs of visual and audio classes that have no correspondence. To build such a dataset, we exploit the trends of change in the class embeddings and split the set of all classes into two subsets: with AV correspondence and without AV correspondence. The trained model was evaluated by the F1-score on the class embeddings and by an indoor experiment on mapping two sound sources. The F1-score was 74.7% after the first stage and improved by 1.86 points to 76.6% after the second stage, confirming that the proposed method is effective for mapping between general audio and visual classes, including one-to-many, many-to-many, and non-corresponding classes. The indoor experiment showed that the model predicts correct correspondences even in a real environment.
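The first-stage criterion described above (corresponding AV pairs become close, non-corresponding pairs become far) can be illustrated with a standard symmetric InfoNCE-style contrastive loss. This is a minimal NumPy sketch of that generic loss, not the authors' exact formulation; the function name, temperature value, and the assumption that row i of the audio and visual embedding matrices form a corresponding pair are illustrative choices.

```python
import numpy as np

def info_nce_loss(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss.

    Row i of audio_emb and row i of visual_emb are assumed to be a
    corresponding audio-visual (AV) class pair; all other rows in the
    batch serve as negatives, so matching pairs are pulled close and
    non-matching pairs are pushed apart.
    """
    # L2-normalise so the dot product equals cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (N, N) pairwise similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (true pairs) as the positive class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-visual and visual-to-audio directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

As a sanity check, embeddings whose rows are aligned (near-duplicates of each other) should yield a lower loss than randomly paired embeddings, which is exactly the close/far criterion the first training stage enforces.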
Pages: 6