Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds

Cited by: 2

Authors:

Sato, Tomoya [1]

Sugano, Yusuke [1]

Sato, Yoichi [1]

Affiliations:

[1] Univ Tokyo, Inst Ind Sci, Meguro 1538505, Japan

Source:

IEEE ACCESS | 2022 / Vol. 10

Keywords:

Computer vision; feature extraction; machine learning; self-supervised learning; audio-visual learning; cross-modal retrieval;

DOI:

10.1109/ACCESS.2022.3204305

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods focus on the semantic correspondence between videos and monaural sounds, and the spatial information of sound sources has not been considered. However, sound locations are critical for understanding the sound environment. To this end, it is necessary to acquire cross-modal features that reflect both the semantic and the spatial relationship between videos and sounds. A video with stereo sound, which has become commonly used, provides the direction of arrival of each sound source in addition to the category information. This indicates its potential for acquiring the desired cross-modal feature space. In this paper, we propose a novel self-supervised approach to learn a cross-modal feature representation that captures both the category and location of each sound source using stereo sound as input. For a set of unlabeled videos, the proposed method generates three kinds of audio-visual pairs: 1) perfectly matched pairs from the same video, 2) pairs from the same video but with the flipped stereo sound, and 3) pairs from a different video. The cross-modal feature encoder of the proposed method is trained with a triplet loss to reflect the relationship between these three pairs (1 > 2 > 3). We apply this method to cross-modal image/audio retrieval. Compared with previous audio-visual pretext tasks, the proposed method shows significant improvement on both real and synthetic datasets.
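
The pair construction and the ordering (1 > 2 > 3) described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the two-term hinge form, the margin value, the sample representation, and all names here (`flip_stereo`, `make_pairs`, `ranking_loss`) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Assumed representation: stereo audio is a list of (left, right) samples,
# and a clip is a (frame_id, stereo_audio) tuple. Both are hypothetical.
StereoAudio = List[Tuple[float, float]]
Clip = Tuple[str, StereoAudio]


def flip_stereo(audio: StereoAudio) -> StereoAudio:
    """Swap the left and right channels of every sample."""
    return [(right, left) for left, right in audio]


def make_pairs(clips: List[Clip], i: int, j: int) -> Dict[str, Tuple[str, StereoAudio]]:
    """Build the three kinds of audio-visual pairs for the frame of clip i.

    Clip j supplies the audio for the mismatched pair and must differ from i.
    """
    if i == j:
        raise ValueError("the mismatched pair needs a different video")
    frame, audio = clips[i]
    return {
        "matched": (frame, audio),                # 1) same video
        "flipped": (frame, flip_stereo(audio)),   # 2) same video, channels swapped
        "mismatched": (frame, clips[j][1]),       # 3) different video
    }


def hinge(better: float, worse: float, margin: float) -> float:
    """Penalty that is zero once `better` exceeds `worse` by at least `margin`."""
    return max(0.0, margin - (better - worse))


def ranking_loss(s_matched: float, s_flipped: float, s_mismatched: float,
                 margin: float = 0.2) -> float:
    """Assumed two-term hinge loss encouraging matched > flipped > mismatched.

    The inputs are audio-visual similarity scores produced by some encoder,
    which is outside the scope of this sketch.
    """
    return hinge(s_matched, s_flipped, margin) + hinge(s_flipped, s_mismatched, margin)
```

When the similarity scores already respect the ordering by at least the margin, the loss in this sketch is zero; any violation of matched > flipped or flipped > mismatched contributes a positive penalty.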

Pages: 94273-94284

Page count: 12

50 items in total

[21] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Li, Yidi
Liu, Hong
Tang, Hao
[J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
[22] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
Ishikawa, Reina
Hachiuma, Ryo
Saito, Hideo
[J]. IEEE ACCESS, 2021, 9 : 64346 - 64357
[23] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
Ishikawa, Reina
Hachiuma, Ryo
Kurobe, Akiyoshi
Saito, Hideo
[J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
[24] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Cheng, Ying
Wang, Ruize
Pan, Zhihao
Feng, Rui
Zhang, Yuejie
[J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
[25] Self-Supervised Learning for Videos: A Survey
Schiappa, Madeline C.
Rawat, Yogesh S.
Shah, Mubarak
[J]. ACM COMPUTING SURVEYS, 2023, 55 (13S)
[26] Audio self-supervised learning: A survey
Liu, Shuo
Mallol-Ragolta, Adria
Parada-Cabaleiro, Emilia
Qian, Kun
Jing, Xin
Kathan, Alexander
Hu, Bin
Schuller, Bjorn W.
[J]. PATTERNS, 2022, 3 (12):
[27] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
Sun, Licai
Lian, Zheng
Liu, Bin
Tao, Jianhua
[J]. Information Fusion, 2024, 108
[28] AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
Yang, Yizhuo
Yuan, Shenghai
Cao, Muqing
Yang, Jianfei
Xie, Lihua
[J]. 2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1871 - 1877
[29] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
Sun, Licai
Lian, Zheng
Liu, Bin
Tao, Jianhua
[J]. INFORMATION FUSION, 2024, 108
[30] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
Ran, Yue
Tang, Hongying
Li, Baoqing
Wang, Guohui
[J]. APPLIED SCIENCES-BASEL, 2022, 12 (24):
