Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1

Authors:

Krishnamurthy, Sudha [1]

Affiliations:

[1] Sony Interactive Entertainment, San Mateo, CA 94404 USA

Source:

ADVANCES IN VISUAL COMPUTING (ISVC 2021), PT II | 2021 / Vol. 13018

Keywords:

Self-supervision; Representation learning; Cross-modal correlation

DOI:

10.1007/978-3-030-90436-4_10

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model on two tasks: classifying audio-visual correlation and recommending sound effects for visual scenes. Our results show that, compared to the baseline, the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.

Pages: 124-138

Page count: 15
