Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1
Authors
Krishnamurthy, Sudha [1]
Affiliations
[1] Sony Interactive Entertainment, San Mateo, CA 94404 USA
Keywords
Self-supervision; Representation learning; Cross-modal correlation;
DOI
10.1007/978-3-030-90436-4_10
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses these attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model to classify audio-visual correlation as well as to recommend sound effects for visual scenes. Our results show that the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% over the baseline on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
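The cross-modal contrastive learning mentioned in the abstract is commonly realized as a symmetric InfoNCE objective: audio and visual embeddings from the same video form positive pairs, while all other pairings in the batch act as negatives. The NumPy sketch below illustrates that general formulation under stated assumptions; it is not the paper's actual implementation, and the function name `cross_modal_infonce` and the temperature value are hypothetical.

```python
import numpy as np

def cross_modal_infonce(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired audio/visual embeddings.

    Row i of audio_emb and row i of visual_emb are assumed to come from the
    same video (positive pair); every other row is a negative.
    """
    # L2-normalize so the dot product is a cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(a))       # diagonal entries are the positives

    def nll(lg):
        # cross-entropy of each row against its matching diagonal index
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the audio->visual and visual->audio directions
    return 0.5 * (nll(logits) + nll(logits.T))
```

Minimizing this loss pulls matching audio and visual embeddings together while pushing apart embeddings from different videos, which is what makes the learned representations useful for ranking candidate sounds against a visual scene.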
Pages: 124 / 138
Page count: 15
Related Papers
50 records in total
  • [31] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
  • [32] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 64346 - 64357
  • [33] Self-Supervised Learning of Audio Representations From Permutations With Differentiable Ranking
    Carr, Andrew N.
    Berthet, Quentin
    Blondel, Mathieu
    Teboul, Olivier
    Zeghidour, Neil
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 708 - 712
  • [34] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [35] Self-Supervised Visual Representations Learning by Contrastive Mask Prediction
    Zhao, Yucheng
    Wang, Guangting
    Luo, Chong
    Zeng, Wenjun
    Zha, Zheng-Jun
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 10140 - 10149
  • [36] Towards Efficient and Effective Self-supervised Learning of Visual Representations
    Addepalli, Sravanti
    Bhogale, Kaushal
    Dey, Priyam
    Babu, R. Venkatesh
    COMPUTER VISION, ECCV 2022, PT XXXI, 2022, 13691 : 523 - 538
  • [37] Audio self-supervised learning: A survey
    Liu, Shuo
    Mallol-Ragolta, Adria
    Parada-Cabaleiro, Emilia
    Qian, Kun
    Jing, Xin
    Kathan, Alexander
    Hu, Bin
    Schuller, Bjorn W.
    PATTERNS, 2022, 3 (12):
  • [38] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
    Sun, Licai
    Lian, Zheng
    Liu, Bin
    Tao, Jianhua
    INFORMATION FUSION, 2024, 108
  • [39] AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
    Yang, Yizhuo
    Yuan, Shenghai
    Cao, Muqing
    Yang, Jianfei
    Xie, Lihua
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1871 - 1877