Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1
Authors
Krishnamurthy, Sudha [1]
Affiliations
[1] Sony Interactive Entertainment, San Mateo, CA 94404 USA
Keywords
Self-supervision; Representation learning; Cross-modal correlation;
DOI
10.1007/978-3-030-90436-4_10
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses these attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model to classify audio-visual correlation as well as to recommend sound effects for visual scenes. Our results show that the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% over the baseline on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
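The cross-modal contrastive learning mentioned in the abstract is commonly realized as a symmetric InfoNCE objective: audio and visual embeddings from the same video form positive pairs, while all other pairings in the batch act as negatives. The NumPy sketch below illustrates that general formulation under stated assumptions; it is not the paper's actual implementation, and the function name `cross_modal_infonce` and the temperature value are hypothetical.

```python
import numpy as np

def cross_modal_infonce(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired audio/visual embeddings.

    Row i of audio_emb and row i of visual_emb are assumed to come from the
    same video (positive pair); every other row is a negative.
    """
    # L2-normalize so the dot product is a cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(a))       # diagonal entries are the positives

    def nll(lg):
        # cross-entropy of each row against its matching diagonal index
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the audio->visual and visual->audio directions
    return 0.5 * (nll(logits) + nll(logits.T))
```

Minimizing this loss pulls matching audio and visual embeddings together while pushing apart embeddings from different videos, which is what makes the learned representations useful for ranking candidate sounds against a visual scene.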
Pages: 124 / 138
Page count: 15
Related Papers
50 records in total
  • [31] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
  • [32] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 64346 - 64357
  • [33] Self-Supervised Learning of Audio Representations From Permutations With Differentiable Ranking
    Carr, Andrew N.
    Berthet, Quentin
    Blondel, Mathieu
    Teboul, Olivier
    Zeghidour, Neil
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 708 - 712
  • [34] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [35] Self-Supervised Visual Representations Learning by Contrastive Mask Prediction
    Zhao, Yucheng
    Wang, Guangting
    Luo, Chong
    Zeng, Wenjun
    Zha, Zheng-Jun
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 10140 - 10149
  • [36] Towards Efficient and Effective Self-supervised Learning of Visual Representations
    Addepalli, Sravanti
    Bhogale, Kaushal
    Dey, Priyam
    Babu, R. Venkatesh
    COMPUTER VISION, ECCV 2022, PT XXXI, 2022, 13691 : 523 - 538
  • [37] Audio self-supervised learning: A survey
    Liu, Shuo
    Mallol-Ragolta, Adria
    Parada-Cabaleiro, Emilia
    Qian, Kun
    Jing, Xin
    Kathan, Alexander
    Hu, Bin
    Schuller, Bjorn W.
    PATTERNS, 2022, 3 (12):
  • [38] HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
    Sun, Licai
    Lian, Zheng
    Liu, Bin
    Tao, Jianhua
    INFORMATION FUSION, 2024, 108
  • [39] AV-PedAware: Self-Supervised Audio-Visual Fusion for Dynamic Pedestrian Awareness
    Yang, Yizhuo
    Yuan, Shenghai
    Cao, Muqing
    Yang, Jianfei
    Xie, Lihua
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 1871 - 1877