Learning Self-supervised Audio-Visual Representations for Sound Recommendations

Cited by: 1
Authors
Krishnamurthy, Sudha [1]
Affiliations
[1] Sony Interact Entertainment, San Mateo, CA 94404 USA
Keywords
Self-supervision; Representation learning; Cross-modal correlation
DOI
10.1007/978-3-030-90436-4_10
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model on classifying audio-visual correlation and on recommending sound effects for visual scenes. Our results show that, on VGG-Sound, a public video dataset, the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% compared to the baseline. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
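The abstract names two ingredients without giving formulas: attention over convolutional features at different resolutions, and cross-modal contrastive learning. The NumPy sketch below is only an illustration of how such pieces are commonly written, not the paper's implementation. The scoring vector `w`, the softmax-weighted pooling, the symmetric InfoNCE form of the loss, and the `temperature` value are all assumptions introduced here.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))


def attention_pool(features, w):
    """Fuse per-resolution feature vectors using attention weights.

    features: (batch, n_resolutions, dim) array, one vector per resolution
    w:        (dim,) hypothetical scoring vector (learned in a real model)
    Returns the fused (batch, dim) encoding and the attention weights.
    """
    scores = features @ w                          # (batch, n_resolutions)
    alpha = np.exp(log_softmax(scores, axis=1))    # relative importance, sums to 1
    fused = np.sum(alpha[..., None] * features, axis=1)
    return fused, alpha


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cross_modal_contrastive_loss(audio, visual, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed form of the contrastive objective).

    Row i of `audio` and row i of `visual` are taken to come from the same
    clip (positive pair); all other rows in the batch act as negatives.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature                 # (batch, batch) similarities
    idx = np.arange(len(a))
    audio_to_visual = log_softmax(logits, axis=1)[idx, idx]
    visual_to_audio = log_softmax(logits.T, axis=1)[idx, idx]
    return -0.5 * (audio_to_visual.mean() + visual_to_audio.mean())
```

Under these assumptions, recommending a sound for a visual scene would amount to ranking candidate audio embeddings by cosine similarity to the scene's visual embedding.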
Pages: 124-138
Number of pages: 15