Image-driven Audio-visual Universal Source Separation

Cited by: 0
Authors
Li, Chenxing [1]
Bai, Ye [1]
Wang, Yang [1]
Deng, Feng [1]
Zhao, Yuanyuan [2]
Zhang, Zhuo [1]
Wang, Xiaorui [2]
Affiliations
[1] Kuaishou Technology Co., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Institute of Automation, Beijing, People's Republic of China
Source
Keywords
audio-visual source separation; universal source separation; image-driven target source separation
DOI
10.21437/Interspeech.2023-1309
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper introduces image-driven audio-visual universal source separation (ID-USS) and proposes a model for it, ID-USS-Conformer. ID-USS aims to separate a target source from a mixture, given an input image that is consistent with the target. Importantly, ID-USS focuses only on the sound made by the target in the image, not on a description of the target or the semantic information of the picture. ID-USS-Conformer mainly consists of an EfficientNet-B3-based visual branch and a Conformer-based audio branch. The visual branch extracts the visual cue of the target from the input image. After the audio branch fuses the visual features, ID-USS-Conformer separates the target source from the mixture. We release an ID-USS dataset and verify the effectiveness of ID-USS-Conformer on it. ID-USS-Conformer achieves a 10.139 dB signal-to-distortion ratio improvement on the test set and outperforms the compared methods.
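The signal-to-distortion ratio improvement quoted in the abstract can be illustrated with a minimal sketch. It assumes the plain definition SDR = 10 log10(||s||^2 / ||s - s_hat||^2); the record does not say which SDR variant the paper uses, so the definition, the function names `sdr_db` and `sdr_improvement_db`, and the toy signals are all hypothetical and for illustration only.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB (assumed definition): 10*log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps guards against division by zero or log(0) for silent or perfect signals.
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDR improvement: SDR of the separated output minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical values, not data from the paper).
n_samples = 100
target = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
interference = [0.5 * math.sin(2 * math.pi * 13 * n / n_samples) for n in range(n_samples)]

# The mixture contains the full interference; the "separated" estimate keeps 10% of it.
mixture = [t + i for t, i in zip(target, interference)]
estimate = [t + 0.1 * i for t, i in zip(target, interference)]

sdri = sdr_improvement_db(target, mixture, estimate)
print(f"SDR improvement: {sdri:.3f} dB")
```

In this toy case the residual interference amplitude is reduced by a factor of 10, so its power drops by a factor of 100 and the improvement is about 20 dB. Read the same way, the reported 10.139 dB would correspond to roughly a tenfold reduction in distortion power relative to the unprocessed mixture.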
Pages: 3729-3733
Page count: 5
Related papers
50 items in total
  • [1] Developing an audio-visual speech source separation algorithm
    Sodoyer, D
    Girin, L
    Jutten, C
    Schwartz, JL
    SPEECH COMMUNICATION, 2004, 44 (1-4) : 113 - 125
  • [2] Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
    Chatterjee, Moitreya
    Ahuja, Narendra
    Cherian, Anoop
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [3] Audio-Visual Based Online Multi-Source Separation
    Ong, Jonah
    Vo, Ba Tuong
    Nordholm, Sven
    Vo, Ba-Ngu
    Moratuwage, Diluka
    Shim, Changbeom
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1219 - 1234
  • [4] Information-Driven Active Audio-Visual Source Localization
    Schult, Niclas
    Reineking, Thomas
    Kluss, Thorsten
    Zetzsche, Christoph
    PLOS ONE, 2015, 10 (09)
  • [5] The "Diagram" as the Audio-Visual Image
    Shimoyama, Hiroya
    2019 NICOGRAPH INTERNATIONAL (NICOINT), 2019: 62 - 65
  • [6] Visually Guided Sound Source Separation With Audio-Visual Predictive Coding
    Song, Zengjie
    Zhang, Zhaoxiang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 15528 - 15542
  • [8] Listen and Look: Audio-Visual Matching Assisted Speech Source Separation
    Lu, Rui
    Duan, Zhiyao
    Zhang, Changshui
    IEEE SIGNAL PROCESSING LETTERS, 2018, 25 (09) : 1315 - 1319
  • [9] Move2Hear: Active Audio-Visual Source Separation
    Majumder, Sagnik
    Al-Halah, Ziad
    Grauman, Kristen
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 275 - 285
  • [10] DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation
    Gogate, Mandar
    Adeel, Ahsan
    Marxer, Ricard
    Barker, Jon
    Hussain, Amir
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018: 2723 - 2727