Image-driven Audio-visual Universal Source Separation

Cited by: 0

Authors:

Li, Chenxing [1]

Bai, Ye [1]

Wang, Yang [1]

Deng, Feng [1]

Zhao, Yuanyuan [2]

Zhang, Zhuo [1]

Wang, Xiaorui [2]

Affiliations:

[1] Kuaishou Technol Co, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Keywords:

audio-visual source separation; universal source separation; image-driven target source separation

DOI:

10.21437/Interspeech.2023-1309

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

This paper introduces image-driven audio-visual universal source separation (ID-USS) and proposes a model for it, ID-USS-Conformer. ID-USS aims to separate a target source from a mixture, given an input image that is consistent with the target. Importantly, ID-USS focuses only on the sound made by the target in the image, not on a description of the target or the semantic information of the picture. ID-USS-Conformer consists mainly of an Efficient-b3-based visual branch and a Conformer-based audio branch. The visual branch extracts the visual cue of the target from the input image. After the audio branch fuses the visual features, ID-USS-Conformer separates the target source from the mixture. We release an ID-USS dataset and verify the effectiveness of ID-USS-Conformer on it. ID-USS-Conformer achieves a 10.139 dB signal-to-distortion ratio improvement on the test set and outperforms the compared methods.
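For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a signal-to-distortion ratio improvement (SDRi) can be computed. It assumes the plain energy-ratio definition of SDR; the record does not say which SDR variant the paper uses, so the function names, the toy signals, and the hypothetical separator output below are illustrative assumptions, not the authors' evaluation code.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2).

    Assumption: this is the simple energy-ratio form, not necessarily
    the variant used in the paper.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): a target tone plus an interfering tone.
target = [math.sin(0.1 * n) for n in range(1000)]
interference = [0.5 * math.sin(0.37 * n + 1.0) for n in range(1000)]
mixture = [t + i for t, i in zip(target, interference)]

# Hypothetical separator output that leaves 10% of the interference amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interference)]

print(round(sdr_improvement_db(target, mixture, estimate), 3))
```

In this toy case the residual interference amplitude drops by a factor of 10, so the improvement is about 20 dB. The 10.139 dB figure in the abstract is the improvement reported on the authors' own test set.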


Pages: 3729-3733

Number of pages: 5
