Image-driven Audio-visual Universal Source Separation

Cited by: 0

Authors:

Li, Chenxing [1]

Bai, Ye [1]

Wang, Yang [1]

Deng, Feng [1]

Zhao, Yuanyuan [2]

Zhang, Zhuo [1]

Wang, Xiaorui [2]

Affiliations:

[1] Kuaishou Technology Co., Beijing, People's Republic of China

[2] Chinese Academy of Sciences, Institute of Automation, Beijing, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Keywords:

audio-visual source separation; universal source separation; image-driven target source separation

DOI:

10.21437/Interspeech.2023-1309

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

This paper introduces the task of image-driven audio-visual universal source separation (ID-USS) and proposes a model for it, ID-USS-Conformer. ID-USS aims to separate a target source from a mixture, given an input image that is consistent with the target. Importantly, ID-USS focuses only on the sound made by the target shown in the image, not on a description of the target or on the semantic information of the picture. ID-USS-Conformer consists mainly of an EfficientNet-b3-based visual branch and a Conformer-based audio branch. The visual branch extracts the visual cue of the target from the input image. The audio branch fuses these visual features, and ID-USS-Conformer then separates the target source from the mixture. We present an ID-USS dataset and verify the effectiveness of ID-USS-Conformer on it. ID-USS-Conformer achieves a signal-to-distortion ratio improvement of 10.139 dB on the test set and outperforms the compared methods.
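
The abstract reports its result as a signal-to-distortion ratio (SDR) improvement. As a minimal sketch of what that metric measures, the Python below computes a plain energy-ratio SDR and its improvement over the unprocessed mixture. The abstract does not say which SDR variant the authors use (for example a scale-invariant or BSS-Eval definition), so the formula, the function names `sdr_db` and `sdr_improvement_db`, and the toy signals are all assumptions made for illustration.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the simple energy-ratio definition, not
    necessarily the variant used in the paper.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical, not from the paper's dataset).
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interference)]
# A hypothetical separator that leaves 10% of the interference amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interference)]

sdri = sdr_improvement_db(target, mixture, estimate)
print(f"SDR improvement: {sdri:.3f} dB")
```

In this toy case the residual interference amplitude drops by a factor of ten, which corresponds to an improvement of about 20 dB; the 10.139 dB figure in the abstract is an average over the authors' test set and cannot be reproduced from this sketch.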

Pages: 3729-3733

Number of pages: 5
