Image-driven Audio-visual Universal Source Separation

Cited by: 0
Authors
Li, Chenxing [1]
Bai, Ye [1]
Wang, Yang [1]
Deng, Feng [1]
Zhao, Yuanyuan [2]
Zhang, Zhuo [1]
Wang, Xiaorui [2]
Affiliations
[1] Kuaishou Technology Co., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Institute of Automation, Beijing, People's Republic of China
Source
Keywords
audio-visual source separation; universal source separation; image-driven target source separation
DOI
10.21437/Interspeech.2023-1309
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper introduces image-driven audio-visual universal source separation (ID-USS) and proposes a model for it, ID-USS-Conformer. ID-USS aims to separate a target source from a mixture, guided by an input image that is consistent with the target. Importantly, ID-USS focuses only on the sound made by the target in the image, not on a description of the target or the semantic information of the picture. ID-USS-Conformer consists mainly of an Efficient-b3-based visual branch and a Conformer-based audio branch. The visual branch extracts the visual cue of the target from the input image. After the audio branch fuses the visual features, ID-USS-Conformer separates the target source from the mixture. We release an ID-USS dataset and verify the effectiveness of ID-USS-Conformer on it. ID-USS-Conformer achieves a 10.139 dB signal-to-distortion ratio improvement on the test set and outperforms the compared methods.
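The headline metric in the abstract, signal-to-distortion ratio improvement (SDRi), is the SDR of the separated output minus the SDR of the unprocessed mixture, both measured against the clean target. The sketch below illustrates that arithmetic only. It assumes the plain energy-ratio definition of SDR; the abstract does not say which SDR variant the authors used (for example a BSS-eval or scale-invariant form), and the function names and toy signals are hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: simple energy-ratio SDR, not necessarily the variant
    used in the paper. `eps` guards against division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(s * s for s in reference)
    distortion_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (distortion_power + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (hypothetical): a target tone plus an interfering tone.
n_samples = 100
target = [math.sin(2 * math.pi * 5 * n / n_samples) for n in range(n_samples)]
interference = [0.5 * math.sin(2 * math.pi * 13 * n / n_samples) for n in range(n_samples)]
mixture = [t + i for t, i in zip(target, interference)]

# Pretend a separator removed 90% of the interference amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interference)]

sdri = sdr_improvement_db(target, mixture, estimate)
print(f"SDRi: {sdri:.2f} dB")
```

In this toy case the residual interference amplitude is one tenth of its original value, so the improvement comes out at about 20 dB; the paper's reported 10.139 dB is an average over its own test set and is not reproduced by this example.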
Pages: 3729-3733
Page count: 5