Image-driven Audio-visual Universal Source Separation

Authors
Li, Chenxing [1]
Bai, Ye [1]
Wang, Yang [1]
Deng, Feng [1]
Zhao, Yuanyuan [2]
Zhang, Zhuo [1]
Wang, Xiaorui [2]
Affiliations
[1] Kuaishou Technology Co., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Institute of Automation, Beijing, People's Republic of China
Source
Keywords
audio-visual source separation; universal source separation; image-driven target source separation
DOI
10.21437/Interspeech.2023-1309
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper introduces image-driven audio-visual universal source separation (ID-USS) and proposes a model for it, ID-USS-Conformer. ID-USS aims to separate a target source from a mixture, given an input image that is consistent with the target. Importantly, ID-USS focuses only on the sound made by the target in the image, not on a description of the target or the semantic information of the picture. ID-USS-Conformer consists mainly of an EfficientNet-b3-based visual branch and a Conformer-based audio branch. The visual branch extracts the visual cue of the target from the input image. After the audio branch fuses the visual features, ID-USS-Conformer separates the target source from the mixture. We release an ID-USS dataset and verify the effectiveness of ID-USS-Conformer on it. ID-USS-Conformer achieves a 10.139 dB signal-to-distortion ratio improvement on the test set and outperforms the compared methods.
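The abstract reports its result as a signal-to-distortion ratio (SDR) improvement. As a rough illustration of what such a figure measures, the sketch below computes a plain energy-ratio SDR for the unprocessed mixture and for a separated estimate, then takes the difference. It assumes the simple definition 10·log10(‖s‖² / ‖s − ŝ‖²); the record does not say which SDR variant the paper uses, so this may differ from the authors' evaluation. The function names `sdr_db` and `sdr_improvement_db` and the synthetic signals are hypothetical, chosen only for this example.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-10):
    """Plain SDR in dB: 10*log10(||s||^2 / ||s - s_hat||^2).

    Assumption: simple energy-ratio definition, not necessarily
    the variant used in the paper.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    distortion_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((signal_power + eps) / (distortion_power + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the separated estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Synthetic example: a 440 Hz tone (target) mixed with noise (interference).
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
target = np.sin(2.0 * np.pi * 440.0 * t)
interference = rng.standard_normal(sample_rate)

mixture = target + interference
# Pretend a separator scaled the interference amplitude down by 10x.
estimate = target + 0.1 * interference

sdri = sdr_improvement_db(target, mixture, estimate)
print(f"SDR improvement: {sdri:.3f} dB")
```

In this toy case the residual interference has one tenth of its original amplitude, so its energy drops by a factor of 100 and the improvement comes out at 20 dB. A real evaluation would average the metric over a test set of mixtures.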
Pages: 3729-3733
Page count: 5