VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Cited by: 120
Authors
Wang, Quan [1]
Muckenhirn, Hannah [2, 3, 4]
Wilson, Kevin [1]
Sridhar, Prashant [1]
Wu, Zelin [1]
Hershey, John R. [1]
Saurous, Rif A. [1]
Weiss, Ron J. [1]
Jia, Ye [1]
Moreno, Ignacio Lopez [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
[2] Idiap Res Inst, Martigny, Switzerland
[3] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[4] Google, Mountain View, CA 94043 USA
Source
Keywords
Source separation; speaker recognition; spectrogram masking; speech recognition
DOI
10.21437/Interspeech.2019-1101
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
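The masking step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function `toy_mask_network`, the single linear layer with random weights, and the dimensions used below are all hypothetical stand-ins chosen for illustration. The sketch only shows the data flow the abstract names: a speaker embedding is combined with the noisy spectrogram, a mask is produced, and the mask is applied to the spectrogram.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def toy_mask_network(noisy_mag, speaker_emb, weights, bias):
    """Hypothetical stand-in for a masking network: one linear layer + sigmoid.

    noisy_mag:   (num_frames, num_bins) magnitude spectrogram of the mixture
    speaker_emb: (emb_dim,) embedding of the target speaker
    Returns a mask of shape (num_frames, num_bins) with values in (0, 1).
    """
    num_frames = noisy_mag.shape[0]
    # Repeat the speaker embedding for every time frame and append it
    # to that frame's spectrogram features.
    tiled = np.tile(speaker_emb, (num_frames, 1))
    features = np.concatenate([noisy_mag, tiled], axis=1)
    return sigmoid(features @ weights + bias)


def apply_mask(noisy_mag, mask):
    """Elementwise masking: attenuate bins not attributed to the target."""
    return mask * noisy_mag


# Arbitrary illustrative sizes (assumptions, not values from the paper).
num_frames, num_bins, emb_dim = 50, 64, 16
rng = np.random.default_rng(0)

noisy_mag = np.abs(rng.standard_normal((num_frames, num_bins)))
speaker_emb = rng.standard_normal(emb_dim)
speaker_emb /= np.linalg.norm(speaker_emb)  # unit-length embedding

# Untrained random parameters; a real system would learn these.
weights = 0.01 * rng.standard_normal((num_bins + emb_dim, num_bins))
bias = np.zeros(num_bins)

mask = toy_mask_network(noisy_mag, speaker_emb, weights, bias)
enhanced_mag = apply_mask(noisy_mag, mask)
```

Because the mask values lie between 0 and 1, the masked spectrogram can only attenuate the input, never amplify it. In a trained system the embedding would come from the speaker recognition network and the mask parameters would be learned.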
Pages: 2728 - 2732
Number of pages: 5
Related papers
25 in total
  • [1] Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring
    Singla, Yaman Kumar
    Gupta, Avyakt
    Bagga, Shaurya
    Chen, Changyou
    Krishnamurthy, Balaji
    Shah, Rajiv Ratn
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 1681 - 1691
  • [2] VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
    Wang, Quan
    Moreno, Ignacio Lopez
    Saglam, Mert
    Wilson, Kevin
    Chiao, Alan
    Liu, Renjie
    He, Yanzhang
    Li, Wei
    Pelecanos, Jason
    Nika, Marily
    Gruenstein, Alexander
    [J]. INTERSPEECH 2020, 2020, : 2677 - 2681
  • [3] Singing voice separation with pre-learned dictionary and reconstructed voice spectrogram
    Yang, Chenghong
    Zhang, Hongjuan
    [J]. NEURAL COMPUTING & APPLICATIONS, 2020, 32 (08): : 3311 - 3322
  • [4] Singing voice separation with pre-learned dictionary and reconstructed voice spectrogram
    Chenghong Yang
    Hongjuan Zhang
    [J]. Neural Computing and Applications, 2020, 32 : 3311 - 3322
  • [5] COMPLEX RATIO MASKING FOR SINGING VOICE SEPARATION
    Zhang, Yixuan
    Liu, Yuzhou
    Wang, DeLiang
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 41 - 45
  • [6] THE SOUND OF MY VOICE: SPEAKER REPRESENTATION LOSS FOR TARGET VOICE SEPARATION
    Mun, Seongkyu
    Choe, Soyeon
    Huh, Jaesung
    Chung, Joon Son
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7289 - 7293
  • [7] Using Visual Speech Information in Masking Methods for Audio Speaker Separation
    Khan, Faheem Ullah
    Milner, Ben P.
    Le Cornu, Thomas
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (10) : 1742 - 1754
  • [8] TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
    Boeddeker, Christoph
    Subramanian, Aswin Shanmugam
    Wichern, Gordon
    Haeb-Umbach, Reinhold
    Le Roux, Jonathan
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1185 - 1197
  • [9] Multi-band Masking for Waveform-based Singing Voice Separation
    Papantonakis, Panagiotis
    Garoufis, Christos
    Maragos, Petros
    [J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 249 - 253
  • [10] Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Pre-learned Dictionaries
    Yu, Shiwei
    Zhang, Hongjuan
    Duan, Zhiyao
    [J]. JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2017, 65 (05): : 377 - 388