VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Cited by: 120

Authors:

Wang, Quan ^{[1]}

Muckenhirn, Hannah ^{[2,3,4]}

Wilson, Kevin ^{[1]}

Sridhar, Prashant ^{[1]}

Wu, Zelin ^{[1]}

Hershey, John R. ^{[1]}

Saurous, Rif A. ^{[1]}

Weiss, Ron J. ^{[1]}

Jia, Ye ^{[1]}

Moreno, Ignacio Lopez ^{[1]}

Affiliations:

[1] Google Inc, Mountain View, CA 94043 USA

[2] Idiap Res Inst, Martigny, Switzerland

[3] Ecole Polytech Fed Lausanne, Lausanne, Switzerland

[4] Google, Mountain View, CA 94043 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

Source separation; speaker recognition; spectrogram masking; speech recognition;

DOI:

10.21437/Interspeech.2019-1101

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
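To make the masking idea in the abstract concrete, here is a minimal sketch of how a speaker-conditioned mask could be applied to a noisy magnitude spectrogram. It is an illustration only, not the authors' architecture: the single linear layer with a sigmoid, the choice to concatenate the embedding to every frame, the random untrained weights, and all function and variable names are assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def speaker_conditioned_mask(noisy_mag, speaker_embedding, weights, bias):
    """Toy stand-in for a masking network: one linear layer plus sigmoid.

    noisy_mag: T frames, each a list of F spectrogram magnitudes.
    speaker_embedding: D floats describing the target speaker.
    weights: (F + D) x F nested list; bias: F floats (both hypothetical).
    Returns a T x F mask with values in [0, 1].
    """
    mask = []
    for frame in noisy_mag:
        # Assumption: condition each frame by appending the speaker embedding.
        features = frame + speaker_embedding
        row = []
        for j in range(len(bias)):
            z = bias[j] + sum(x * weights[i][j] for i, x in enumerate(features))
            row.append(sigmoid(z))
        mask.append(row)
    return mask


def apply_mask(noisy_mag, mask):
    # Element-wise product of the mask and the noisy magnitudes.
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, noisy_mag)]


random.seed(0)
T, F, D = 4, 6, 3  # frames, frequency bins, embedding size (toy values)
noisy_mag = [[abs(random.gauss(0, 1)) for _ in range(F)] for _ in range(T)]
speaker_embedding = [random.gauss(0, 1) for _ in range(D)]
weights = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(F + D)]
bias = [0.0] * F

mask = speaker_conditioned_mask(noisy_mag, speaker_embedding, weights, bias)
enhanced = apply_mask(noisy_mag, mask)
```

Because the mask values lie between 0 and 1, the masked spectrogram can only attenuate each time-frequency bin. In a trained system the weights would be learned so that bins dominated by the target speaker are kept and the rest are suppressed.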

Citation

Pages: 2728 - 2732

Page count: 5

25 items in total

[1] Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring
Singla, Yaman Kumar
Gupta, Avyakt
Bagga, Shaurya
Chen, Changyou
Krishnamurthy, Balaji
Shah, Rajiv Ratn
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 1681 - 1691
[2] VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Wang, Quan
Moreno, Ignacio Lopez
Saglam, Mert
Wilson, Kevin
Chiao, Alan
Liu, Renjie
He, Yanzhang
Li, Wei
Pelecanos, Jason
Nika, Marily
Gruenstein, Alexander
[J]. INTERSPEECH 2020, 2020, : 2677 - 2681
[3] Singing voice separation with pre-learned dictionary and reconstructed voice spectrogram
Yang, Chenghong
Zhang, Hongjuan
[J]. NEURAL COMPUTING & APPLICATIONS, 2020, 32 (08) : 3311 - 3322
[4] Singing voice separation with pre-learned dictionary and reconstructed voice spectrogram
Chenghong Yang
Hongjuan Zhang
[J]. Neural Computing and Applications, 2020, 32 : 3311 - 3322
[5] COMPLEX RATIO MASKING FOR SINGING VOICE SEPARATION
Zhang, Yixuan
Liu, Yuzhou
Wang, DeLiang
[J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 41 - 45
[6] THE SOUND OF MY VOICE: SPEAKER REPRESENTATION LOSS FOR TARGET VOICE SEPARATION
Mun, Seongkyu
Choe, Soyeon
Huh, Jaesung
Chung, Joon Son
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7289 - 7293
[7] Using Visual Speech Information in Masking Methods for Audio Speaker Separation
Khan, Faheem Ullah
Milner, Ben P.
Le Cornu, Thomas
[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (10) : 1742 - 1754
[8] TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
Boeddeker, Christoph
Subramanian, Aswin Shanmugam
Wichern, Gordon
Haeb-Umbach, Reinhold
Le Roux, Jonathan
[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1185 - 1197
[9] Multi-band Masking for Waveform-based Singing Voice Separation
Papantonakis, Panagiotis
Garoufis, Christos
Maragos, Petros
[J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 249 - 253
[10] Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Pre-learned Dictionaries
Yu, Shiwei
Zhang, Hongjuan
Duan, Zhiyao
[J]. JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2017, 65 (05) : 377 - 388
