Robust Speaker Extraction Network based on Iterative Refined Adaptation

Times cited: 5

Authors:

Deng, Chengyun [1]

Ma, Shiqian [1]

Sha, Yongtao [1]

Zhang, Yi [1]

Zhang, Hui [2]

Song, Hui [1]

Wang, Fei [1]

Affiliations:

[1] Didi Chuxing, Beijing, Peoples R China

[2] Baidu Inc, Beijing, Peoples R China

Source:

INTERSPEECH 2021 | 2021

Keywords:

speaker extraction; iterative refined adaptation; speaker embedding; robustness

DOI:

10.21437/Interspeech.2021-2250

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Subject classification code:

100104; 100213

Abstract:

Speaker extraction aims to extract the target speech signal from a multi-talker environment containing interfering speakers and surrounding noise, given a reference utterance from the target speaker. Most speaker extraction systems achieve satisfactory performance under the closed condition, but they suffer from performance degradation when the target speaker is unseen and/or the reference speech is mismatched. In this paper, we propose a novel strategy named Iterative Refined Adaptation (IRA) to improve the robustness and generalization capability of speaker extraction systems in these scenarios. Given an initial speaker embedding encoded by an auxiliary network, the extraction network obtains a latent representation of the target speaker, which is fed back to the auxiliary network to refine the speaker embedding; the refined embedding in turn provides more accurate guidance for the extraction network. Experiments on the WSJ0-2mix-extr and WHAM! datasets show that the network with IRA outperforms the compared approaches in terms of SI-SDRi and PESQ.
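The abstract describes IRA only at the level of a feedback loop: an auxiliary network produces an initial speaker embedding, the extraction network derives a latent representation of the target from the mixture, and that latent is used to refine the embedding before extracting again. The pure-Python toy below illustrates only that loop structure. Everything concrete in it is a hypothetical placeholder and not the paper's architecture: `auxiliary_encoder` (a mean of reference frames), `extraction_network` (a cosine-similarity soft mask raised to the 4th power), `refine_embedding` (linear interpolation with weight `alpha`), the 2-D "frames", and the iteration count `n_iters`.

```python
import math
from typing import List, Tuple

Vector = List[float]
Frames = List[Vector]


def normalize(v: Vector) -> Vector:
    """Return v scaled to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def auxiliary_encoder(reference: Frames) -> Vector:
    """HYPOTHETICAL auxiliary network: mean of the reference frames, L2-normalized."""
    dim = len(reference[0])
    mean = [sum(frame[i] for frame in reference) / len(reference) for i in range(dim)]
    return normalize(mean)


def extraction_network(mixture: Frames, embedding: Vector) -> Tuple[Frames, Vector]:
    """HYPOTHETICAL extraction network.

    Builds a per-frame soft mask from the cosine similarity between each
    mixture frame and the speaker embedding, applies it to get the estimate,
    and returns the mask-weighted mean frame as the latent target representation.
    """
    masks = []
    for frame in mixture:
        cos = sum(a * b for a, b in zip(normalize(frame), embedding))
        masks.append(max(0.0, cos) ** 4)  # sharpen: favor frames close to the embedding
    estimate = [[m * x for x in frame] for m, frame in zip(masks, mixture)]
    total = sum(masks) or 1.0
    latent = [
        sum(m * frame[i] for m, frame in zip(masks, mixture)) / total
        for i in range(len(embedding))
    ]
    return estimate, normalize(latent)


def refine_embedding(embedding: Vector, latent: Vector, alpha: float = 0.5) -> Vector:
    """HYPOTHETICAL refinement rule: move the embedding toward the latent, re-normalize."""
    return normalize([(1 - alpha) * e + alpha * z for e, z in zip(embedding, latent)])


def iterative_refined_adaptation(
    mixture: Frames, reference: Frames, n_iters: int = 3
) -> Tuple[Frames, Vector]:
    """Feedback loop: extract, refine the embedding from the latent, extract again."""
    embedding = auxiliary_encoder(reference)
    estimate, latent = extraction_network(mixture, embedding)
    for _ in range(n_iters):
        embedding = refine_embedding(embedding, latent)
        estimate, latent = extraction_network(mixture, embedding)
    return estimate, embedding


# Toy usage: target frames point along [1, 0], the interferer along [0, 1],
# and the reference is mismatched ([0.8, 0.6] instead of [1, 0]).
mixture = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reference = [[0.8, 0.6]]
estimate, refined = iterative_refined_adaptation(mixture, reference)
print([round(x, 3) for x in refined])
```

In this toy setting each pass pulls the embedding away from the mismatched reference and toward the direction of the dominant target frames, which in turn shrinks the mask on the interfering frame. That mirrors the intuition in the abstract (a refined embedding gives better guidance), but the real system uses learned networks whose details are not given here.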


Pages: 3530-3534

Number of pages: 5
