Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Cited by: 14
Authors
Xu, Chenglin [1]
Rao, Wei [2]
Wu, Jibin [1]
Li, Haizhou [1]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[2] Tencent Ethereal Audio Lab, Shenzhen 518057, Peoples R China
Funding
National Research Foundation of Singapore
Keywords
Training; Decoding; Convolution; Speech enhancement; Voice activity detection; Time-domain analysis; Task analysis; Target speaker verification; speaker extraction; single- and multi-talker speaker verification; RECOGNITION; DIARIZATION; CHANNEL; SEPARATION;
DOI
10.1109/TASLP.2021.3100682
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speaker verification has mostly been studied under the single-talker condition and is adversely affected by interfering speakers. Inspired by studies on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four schemes significantly outperform other competitive solutions on multi-talker speech. Notably, the best tSV embedding scheme achieves 76.0% and 55.3% relative improvements in equal error rate over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora for 2-talker speech, while the performance of tSV on single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.
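The abstract describes a two-module pipeline trained under a joint multi-task objective: a speaker attention (extraction) module that isolates the target speaker's signal given an enrollment cue, followed by a speaker representation module that produces the verification embedding. Below is a minimal PyTorch sketch of that coupling, not the paper's implementation: the SpEx-style extraction network is reduced to a toy masking layer, and the feature shapes, loss terms, and the 0.5 loss weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerAttention(nn.Module):
    """Toy stand-in for the speaker attention (extraction) module:
    masks mixture features conditioned on a reference speaker embedding."""
    def __init__(self, feat_dim=257, emb_dim=128):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, feat_dim),
            nn.Sigmoid(),  # per-bin mask in [0, 1]
        )

    def forward(self, mix, ref_emb):
        # mix: (B, T, F); ref_emb: (B, E), broadcast over time
        e = ref_emb.unsqueeze(1).expand(-1, mix.size(1), -1)
        return mix * self.mask(torch.cat([mix, e], dim=-1))

class SpeakerRepresentation(nn.Module):
    """Toy speaker embedding network with a classification head
    used only for the multi-task training objective."""
    def __init__(self, feat_dim=257, emb_dim=128, n_spk=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.head = nn.Linear(emb_dim, n_spk)

    def forward(self, feats):
        _, h = self.encoder(feats)   # h: (1, B, E)
        emb = h.squeeze(0)           # utterance-level speaker embedding
        return emb, self.head(emb)   # embedding + speaker logits

# Joint multi-task training step: reconstruct the target speaker's
# features and classify the speaker from the extracted signal.
attention, representation = SpeakerAttention(), SpeakerRepresentation()
mix = torch.randn(4, 100, 257)       # mixture features (fake data)
tgt = torch.randn(4, 100, 257)       # clean target features (fake data)
ref = torch.randn(4, 128)            # enrollment embedding (fake data)
spk = torch.randint(0, 1000, (4,))   # speaker labels (fake data)

est = attention(mix, ref)
emb, logits = representation(est)
loss = F.mse_loss(est, tgt) + 0.5 * F.cross_entropy(logits, spk)
loss.backward()
```

At verification time only `emb` would be used, scored against an enrollment embedding (e.g., by cosine similarity); the classification head exists solely to supply the auxiliary training signal in this sketch.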
Pages: 2696-2709
Number of pages: 14
Related Papers
50 records in total
  • [1] Multi-Channel Speaker Verification for Single and Multi-talker Speech
    Kataria, Saurabh
    Zhang, Shi-Xiong
    Yu, Dong
    INTERSPEECH 2021, 2021, pp. 4608-4612
  • [2] Target Speaker Extraction for Multi-Talker Speaker Verification
    Rao, Wei
    Xu, Chenglin
    Chng, Eng Siong
    Li, Haizhou
    INTERSPEECH 2019, 2019, pp. 1273-1277
  • [3] Selective cortical representation of attended speaker in multi-talker speech perception
    Mesgarani, Nima
    Chang, Edward F.
    NATURE, 2012, 485(7397), pp. 233-236
  • [4] Speech prosody supports speaker selection and auditory stream segregation in a multi-talker situation
    Kovacs, Petra
    Toth, Brigitta
    Honbolygo, Ferenc
    Szalardy, Orsolya
    Kohari, Anna
    Mady, Katalin
    Magyari, Lilla
    Winkler, Istvan
    BRAIN RESEARCH, 2023, 1805
  • [5] Streaming Multi-talker Speech Recognition with Joint Speaker Identification
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    INTERSPEECH 2021, 2021, pp. 1782-1786
  • [6] Auditory masking of speech in reverberant multi-talker environments
    Weller, Tobias
    Buchholz, Joerg M.
    Best, Virginia
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2016, 139(3), pp. 1303-1313
  • [7] The effects of speech processing units on auditory stream segregation and selective attention in a multi-talker (cocktail party) situation
    Toth, Brigitta
    Honbolygo, Ferenc
    Szalardy, Orsolya
    Orosz, Gabor
    Farkas, David
    Winkler, Istvan
    CORTEX, 2020, 130, pp. 387-400
  • [8] The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene
    Rimmele, Johanna M.
    Golumbic, Elana Zion
    Schroeger, Erich
    Poeppel, David
    CORTEX, 2015, 68, pp. 144-154
  • [9] Speaker Identification in Multi-Talker Overlapping Speech Using Neural Networks
    Tran, Van-Thuan
    Tsai, Wei-Ho
    IEEE ACCESS, 2020, 8, pp. 134868-134879