TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition

Cited by: 7
Authors
Li, Wenjie [1]
Zhang, Pengyuan [1]
Yan, Yonghong [1, 2]
Affiliations
[1] Univ Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Chinese Acad Sci, Xinjiang Lab Minor Speech & Language Informat Pro, Xinjiang Tech Inst Phys & Chem, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
SEPARATION;
DOI
10.1049/el.2019.1228
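Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809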
Abstract
It is challenging to perform automatic speech recognition when multiple people talk simultaneously. To solve this problem, speaker-aware selective methods have been proposed to extract the speech of the target speaker, relying on the auxiliary speaker characteristics provided by an anchor (a clean audio sample of the target speaker). However, the extraction performance depends on the duration and quality of the anchors, which makes it unstable. To address this limitation, the authors propose a target speaker extraction network (TEnet), which applies a robust speaker embedding to extract the target speech from the speech mixture. To obtain more stable speaker characteristics during training, the robust speaker embeddings are accumulated over all the speech of each target speaker, rather than taken from the embedding produced by a single anchor. At test time, very few anchors are enough to achieve decent extraction performance. Results show that the TEnet trained with accumulated embeddings achieves better performance and robustness than the single-anchored TEnet. Moreover, to exploit the potential of the speaker embedding, the authors propose feeding the extracted target speech back as the anchor and training a feedback TEnet, whose results improve on the short-anchored baseline by 22.5% in word error rate and 15.5% in signal-to-distortion ratio.
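The idea of accumulating speaker embeddings over all of a speaker's speech, instead of relying on a single anchor, can be illustrated with a minimal sketch. The abstract does not specify how the accumulation is computed, so the per-speaker mean used here is an assumption, and the function name `accumulate_embeddings`, the speaker labels, and the toy two-dimensional vectors are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accumulate_embeddings(
    utterances: Iterable[Tuple[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average per-utterance embeddings into one embedding per speaker.

    Assumption: "accumulated" is modelled as a plain mean over all
    utterances of a speaker; the paper may use a different rule.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for speaker, embedding in utterances:
        if speaker not in sums:
            sums[speaker] = [0.0] * len(embedding)
        # Element-wise running sum of this speaker's embeddings.
        sums[speaker] = [s + e for s, e in zip(sums[speaker], embedding)]
        counts[speaker] += 1
    # Divide each running sum by the number of utterances seen.
    return {
        speaker: [s / counts[speaker] for s in total]
        for speaker, total in sums.items()
    }


# Hypothetical toy data: (speaker id, embedding of one anchor utterance).
anchors = [
    ("spk_a", [1.0, 0.0]),
    ("spk_a", [0.0, 1.0]),
    ("spk_b", [2.0, 2.0]),
]
accumulated = accumulate_embeddings(anchors)
```

With more utterances per speaker, the averaged vector varies less from one anchor to the next, which is the stability property the abstract attributes to accumulated embeddings.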
Pages: 816-818
Number of pages: 3