TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition

Cited by: 7
Authors
Li, Wenjie [1]
Zhang, Pengyuan [1]
Yan, Yonghong [1, 2]
Affiliations
[1] Univ Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Chinese Acad Sci, Xinjiang Lab Minor Speech & Language Informat Pro, Xinjiang Tech Inst Phys & Chem, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
SEPARATION
DOI
10.1049/el.2019.1228
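Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract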
It is challenging to perform automatic speech recognition when multiple people talk simultaneously. To solve this problem, speaker-aware selective methods have been proposed to extract the speech of the target speaker, relying on auxiliary speaker characteristics provided by an anchor (a clean audio sample of the target speaker). However, extraction performance depends on the duration and quality of the anchors, which makes it unstable. To address this limitation, the authors propose a target speaker extraction network (TEnet), which applies a robust speaker embedding to extract the target speech from the speech mixture. To obtain more stable speaker characteristics during training, the speaker embeddings are accumulated over all the speech of each target speaker, rather than using the embedding produced by a single anchor. At test time, very few anchors are enough to achieve decent extraction performance. Results show that the TEnet trained with accumulated embeddings achieves better performance and robustness than the single-anchored TEnet. Moreover, to exploit the potential of the speaker embedding, the authors propose feeding the extracted target speech back as the anchor to train a feedback TEnet, which outperforms the short-anchored baseline by 22.5% on word error rate and 15.5% on signal-to-distortion ratio.
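The idea of accumulating a speaker embedding over all of a speaker's utterances, instead of relying on a single anchor, can be illustrated with a minimal sketch. The abstract does not say how the accumulation is computed, so the per-speaker mean followed by L2 normalisation used here is an assumption, and the function name `accumulate_speaker_embeddings` and the toy data are hypothetical, not taken from the paper.

```python
import numpy as np


def accumulate_speaker_embeddings(utt_embeddings, speaker_ids):
    """Average utterance-level embeddings per speaker, then L2-normalise.

    Assumption: "accumulation" is modelled as a plain mean; the paper's
    actual rule may differ.
    """
    sums, counts = {}, {}
    for emb, spk in zip(utt_embeddings, speaker_ids):
        emb = np.asarray(emb, dtype=np.float64)
        sums[spk] = sums.get(spk, 0.0) + emb
        counts[spk] = counts.get(spk, 0) + 1

    accumulated = {}
    for spk, total in sums.items():
        mean = total / counts[spk]
        accumulated[spk] = mean / np.linalg.norm(mean)
    return accumulated


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data: 200 noisy "anchor" embeddings scattered around one true voice.
rng = np.random.default_rng(0)
true_voice = np.array([1.0, 0.0, 0.0, 0.0])
noisy_anchors = true_voice + 0.5 * rng.standard_normal((200, 4))

accumulated = accumulate_speaker_embeddings(noisy_anchors, ["spk_a"] * 200)["spk_a"]

# Compare one arbitrary anchor with the accumulated embedding.
single_anchor_similarity = cosine(noisy_anchors[0], true_voice)
accumulated_similarity = cosine(accumulated, true_voice)
```

In this toy setting, averaging cancels much of the per-anchor noise, so `accumulated_similarity` sits close to 1 while the similarity of any single anchor varies from draw to draw. This mirrors, in a simplified way, the stability argument made in the abstract.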
Pages: 816-818
Page count: 3