TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition

Cited by: 7
Authors
Li, Wenjie [1]
Zhang, Pengyuan [1]
Yan, Yonghong [1,2]
Affiliations
[1] Univ Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Chinese Acad Sci, Xinjiang Lab Minor Speech & Language Informat Pro, Xinjiang Tech Inst Phys & Chem, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
SEPARATION;
DOI
10.1049/el.2019.1228
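Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject classification codes
0808 ; 0809 ;
Abstract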
It is challenging to perform automatic speech recognition when multiple people talk simultaneously. To solve this problem, speaker-aware selective methods have been proposed to extract the speech of the target speaker, relying on the auxiliary speaker characteristics provided by an anchor (a clean audio sample of the target speaker). However, the extraction performance depends on the duration and quality of the anchors, which makes it unstable. To address this limitation, the authors propose a target speaker extraction network (TEnet), which applies a robust speaker embedding to extract the target speech from the speech mixture. To obtain more stable speaker characteristics during training, the robust speaker embeddings are accumulated over all the speech of each target speaker, rather than utilising the embedding produced by a single anchor. At test time, very few anchors are enough to achieve decent extraction performance. Results show that the TEnet trained with accumulated embeddings achieves better performance and robustness than the single-anchored TEnet. Moreover, to exploit the potential of the speaker embedding, the authors propose to feed the extracted target speech back as the anchor and train a feedback TEnet, whose results are superior to the short-anchored baseline by 22.5% on word error rate and 15.5% on signal-to-distortion ratio.
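The idea of accumulating a speaker embedding over all of a speaker's utterances, rather than relying on a single anchor, can be illustrated with a minimal sketch. The abstract does not state how the accumulation is computed, so the averaging and L2 normalisation below are assumptions, and the function name `accumulate_speaker_embeddings` and the toy two-dimensional embeddings are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def accumulate_speaker_embeddings(utterance_embeddings):
    """Combine per-utterance embeddings into one embedding per speaker.

    ASSUMPTION: accumulation is modelled as a plain mean followed by
    L2 normalisation; the source abstract does not specify the rule.

    utterance_embeddings: iterable of (speaker_id, list_of_floats).
    Returns a dict mapping speaker_id to the accumulated embedding.
    """
    sums = {}
    counts = defaultdict(int)

    for speaker_id, embedding in utterance_embeddings:
        if speaker_id not in sums:
            sums[speaker_id] = [0.0] * len(embedding)
        if len(embedding) != len(sums[speaker_id]):
            raise ValueError("all embeddings of a speaker must share one size")
        # Running element-wise sum over every utterance of this speaker.
        for i, value in enumerate(embedding):
            sums[speaker_id][i] += value
        counts[speaker_id] += 1

    accumulated = {}
    for speaker_id, total in sums.items():
        mean = [value / counts[speaker_id] for value in total]
        norm = sqrt(sum(value * value for value in mean))
        # Leave an all-zero mean untouched to avoid dividing by zero.
        accumulated[speaker_id] = [v / norm for v in mean] if norm > 0 else mean
    return accumulated


# Hypothetical toy data: two utterances for speaker "a", one for "b".
example = [("a", [1.0, 0.0]), ("a", [0.0, 1.0]), ("b", [3.0, 4.0])]
result = accumulate_speaker_embeddings(example)
```

In this sketch, a speaker with many utterances gets an embedding that no single short or noisy anchor dominates, which is the stability property the abstract attributes to accumulation.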
Pages: 816-818
Number of pages: 3
Related papers
50 in total
  • [31] Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition
    Fahad, Md Shah
    Ranjan, Ashish
    Deepak, Akshay
    Pradhan, Gayadhar
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2022, 41 (11) : 6113 - 6135
  • [32] Speaker Adversarial Neural Network (SANN) for Speaker-independent Speech Emotion Recognition
    Md Shah Fahad
    Ashish Ranjan
    Akshay Deepak
    Gayadhar Pradhan
    Circuits, Systems, and Signal Processing, 2022, 41 : 6113 - 6135
  • [33] Analysis of Compressed Speech Signals in an Automatic Speaker Recognition System
    Metzger, Richard A.
    Doherty, John F.
    Jenkins, David M.
    2015 49TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2015,
  • [34] SPEAKER-ADAPTABLE CLASSIFICATION PROCEDURE FOR AUTOMATIC SPEECH RECOGNITION
    KATTERFELDT, H
    THON, W
    NACHRICHTENTECHNISCHE ZEITSCHRIFT, 1974, 27 (06): : 230 - 232
  • [35] DYNAMIC FREQUENCY WARPING FOR SPEAKER ADAPTATION IN AUTOMATIC SPEECH RECOGNITION
    PALIWAL, KK
    AINSWORTH, WA
    JOURNAL OF PHONETICS, 1985, 13 (02) : 123 - 134
  • [36] Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection
    Moriya, Takafumi
    Sato, Hiroshi
    Ochiai, Tsubasa
    Delcroix, Marc
    Shinozaki, Takahiro
    IEEE ACCESS, 2023, 11 : 13906 - 13917
  • [37] Automatic speaker recognition using dynamic Bayesian network
    Sang, LF
    Wu, ZH
    Yang, YC
    Zhang, WF
    2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING I, 2003, : 188 - 191
  • [38] Automatic speaker recognition using dynamic Bayesian network
    Sang, LF
    Wu, ZH
    Yang, YC
    Zhang, WF
    2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL III, PROCEEDINGS, 2003, : 613 - 616
  • [39] AN INTRODUCTION TO SPEECH AND SPEAKER RECOGNITION
    PEACOCKE, RD
    GRAF, DH
    COMPUTER, 1990, 23 (08) : 26 - 33
  • [40] An Electroglottograph Auxiliary Neural Network for Target Speaker Extraction
    Chen, Lijiang
    Mo, Zhendong
    Ren, Jie
    Cui, Chunfeng
    Zhao, Qi
    APPLIED SCIENCES-BASEL, 2023, 13 (01):