Target speaker recovery and recognition network with average x-vector and global training

Cited by: 2
Authors
Li, Wenjie [1, 3]
Zhang, Pengyuan [1, 3]
Yan, Yonghong [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Key Lab Speech Acoust & Content Understanding, Inst Acoust, Beijing, Peoples R China
[2] Chinese Acad Sci, Xinjiang Lab Minor Speech & Language Informat Pro, Xinjiang Tech Inst Phys & Chem, Urumqi, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
target speaker recovery; x-vector speaker embedding; speech recognition; global training;
DOI
10.21437/Interspeech.2019-1692
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification codes
100104 ; 100213 ;
Abstract
Multi-talker automatic speech recognition (ASR) is very challenging. Some speaker-aware selective methods have been proposed to recover the speech of the target speaker, relying on the auxiliary speaker information provided by an anchor (a clean audio sample of the target speaker). However, their performance is unstable because it depends on the quality of the provided anchors. To address this limitation, we propose to take advantage of average speaker embeddings to build the target speaker recovery network (TRnet). The TRnet takes the mixed speech and the stable average speaker embeddings as input and produces the time-frequency (TF) masks for the target speech. During training of the TRnet, we average the speaker embeddings over the whole training dataset for each speaker, instead of extracting them from a randomly picked anchor. At the testing stage, one or very few anchors are enough to get decent recovery results. The TRnet trained with average speaker embeddings shows 13% and 12.5% relative improvements in WER and SDR, respectively, compared with the short-anchor trained model. Moreover, to mitigate the mismatch between the TRnet and the acoustic model (AM), we adopted two strategies: fine-tuning the AM and training a global TRnet. Both bring considerable reductions in WER. The results show that the globally trained framework achieves superior performance.
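The per-speaker averaging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_speaker_embeddings`, the toy 3-dimensional vectors, and the speaker labels are hypothetical, and a real system would use x-vectors produced by a trained extractor rather than hand-written lists.

```python
def average_speaker_embeddings(utterance_embeddings):
    """Average per-utterance embeddings into one vector per speaker.

    utterance_embeddings: iterable of (speaker_id, embedding) pairs,
    where each embedding is a list of floats of the same length.
    Returns a dict mapping speaker_id to the mean embedding.
    """
    sums = {}
    counts = {}
    for speaker_id, embedding in utterance_embeddings:
        if speaker_id not in sums:
            sums[speaker_id] = [0.0] * len(embedding)
            counts[speaker_id] = 0
        if len(embedding) != len(sums[speaker_id]):
            raise ValueError("all embeddings must share one dimension")
        # Accumulate the element-wise sum for this speaker.
        for i, value in enumerate(embedding):
            sums[speaker_id][i] += value
        counts[speaker_id] += 1
    # Divide each accumulated sum by the number of utterances seen.
    return {
        speaker_id: [value / counts[speaker_id] for value in total]
        for speaker_id, total in sums.items()
    }


# Toy training set (hypothetical values): two utterances for one speaker,
# one utterance for another.
train_embeddings = [
    ("spk_a", [1.0, 0.0, 2.0]),
    ("spk_a", [3.0, 2.0, 0.0]),
    ("spk_b", [0.5, 0.5, 0.5]),
]
train_average = average_speaker_embeddings(train_embeddings)

# At test time, the same averaging can be applied to the one or few
# anchors available for the target speaker.
test_anchors = [("target", [2.0, 4.0, 6.0])]
test_average = average_speaker_embeddings(test_anchors)
```

Averaging over many utterances is what makes the embedding less sensitive to the quality of any single anchor; with a single test anchor the average reduces to that anchor's embedding.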
Pages: 3233-3237
Number of pages: 5