Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Cited: 0
Authors
Zeng, Chang [1 ,2 ]
Miao, Xiaoxiao [3 ]
Wang, Xin [1 ]
Cooper, Erica [1 ]
Yamagishi, Junichi [1 ,2 ]
Affiliations
[1] National Institute of Informatics, Chiyoda-ku, Tokyo 101-8340, Japan
[2] SOKENDAI, Hayama, Kanagawa 240-0193, Japan
[3] Singapore Institute of Technology, Singapore
Keywords
Automatic speaker verification; Deep learning; Neural network; Attention; Data augmentation; Softmax; PLDA
DOI
10.1016/j.csl.2024.101619
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings, and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural-network-based neural PLDA (NPLDA), for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving its best performance. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of them have not fully investigated how to model the intra-relationship among multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification, designed especially for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is utilized to balance the importance of positive and negative samples within a mini-batch and to focus on difficult samples during training. We also utilize several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIRs datasets and a unique speaker-embedding-level mixup strategy for better optimization.
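Two of the training techniques the abstract names, focal loss and speaker-embedding-level mixup, can be sketched in a few lines of NumPy. This is a minimal illustration of the general techniques, not the paper's implementation: the function names, the alpha/gamma defaults, and the Beta(0.2, 0.2) mixing distribution are illustrative assumptions.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on verification scores.

    p: predicted probability that a trial is a target (same-speaker) pair.
    y: ground-truth label (1 = target, 0 = non-target).
    The (1 - p_t)**gamma factor down-weights easy trials so training
    focuses on hard ones; alpha rebalances targets vs. non-targets.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)          # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)        # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

def embedding_mixup(emb_a, emb_b, label_a, label_b, beta=0.2, rng=None):
    """Mix two speaker embeddings and their one-hot labels.

    Draws a weight lam ~ Beta(beta, beta) and forms convex combinations
    of both the embeddings and the labels (mixup applied at the
    embedding level rather than on raw audio).
    """
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(beta, beta)
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label
```

With gamma = 2, a confidently correct trial (p = 0.9, target) contributes orders of magnitude less loss than a badly wrong one (p = 0.1, target), which is the balancing effect the abstract describes.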
Pages: 19
Related papers (50 in total)
  • [1] Zeng C., Wang X., Cooper E., Miao X., Yamagishi J. Attention back-end for automatic speaker verification with multiple enrollment utterances. IEEE ICASSP 2022: 6717-6721.
  • [2] Ramoji S., Krishnan P., Ganapathy S. Neural PLDA modeling for end-to-end speaker verification. INTERSPEECH 2020: 4333-4337.
  • [3] Snyder D., Ghahremani P., Povey D., Garcia-Romero D., Carmiel Y., Khudanpur S. Deep neural network-based speaker embeddings for end-to-end speaker verification. IEEE SLT 2016: 165-170.
  • [4] Demirbag S., Erden M., Arslan L. End-to-end phonetic neural network approach for speaker verification. 28th Signal Processing and Communications Applications Conference (SIU), 2020.
  • [5] Zhang C., Koishida K. End-to-end text-independent speaker verification with triplet loss on short utterances. INTERSPEECH 2017: 1487-1491.
  • [6] Wan L., Wang Q., Papir A., Moreno I. L. Generalized end-to-end loss for speaker verification. IEEE ICASSP 2018: 4879-4883.
  • [7] Liu G., Hasan T., Boril H., Hansen J. H. L. An investigation on back-end for speaker recognition in multi-session enrollment. IEEE ICASSP 2013: 7755-7759.
  • [8] Zhang C., Shi J., Weng C., Yu M., Yu D. Towards end-to-end speaker diarization with generalized neural speaker clustering. IEEE ICASSP 2022: 8372-8376.
  • [9] Peng J., Qu X., Gu R., Wang J., Xiao J., Burget L., Cernocky J. "Honza". Effective phase encoding for end-to-end speaker verification. INTERSPEECH 2021: 2366-2370.
  • [10] Heigold G., Moreno I., Bengio S., Shazeer N. End-to-end text-dependent speaker verification. IEEE ICASSP 2016: 5115-5119.