Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Cited by: 0
Authors
Zeng, Chang [1 ,2 ]
Miao, Xiaoxiao [3 ]
Wang, Xin [1 ]
Cooper, Erica [1 ]
Yamagishi, Junichi [1 ,2 ]
Affiliations
[1] National Institute of Informatics, Chiyoda-ku, Tokyo 101-8340, Japan
[2] SOKENDAI, Hayama, Kanagawa 240-0193, Japan
[3] Singapore Institute of Technology, Singapore
Source
Computer Speech and Language, 2024
Keywords
Automatic speaker verification; Deep learning; Neural network; Attention; Data augmentation; SOFTMAX; PLDA
DOI
10.1016/j.csl.2024.101619
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings, and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural-network-based neural PLDA (NPLDA), for similarity scoring. However, sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from reaching the best possible optimum. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of them have not fully investigated how to model the intra-relationship among multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification, designed especially for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is used to balance the importance of positive and negative samples within a mini-batch and to focus on difficult samples during training. We also employ several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIRs datasets and a unique speaker-embedding-level mixup strategy for better optimization.
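As an illustration of the utterance-level attention described in the abstract, the following PyTorch sketch pools K enrollment embeddings into a single speaker representation. The module name, hidden size, and scoring network are assumptions for illustration, not the authors' exact architecture.

import torch
import torch.nn as nn

class EnrollmentAttentionPool(nn.Module):
    # Hypothetical utterance-level attention: scores each of the K
    # enrollment embeddings and forms a weighted average, so that the
    # intra-relationship among enrollments shapes the speaker model.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, enroll):                              # enroll: (batch, K, dim)
        weights = torch.softmax(self.score(enroll), dim=1)  # (batch, K, 1)
        return (weights * enroll).sum(dim=1)                # (batch, dim)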
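The focal loss mentioned in the abstract follows Lin et al. (2017): the per-trial cross-entropy is scaled by (1 - p_t)^gamma so that easy trials contribute little and hard trials dominate the gradient. Below is a minimal binary-verification sketch; the gamma and alpha values are common defaults, not taken from the paper.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Unreduced binary cross-entropy per verification trial.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1.0 - targets) * (1.0 - p)              # prob. of the true class
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)  # class balancing
    # (1 - p_t)^gamma is near zero for well-classified (easy) trials.
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()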
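The speaker-embedding-level mixup can be read as a convex combination of two embeddings with a Beta-distributed coefficient, in the spirit of Zhang et al. (2018). The sketch below is a generic mixup under that assumption; the abstract does not specify how the authors mix the corresponding targets.

import torch

def embedding_mixup(emb_a, emb_b, alpha=0.2):
    # Sample the interpolation coefficient from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    # lam is returned so the corresponding targets can be mixed the same way.
    return mixed, lam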
Pages: 19