Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Cited by: 0
Authors
Zeng, Chang [1 ,2 ]
Miao, Xiaoxiao [3 ]
Wang, Xin [1 ]
Cooper, Erica [1 ]
Yamagishi, Junichi [1 ,2 ]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018340, Japan
[2] SOKENDAI Hayama, Kanagawa 2400193, Japan
[3] Singapore Inst Technol, Singapore, Singapore
Source
Computer Speech and Language
Keywords
Automatic speaker verification; Deep learning; Neural network; Attention; Data augmentation; SOFTMAX; PLDA
DOI
10.1016/j.csl.2024.101619
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings, and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural-network-based neural PLDA (NPLDA), for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from reaching the global optimum. Although some methods, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, have been proposed for jointly optimizing the two models, most of them have not fully investigated how to model the intra-relationship among multiple enrollment utterances. In this paper, we propose a new end-to-end joint method for speaker verification that is especially designed for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among the enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is used to balance the importance of positive and negative samples within a mini-batch and to focus on difficult samples during training. We also apply several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIRs datasets and a speaker-embedding-level mixup strategy, for better optimization.
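The abstract outlines three main ingredients: an utterance-level attention mechanism that aggregates multiple enrollment embeddings, a focal loss that re-weights easy and hard verification trials, and a speaker-embedding-level mixup augmentation. Below is a minimal, illustrative sketch of these ideas, not the authors' implementation: the module names, the 192-dimensional embeddings, the dot-product attention form, and the MLP scoring head are all assumptions made for this example, and the frame-level attention and TDNN front-end are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UtteranceLevelAttentionScorer(nn.Module):
    """Scores a test embedding against a set of enrollment embeddings (illustrative)."""

    def __init__(self, emb_dim: int = 192):
        super().__init__()
        self.query = nn.Linear(emb_dim, emb_dim)   # test embedding -> attention query
        self.key = nn.Linear(emb_dim, emb_dim)     # enrollment embeddings -> attention keys
        self.scorer = nn.Sequential(               # hypothetical neural back-end scoring head
            nn.Linear(2 * emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, 1)
        )

    def forward(self, enroll: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
        # enroll: (batch, n_enroll, emb_dim); test: (batch, emb_dim)
        q = self.query(test).unsqueeze(1)                        # (batch, 1, emb_dim)
        k = self.key(enroll)                                     # (batch, n_enroll, emb_dim)
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
        pooled = (attn.unsqueeze(-1) * enroll).sum(dim=1)        # attention-weighted enrollment centroid
        return self.scorer(torch.cat([pooled, test], dim=-1)).squeeze(-1)  # verification logit


def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: balances positive/negative trials and down-weights easy ones."""
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = torch.exp(-bce)                                        # probability assigned to the true class
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


def embedding_mixup(emb_a, emb_b, alpha=0.2):
    """Speaker-embedding-level mixup: convex combination of two embeddings (illustrative)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * emb_a + (1 - lam) * emb_b


# Toy usage with random tensors standing in for TDNN speaker embeddings.
model = UtteranceLevelAttentionScorer(emb_dim=192)
enroll = torch.randn(8, 3, 192)                 # 8 trials, 3 enrollment utterances each
test = torch.randn(8, 192)
labels = torch.randint(0, 2, (8,)).float()      # 1 = target trial, 0 = impostor trial
loss = focal_loss(model(enroll, test), labels)
loss.backward()
```

In this sketch, the attention weights act as a learned, test-conditioned weighting over the enrollment utterances, replacing the plain averaging commonly used when scoring multi-enrollment trials.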
Pages: 19
Related Papers
50 records in total
  • [21] Improved Relation Networks for End-to-End Speaker Verification and Identification
    Chaubey, Ashutosh
    Sinha, Sparsh
    Ghose, Susmita
    INTERSPEECH 2022, 2022: 5085-5089
  • [22] Joint Training of Expanded End-to-end DNN for Text-dependent Speaker Verification
    Heo, Hee-soo
    Jung, Jee-weon
    Yang, Il-ho
    Yoon, Sung-hyun
    Yu, Ha-jin
    INTERSPEECH 2017, 2017: 1532-1536
  • [23] Speaker Verification Using End-to-End Adversarial Language Adaptation
    Rohdin, Johan
    Stafylakis, Themos
    Silnova, Anna
    Zeinali, Hossein
    Burget, Lukas
    Plchot, Oldrich
    ICASSP 2019, 2019: 6006-6010
  • [24] End-to-end framework for spoof-aware speaker verification
    Kang, Woo Hyun
    Alam, Jahangir
    Fathan, Abderrahim
    INTERSPEECH 2022, 2022: 4362-4366
  • [25] Strategies for End-to-End Text-Independent Speaker Verification
    Lin, Weiwei
    Mak, Man-Wai
    Chien, Jen-Tzung
    INTERSPEECH 2020, 2020: 4308-4312
  • [26] Analysis of Length Normalization in End-to-End Speaker Verification System
    Cai, Weicheng
    Chen, Jinkun
    Li, Ming
    INTERSPEECH 2018, 2018: 3618-3622
  • [27] An End-to-End Text-Independent Speaker Identification System on Short Utterances
    Ji, Ruifang
    Cai, Xinyuan
    Xu, Bo
    INTERSPEECH 2018, 2018: 3628-3632
  • [28] Integrated Presentation Attack Detection and Automatic Speaker Verification: Common Features and Gaussian Back-end Fusion
    Todisco, Massimiliano
    Delgado, Hector
    Lee, Kong Aik
    Sahidullah, Md
    Evans, Nicholas
    Kinnunen, Tomi
    Yamagishi, Junichi
    INTERSPEECH 2018, 2018: 77-81
  • [29] Robust End-to-end Speaker Diarization with Generic Neural Clustering
    Yang, Chenyu
    Wang, Yu
    INTERSPEECH 2022, 2022: 1471-1475
  • [30] End-to-End Neural Speaker Diarization with Self-Attention
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    ASRU 2019, 2019: 296-303