Bayesian Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Cited by: 6
Authors
Zhu, Yingke [1]
Mak, Brian [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Speaker verification; deep neural network; self-attention; speaker embedding; x-vectors
DOI
10.1109/TASLP.2023.3244502
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Learning effective and discriminative speaker embeddings is a crucial task in speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over all the spoken frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. In our previous work, we relaxed this assumption and computed the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, with the weights determined automatically by a self-attention mechanism. The effect of multiple attention heads has also been investigated to capture different aspects of a speaker's input speech. One challenge for multi-head attention is information redundancy: if there is no constraint during the training of multi-head attention, different heads may extract similar attentive features. In this paper, we generalize the deterministic multi-head attention to a Bayesian attention framework, and provide a new understanding of multi-head attention from a Bayesian perspective. Under the Bayesian framework, we adopt a recently developed sampling method in optimization, which explicitly enforces repulsiveness among the multiple heads. Systematic evaluation of the proposed Bayesian self-attentive speaker embeddings is performed on the VoxCeleb and SITW evaluation sets. Significant and consistent improvements over other multi-head attention systems are achieved on all the evaluation datasets. The best Bayesian system with eight heads improves the EER by around 26% on VoxCeleb and 9% on SITW over the single-head baseline.
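The contrast the abstract draws between plain averaging and attention-weighted averaging of frame-level hidden vectors can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the dot-product scoring function, the `query` parameter vector, and the function names `attentive_pooling` and `mean_pooling` are assumptions made for illustration, and the Bayesian multi-head extension is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pooling(frames):
    """Baseline: every frame-level hidden vector counts equally."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def attentive_pooling(frames, query):
    """Weighted average of frame vectors with softmax attention weights.

    `query` is a hypothetical learned parameter vector; scoring each frame
    by a dot product with it is an assumption, not the paper's exact form.
    """
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return weights, pooled


# Three toy frames of dimension 2 (made-up values, not real speech features).
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# An all-zero query gives uniform weights, so the result equals mean pooling.
uniform_weights, uniform_pooled = attentive_pooling(frames, [0.0, 0.0])

# A non-zero query concentrates weight on the frames that score highest.
focused_weights, focused_pooled = attentive_pooling(frames, [5.0, 0.0])
```

With an all-zero query the attention weights are uniform and the pooled vector coincides with the plain average, which is the equal-importance assumption the abstract describes relaxing.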
Pages: 1000-1012
Number of pages: 13