Maximum Gaussianality training for deep speaker vector normalization

Cited by: 2
Authors
Cai, Yunqi [1 ,2 ,3 ]
Li, Lantian [4 ]
Abel, Andrew [3 ,5 ]
Zhu, Xiaoyan [3 ]
Wang, Dong [2 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650504, Peoples R China
[2] BNRist Tsinghua Univ, Ctr Speech & Language Technol CSLT, Beijing 100084, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci, Beijing 100084, Peoples R China
[4] Artificial Intelligence Beijing Univ Posts & Telec, Beijing, Peoples R China
[5] Univ Strathclyde, Dept Comp & Informat Sci, Glasgow, Scotland
Keywords
Speaker embedding; Normalization flow; Gaussianality training; RECOGNITION
DOI
10.1016/j.patcog.2023.109977
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic Speaker Verification (ASV) is a critical task in pattern recognition and has been applied to various security-sensitive scenarios. The current state-of-the-art technique for ASV is based on deep embedding. However, a significant challenge with this approach is that the resulting deep speaker vectors tend to be irregularly distributed. To address this issue, this paper proposes a novel training method called Maximum Gaussianality (MG), which regulates the distribution of the speaker vectors. Compared to the conventional normalization approach based on maximum likelihood (ML), the new approach directly maximizes the Gaussianality of the latent codes. It can therefore normalize both the between-class and within-class distributions in a controlled and reliable way, and it eliminates the unbounded likelihood problem associated with the conventional ML approach. Experiments on several datasets demonstrate that MG-based normalization delivers much better performance than the baseline systems without normalization and outperforms discriminative normalization flow (DNF), an ML-based normalization method, particularly when the training data is limited. In theory, the MG criterion can be applied to any task in any research domain where Gaussian distributions are needed, making MG training a versatile tool.
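As a rough illustration of what "measuring Gaussianality" of latent codes can mean, the sketch below scores a sample by its squared skewness and squared excess kurtosis, both of which are zero for an exact Gaussian. This is an assumption made for illustration only: the abstract does not state the MG criterion, and the function name `non_gaussianity_score` and the synthetic data are hypothetical, not taken from the paper.

```python
import numpy as np


def non_gaussianity_score(x):
    """Hypothetical proxy score, NOT the paper's MG criterion.

    Sums, over dimensions, the squared skewness and squared excess
    kurtosis. Both moments are zero for an exact Gaussian, so a lower
    score suggests a more Gaussian-like sample.
    """
    x = np.asarray(x, dtype=float)
    # Standardize each dimension to zero mean and unit variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    skew = (z ** 3).mean(axis=0)
    excess_kurt = (z ** 4).mean(axis=0) - 3.0
    return float(np.sum(skew ** 2 + excess_kurt ** 2))


# Synthetic stand-ins for latent codes (assumed shapes and distributions).
rng = np.random.default_rng(0)
gaussian_codes = rng.normal(size=(20000, 4))
skewed_codes = rng.exponential(size=(20000, 4))

score_gaussian = non_gaussianity_score(gaussian_codes)
score_skewed = non_gaussianity_score(skewed_codes)
print(score_gaussian, score_skewed)
```

A moment-based score like this depends only on the shape of the distribution after standardization, so it cannot be improved by simply shrinking the variance of the codes. That loosely mirrors the abstract's contrast with likelihood-based objectives, although the paper's actual formulation may differ.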
Pages: 12
Related papers
50 in total
  • [31] Speaker independent acoustic modeling using speaker normalization
    Ishii, J
    Fukada, T
    PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 97 - 100
  • [32] SPEAKER ADAPTIVE TRAINING IN DEEP NEURAL NETWORKS USING SPEAKER DEPENDENT BOTTLENECK FEATURES
    Doddipatla, Rama
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5290 - 5294
  • [33] Discriminative training for speaker identification based on maximum model distance algorithm
    Hong, QY
    Kwong, S
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 25 - 28
  • [34] Maximum Likelihood i-vector Space Using PCA for Speaker Verification
    Lei, Zhenchun
    Yang, Yingchun
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2736 - 2739
  • [35] COMPARING MAXIMUM A POSTERIORI VECTOR QUANTIZATION AND GAUSSIAN MIXTURE MODELS IN SPEAKER VERIFICATION
    Kinnunen, Tomi
    Saastamoinen, Juhani
    Hautamaki, Ville
    Vinni, Mikko
    Franti, Pasi
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 4229 - 4232
  • [36] SPEAKER CLUSTER-BASED SPEAKER ADAPTIVE TRAINING FOR DEEP NEURAL NETWORK ACOUSTIC MODELING
    Chu, Wei
    Chen, Ruxin
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5295 - 5299
  • [37] IMPROVED SPEAKER INDEPENDENT LIP READING USING SPEAKER ADAPTIVE TRAINING AND DEEP NEURAL NETWORKS
    Almajai, Ibrahim
    Cox, Stephen
    Harvey, Richard
    Lan, Yuxuan
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 2722 - 2726
  • [38] Speaker verification score normalization using speaker model clusters
    Apsingekar, Vijendra Raj
    De Leon, Phillip L.
    SPEECH COMMUNICATION, 2011, 53 (01) : 110 - 118
  • [39] FULL-INFO TRAINING FOR DEEP SPEAKER FEATURE LEARNING
    Li, Lantian
    Tang, Zhiyuan
    Wang, Dong
    Zheng, Thomas Fang
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5369 - 5373
  • [40] Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training
    Liu, Sheng
    Li, Xiao
    Zhai, Yuexiang
    You, Chong
    Zhu, Zhihui
    Fernandez-Granda, Carlos
    Qu, Qing
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34