An Improved Deep Embedding Learning Method for Short Duration Speaker Verification

Cited by: 19
Authors
Gao, Zhifu [1]
Song, Yan [1]
McLoughlin, Ian [2]
Guo, Wu [1]
Dai, Lirong [1]
Affiliations
[1] University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, Anhui, People's Republic of China
[2] University of Kent, School of Computing, Medway, England
Funding
National Natural Science Foundation of China
Keywords
speaker verification; convolutional neural network; dilated convolution; cross-convolutional-layer pooling
DOI
10.21437/Interspeech.2018-1515
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for short-duration speaker verification (SV). Existing deep-learning-based SV methods generally extract frontend embeddings from a feed-forward deep neural network, in which long-term speaker characteristics are captured via a pooling operation over the input speech. The extracted embeddings are then scored by a backend model, such as Probabilistic Linear Discriminant Analysis (PLDA). Two improvements to frontend embedding learning are proposed, both based on the CNN structure: (1) motivated by WaveNet for speech synthesis, dilated filters are designed to achieve a tradeoff between computational efficiency and receptive-field size; and (2) a novel cross-convolutional-layer pooling method is exploited to capture 1st-order statistics for modelling long-term speaker characteristics. Specifically, the activations of one convolutional layer are aggregated under the guidance of the feature maps from the successive layer. To evaluate the effectiveness of the proposed methods, extensive experiments are conducted on the modified female portion of the NIST SRE 2010 evaluation, with conditions ranging from 10s-10s to 5s-4s. Excellent performance is achieved on every evaluation condition, significantly outperforming existing SV systems that use i-vector and d-vector embeddings.
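The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the stride-1 assumption, and the choice of a weighted sum over frames (following the general cross-layer pooling idea, where each feature map of the upper layer weights the activations of the lower layer) are assumptions made for illustration only.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in frames, of stacked 1-D convolutions.

    Assumes stride 1 in every layer (an assumption of this sketch).
    Each layer with kernel size k and dilation d adds (k - 1) * d frames.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


def cross_layer_pool(lower, upper):
    """Pool `lower` activations over time, weighted by `upper` feature maps.

    lower: list of T frames, each a list of D activations (one conv layer)
    upper: list of T frames, each a list of K activations (the next layer)
    Returns a flat list of K * D values; block k holds
    sum over t of upper[t][k] * lower[t].
    Hypothetical helper: both layers are assumed to cover the same T frames.
    """
    if len(lower) != len(upper):
        raise ValueError("both layers must cover the same number of frames")
    dim = len(lower[0])
    maps = len(upper[0])
    pooled = [0.0] * (maps * dim)
    for x, a in zip(lower, upper):
        for k in range(maps):
            for d in range(dim):
                pooled[k * dim + d] += a[k] * x[d]
    return pooled


# Three kernel-size-3 layers: dilation widens the context at equal cost.
plain = receptive_field([3, 3, 3], [1, 1, 1])
dilated = receptive_field([3, 3, 3], [1, 2, 4])

# Two frames, two lower-layer channels, two upper-layer feature maps.
embedding = cross_layer_pool([[1, 2], [3, 4]], [[1, 0], [0, 1]])
```

With the same number of weights, the dilated stack sees a wider span of frames than the plain stack, which is the efficiency versus receptive-field tradeoff the abstract refers to. The pooled vector has a fixed length (K * D) regardless of the number of frames T, which is what lets a variable-length utterance be mapped to a fixed-size embedding.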
Pages: 3578-3582
Page count: 5
Related papers
50 in total
  • [1] Deep Speaker Embeddings for Short-Duration Speaker Verification
    Bhattacharya, Gautam
    Alam, Jahangir
    Kenny, Patrick
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 1517-1521
  • [2] Deep Segment Attentive Embedding for Duration Robust Speaker Verification
    Liu, Bin
    Nie, Shuai
    Liu, Wenju
    Zhang, Hui
    Li, Xiangang
    Li, Changliang
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019: 822-826
  • [3] Deep Speaker Embedding with Long Short Term Centroid Learning for Text-independent Speaker Verification
    Peng, Junyi
    Gu, Rongzhi
    Zou, Yuexian
    INTERSPEECH 2020, 2020: 3246-3250
  • [4] An Effective Deep Embedding Learning Architecture for Speaker Verification
    Jiang, Yiheng
    Song, Yan
    McLoughlin, Ian
    Gao, Zhifu
    Dai, Lirong
    INTERSPEECH 2019, 2019: 4040-4044
  • [5] Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification
    Wang, Shuai
    Huang, Zili
    Qian, Yanmin
    Yu, Kai
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27(11): 1686-1696
  • [6] A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
    Jung, Youngmoon
    Choi, Yeunju
    Lim, Hyungjun
    Kim, Hoirin
    IEEE ACCESS, 2020, 8: 175448-175466
  • [7] Triplet-Center Loss Based Deep Embedding Learning Method for Speaker Verification
    Jiang, Yiheng
    Song, Yan
    Yan, Jie
    Dai, Lirong
    McLoughlin, Ian
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019: 1625-1629
  • [8] Deep Embedding Learning for Text-Dependent Speaker Verification
    Zhang, Peng
    Hu, Peng
    Zhang, Xueliang
    INTERSPEECH 2020, 2020: 3461-3465
  • [9] Transfer Learning for Speaker Verification with Short-Duration Audio
    Fathima, Noor
    Simha, J. B.
    Abhi, Shinu
    SMART TRENDS IN COMPUTING AND COMMUNICATIONS, VOL 5, SMARTCOM 2024, 2024, 949: 195-205
  • [10] An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification
    Liu, Ying
    Song, Yan
    McLoughlin, Ian
    Liu, Lin
    Dai, Li-rong
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 6683-6687