CNN WITH PHONETIC ATTENTION FOR TEXT-INDEPENDENT SPEAKER VERIFICATION

Cited by: 0
Authors
Zhou, Tianyan [1]
Zhao, Yong [1]
Li, Jinyu [1]
Gong, Yifan [1]
Wu, Jian [1]
Affiliation
[1] Microsoft Corp, Ft Collins, CO 80525 USA
Keywords
speaker verification; attentive pooling; phonetic information
DOI
10.1109/asru46091.2019.9003826
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Text-independent speaker verification imposes no constraints on the spoken content and usually needs long observations to make reliable predictions. In this paper, we propose two speaker embedding approaches that integrate phonetic information into an attention-based residual convolutional neural network (CNN). Phonetic features are extracted from the bottleneck layer of a pretrained acoustic model. In implicit phonetic attention (IPA), the phonetic features are projected by a transformation network into multi-channel feature maps, which are then combined with the raw acoustic features to form the input of the CNN. In explicit phonetic attention (EPA), the phonetic features are fed directly to the attentive pooling layer through a separate one-dimensional (1-D) CNN that generates the attention weights. By incorporating the spoken content and an attention mechanism, the system can not only distill the speaker-discriminant frames but also actively normalize phonetic variations. Multi-head attention and discriminative objectives are further studied to improve the system. Experiments on the VoxCeleb corpus show that the proposed system can outperform the state of the art by around 43% relative.
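The two ways of injecting phonetic information described above can be illustrated with a minimal plain-Python sketch. This is a hypothetical illustration under stated assumptions, not the authors' implementation: the IPA transformation network is assumed to be a single linear projection (`proj`), the EPA 1-D CNN is assumed to be one convolution layer with "same" zero padding that emits one score per frame, attention weights are assumed to be a softmax over time, and multi-head attention is assumed to mean independent kernels whose pooled outputs are concatenated. The residual CNN itself is omitted; `acoustic` stands in for its frame-level outputs. All function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def ipa_input(fbank: Matrix, phonetic: Matrix, proj: Matrix, channels: int) -> List[Matrix]:
    """IPA sketch: build (1 + channels) x T x F input maps for the CNN.

    fbank: T x F raw acoustic features; phonetic: T x P bottleneck features.
    proj: (channels * F) x P weights, an ASSUMED single linear layer standing
    in for the paper's transformation network.
    """
    T, F = len(fbank), len(fbank[0])
    maps = [[row[:] for row in fbank]]  # channel 0: raw acoustic features
    # Project every frame's phonetic vector to channels * F values.
    projected = [[sum(w * x for w, x in zip(row, frame)) for row in proj]
                 for frame in phonetic]
    for c in range(channels):
        # Channel c + 1 takes the c-th slice of F values from each frame.
        maps.append([projected[t][c * F:(c + 1) * F] for t in range(T)])
    return maps


def conv1d_scores(phonetic: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """ASSUMED 1-D convolution over time: one attention score per frame.

    kernel: K taps (K odd), each a list of P weights; zero "same" padding.
    """
    T, K = len(phonetic), len(kernel)
    half = K // 2
    scores = []
    for t in range(T):
        s = bias
        for k in range(K):
            tau = t + k - half  # input frame covered by tap k
            if 0 <= tau < T:
                s += sum(w * x for w, x in zip(kernel[k], phonetic[tau]))
        scores.append(s)
    return scores


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def epa_pool(acoustic: Matrix, phonetic: Matrix, head_kernels: List[Matrix]) -> List[float]:
    """EPA sketch: phonetic features alone decide the frame weights.

    acoustic: T x D frame-level CNN outputs. Each head has its own kernel;
    each head's weighted mean over time is concatenated into the embedding.
    """
    D = len(acoustic[0])
    embedding: List[float] = []
    for kernel in head_kernels:
        weights = softmax(conv1d_scores(phonetic, kernel))
        embedding.extend(sum(w * frame[d] for w, frame in zip(weights, acoustic))
                         for d in range(D))
    return embedding
```

With an all-zero kernel every frame receives weight 1/T, so a head reduces to plain average pooling; a kernel that responds strongly to certain phonetic patterns shifts the weight toward those frames. In this sketch the design difference is visible in where the phonetic stream enters: IPA only adds input channels and leaves the CNN to learn how to use them, while EPA routes the phonetic features straight to the pooling weights.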
Pages: 718-725
Page count: 8