Application of Channel Attention for Speaker Recognition in the Wild

Authors
Chen, Zhi [1]
Wang, Lei [1]
Affiliation
[1] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Source
PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21) | 2021
Keywords
Speaker recognition; speaker verification; channel attention; NetVLAD; prototypical networks loss
DOI
10.1145/3469213.3470331
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The objective of this paper is to build a speaker recognition system "in the wild" (utterances of different lengths that contain irrelevant signals). The key elements in designing a deep neural network for this task are the type of backbone (frame-level) network, the temporal aggregation (utterance-level) method, and the loss function (optimisation). We propose an effective speaker recognition system based on a deep neural network, using SE-ResNet to extract frame-level speaker features and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features along the time axis. We also argue that the advantage of combining NetVLAD with the SE block comes from both being based on channel attention. Additionally, we use a prototypical networks loss, which learns a metric space in which open-set classification can be performed by computing the distance to the prototype representation of each class, so that the training process is consistent with the test scenario. We also study the influence of utterance length on the network and conclude that longer utterances are beneficial for "in the wild" data. Furthermore, we present results suggesting that a model trained on an English dataset can be adapted to Mandarin speaker recognition; that is, the representations learned by our system transfer well across languages.
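The prototypical-loss idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, embeddings are plain Python lists, and squared Euclidean distance is an assumption (the abstract does not say which distance the paper uses). Each speaker's prototype is the mean of that speaker's support embeddings, and a query is scored by a softmax over negative distances to all prototypes.

```python
import math

def prototype(embeddings):
    """Mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric, not confirmed by the source)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototypical_loss(support, queries):
    """Mean negative log-probability of the true speaker.

    support: dict mapping speaker id -> list of embeddings
    queries: list of (embedding, true speaker id) pairs
    """
    protos = {spk: prototype(embs) for spk, embs in support.items()}
    total = 0.0
    for emb, true_spk in queries:
        # Logits are negative distances: a closer prototype gets a higher score.
        logits = {spk: -sq_dist(emb, p) for spk, p in protos.items()}
        # Numerically stable log-sum-exp over all speakers.
        m = max(logits.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_z - logits[true_spk]
    return total / len(queries)

def nearest_speaker(support, emb):
    """Test-time decision: pick the speaker whose prototype is closest."""
    protos = {spk: prototype(embs) for spk, embs in support.items()}
    return min(protos, key=lambda spk: sq_dist(emb, protos[spk]))
```

Because training and testing both reduce to comparing a query against class prototypes, new speakers can be enrolled at test time by computing a prototype from a few utterances, with no retraining of a fixed classifier layer.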
Pages: 5