Application of Channel Attention for Speaker Recognition in the Wild

Times cited: 0
Authors
Chen, Zhi [1]
Wang, Lei [1]
Affiliations
[1] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Keywords
Speaker recognition; speaker verification; channel attention; NetVLAD; prototypical networks loss
DOI
10.1145/3469213.3470331
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The objective of this paper is to build a speaker recognition system that works "in the wild", that is, on utterances of varying length that contain irrelevant signals. The key elements in designing a deep neural network for this task are the type of backbone (frame-level) network, the temporal aggregation (utterance-level) method, and the loss function (optimisation). We propose an effective speaker recognition system based on a deep neural network, using SE-ResNet to extract frame-level speaker features and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features along the time axis. We also point out that the advantage of combining NetVLAD with the SE block is that both are based on channel attention. Additionally, we use the prototypical networks loss, which learns a metric space in which open-set classification can be performed by computing the distance to the prototype representation of each class, so that the training process is consistent with the test scenario. We also study the influence of utterance length on the network and conclude that longer utterances are beneficial for "in the wild" data. Furthermore, we present results suggesting that a model trained on an English dataset can be adapted to Mandarin speaker recognition; that is, the representations learned by our system transfer well across languages.
Pages: 5
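
The prototypical networks loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `prototypical_loss`, the array shapes, and the choice of squared Euclidean distance are assumptions made here for illustration, and the paper may use a different distance or episode layout. Each speaker's prototype is the mean of that speaker's support embeddings, and each query embedding is scored by a softmax over negative distances to all prototypes.

```python
import numpy as np


def prototypical_loss(support, query, query_labels):
    """Illustrative prototypical loss (hypothetical helper, not from the paper).

    support:      (n_speakers, n_support, dim) embeddings used to build prototypes
    query:        (n_query, dim) embeddings to classify
    query_labels: (n_query,) integer speaker index of each query
    Returns (mean cross-entropy loss, predicted speaker index per query).
    """
    # One prototype per speaker: the mean of that speaker's support embeddings.
    prototypes = support.mean(axis=1)                      # (n_speakers, dim)

    # Squared Euclidean distance from every query to every prototype
    # (assumed metric; a cosine-based score would be an alternative).
    diff = query[:, None, :] - prototypes[None, :, :]      # (n_query, n_speakers, dim)
    sq_dist = (diff ** 2).sum(axis=-1)                     # (n_query, n_speakers)

    # Softmax over negative distances, computed in a numerically stable way.
    logits = -sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the true speaker of each query.
    rows = np.arange(len(query_labels))
    loss = -log_probs[rows, query_labels].mean()

    # At test time the same rule applies: pick the nearest prototype.
    predictions = sq_dist.argmin(axis=1)
    return loss, predictions
```

Because prediction uses the same nearest-prototype rule as training, the sketch reflects the abstract's point that the training process matches the test scenario; in practice the embeddings would come from the trained network rather than from hand-written arrays.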