Application of Channel Attention for Speaker Recognition in the Wild

Cited by: 0

Authors:

Chen, Zhi [1]

Wang, Lei [1]

Affiliations:

[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China

Source:

PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21) | 2021

Keywords:

Speaker recognition; speaker verification; channel attention; NetVLAD; prototypical networks loss;

DOI:

10.1145/3469213.3470331

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The objective of this paper is to build a speaker recognition system that works "in the wild", that is, on utterances of varying length that contain irrelevant signals. The key design elements of a deep neural network for this task are the type of backbone (frame-level) network, the temporal aggregation (utterance-level) method, and the loss function (optimisation). We propose an effective speaker recognition system based on a deep neural network, using SE-ResNet to extract frame-level speaker features and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features along the time axis. We also point out that the advantage of combining NetVLAD with the SE block is that both are based on channel attention. Additionally, we use the prototypical networks loss, which learns a metric space in which open-set classification can be performed by computing the distance to the prototype representation of each class, so that the training process is consistent with the test scenario. We also study the influence of utterance length on the network and conclude that longer utterances are beneficial for "in the wild" data. Furthermore, we present results suggesting that a model trained on an English dataset can be adapted to Mandarin speaker recognition; that is, the representations learned by our systems transfer well across languages.
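The prototype-distance idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of squared Euclidean distance, and the toy 2-dimensional embeddings are all assumptions made for illustration, since the abstract does not state which distance or embedding size the paper uses.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors; used here as the class prototype."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_euclidean(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypical_loss(support, queries):
    """Mean negative log-probability of the true speaker.

    support: dict mapping a speaker id to a list of embedding vectors.
    queries: list of (embedding, true_speaker_id) pairs.
    Probabilities are a softmax over negative distances to each prototype.
    """
    prototypes = {spk: mean_vector(vecs) for spk, vecs in support.items()}
    total = 0.0
    for embedding, true_spk in queries:
        logits = {spk: -squared_euclidean(embedding, proto)
                  for spk, proto in prototypes.items()}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_norm - logits[true_spk]
    return total / len(queries)


# Hypothetical toy data: two speakers, two support embeddings each.
support = {"spk_a": [[0.0, 0.0], [0.0, 2.0]],
           "spk_b": [[4.0, 0.0], [4.0, 2.0]]}
near_a = prototypical_loss(support, [([0.0, 1.0], "spk_a")])
midway = prototypical_loss(support, [([2.0, 1.0], "spk_a")])
```

In this toy setup a query that coincides with its own speaker's prototype yields a loss close to zero, while a query equidistant from both prototypes yields a loss of log 2, which is the behaviour a metric-space objective of this kind is meant to produce.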


Pages: 5
