Self Attention Networks in Speaker Recognition

Cited by: 2
Authors
Safari, Pooyan [1]
India, Miquel [1]
Hernando, Javier [1]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona 08034, Spain
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 11
Keywords
speaker recognition; self-attention networks; transformer; speaker embeddings; SPEECH; REPRESENTATION;
DOI
10.3390/app13116410
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Recently, there has been a significant surge of interest in Self-Attention Networks (SANs) based on the Transformer architecture. This can be attributed to their notable ability for parallelization and their impressive performance across various Natural Language Processing applications. On the other hand, the utilization of large-scale, multi-purpose language models trained through self-supervision is progressively more prevalent, for tasks like speech recognition. In this context, the pre-trained model, which has been trained on extensive speech data, can be fine-tuned for particular downstream tasks like speaker verification. These massive models typically rely on SANs as their foundational architecture. Therefore, studying the potential capabilities and training challenges of such models is of utmost importance for the future generation of speaker verification systems. In this direction, we propose a speaker embedding extractor based on SANs to obtain a discriminative speaker representation given non-fixed length speech utterances. With the advancements suggested in this work, we could achieve up to 41% relative performance improvement in terms of EER compared to the naive SAN which was proposed in our previous work. Moreover, we empirically show the training instability in such architectures in terms of rank collapse and further investigate the potential solutions to alleviate this shortcoming.
Pages: 14
Related papers
50 in total
  • [31] Bayesian networks in multimodal speech recognition and speaker identification
    Nefian, AV
    Liang, LH
    CONFERENCE RECORD OF THE THIRTY-SEVENTH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, VOLS 1 AND 2, 2003, : 2004 - 2008
  • [32] Speaker Recognition Using Neural Networks and Conventional Classifiers
    Farrell, Kevin R.
    Mammone, Richard J.
    Assaleh, Khaled T.
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1994, 2 (01): : 194 - 205
  • [33] Correlation Networks for Speaker Normalization in Automatic Speech Recognition
    Sharon, Rini A.
    Kothinti, Sandeep Reddy
    Umesh, Srinivasan
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 882 - 886
  • [34] AN APPLICATION OF SPEAKER RECOGNITION USING ARTIFICIAL NEURAL NETWORKS
    Caner, Murat
    Ustun, Seydi Vakkas
    PAMUKKALE UNIVERSITY JOURNAL OF ENGINEERING SCIENCES-PAMUKKALE UNIVERSITESI MUHENDISLIK BILIMLERI DERGISI, 2006, 12 (02): : 279 - 284
  • [35] Speaker recognition using convolutional siamese neural networks
    Jung H.
    Yoon S.
    Park N.
    Transactions of the Korean Institute of Electrical Engineers, 2020, 69 (01): : 164 - 169
  • [36] Speaker recognition using pulse coupled neural networks
    Timoszczuk, Antonio Pedro
    Cabral, Euvaldo F., Jr.
    2007 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-6, 2007, : 1965 - +
  • [37] Contrastive Adversarial Domain Adaptation Networks for Speaker Recognition
    Li, Longxin
    Mak, Man-Wai
    Chien, Jen-Tzung
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (05) : 2236 - 2245
  • [38] Which to select?: Analysis of speaker representation with graph attention networks
    Shim, Hye-jin
    Jung, Jee-weon
    Yu, Ha-Jin
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2024, 156 (04): : 2701 - 2708
  • [40] Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification
    Shi, Yanpei
    Huang, Qiang
    Hain, Thomas
    INTERSPEECH 2020, 2020, : 2992 - 2996