Self Attention Networks in Speaker Recognition

Cited by: 2
Authors:
Safari, Pooyan [1 ]
India, Miquel [1 ]
Hernando, Javier [1 ]
Affiliation:
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona 08034, Spain
Source:
APPLIED SCIENCES-BASEL, 2023, Vol. 13, Issue 11
Keywords:
speaker recognition; self-attention networks; transformer; speaker embeddings; SPEECH; REPRESENTATION;
DOI: 10.3390/app13116410
CLC Number: O6 [Chemistry]
Subject Classification Code: 0703
Abstract:
Recently, there has been a significant surge of interest in Self-Attention Networks (SANs) based on the Transformer architecture, owing to their high degree of parallelization and impressive performance across a variety of Natural Language Processing applications. At the same time, large-scale, multi-purpose language models trained through self-supervision are increasingly prevalent in tasks such as speech recognition. In this setting, a model pre-trained on extensive speech data can be fine-tuned for particular downstream tasks such as speaker verification. These massive models typically rely on SANs as their foundational architecture, so studying the potential capabilities and training challenges of such models is of utmost importance for the next generation of speaker verification systems. In this direction, we propose a SAN-based speaker embedding extractor that obtains a discriminative speaker representation from variable-length speech utterances. With the advancements suggested in this work, we achieve up to a 41% relative improvement in terms of EER compared to the naive SAN proposed in our previous work. Moreover, we empirically demonstrate the training instability of such architectures in terms of rank collapse and further investigate potential solutions to alleviate this shortcoming.
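The rank collapse mentioned in the abstract refers to a known property of attention-only networks: stacking pure softmax self-attention layers, without skip connections or feed-forward blocks, drives the token representations toward a common vector, i.e. toward a rank-1 matrix. The following minimal NumPy sketch (not the authors' code; the weight matrices and iteration count are illustrative assumptions) shows the effect by repeatedly applying one self-attention layer and tracking how far the token matrix is from its best rank-1 approximation.

```python
# Hypothetical illustration of rank collapse (not the paper's implementation):
# repeated pure self-attention, with no residual connections or MLP blocks,
# pushes the token matrix toward rank 1.
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention over the rows of X (tokens x dim)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-stochastic attention map
    return A @ (X @ Wv)

def rank1_residual(X):
    """Relative distance of X from its best rank-1 approximation (in [0, 1])."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum()) / np.sqrt((s ** 2).sum())

d = 16
X = rng.standard_normal((10, d))                  # 10 tokens, 16-dim features
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

res = [rank1_residual(X)]
for _ in range(8):
    X = self_attention(X, Wq, Wk, Wv)
    X /= np.linalg.norm(X)                        # rescale; keeps magnitudes comparable
    res.append(rank1_residual(X))

print(f"rank-1 residual: start {res[0]:.3f} -> after 8 layers {res[-1]:.3f}")
```

Because each attention map is row-stochastic, every layer averages token representations, so the residual shrinks with depth; residual connections (which the Transformer architecture includes precisely for this kind of reason) counteract the collapse.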
Pages: 14
Related Papers (50 total):
  • [1] Self Multi-Head Attention for Speaker Recognition
    India, Miquel
    Safari, Pooyan
    Hernando, Javier
    INTERSPEECH 2019, 2019, : 4305 - 4309
  • [2] Self-Attention Encoding and Pooling for Speaker Recognition
    Safari, Pooyan
    India, Miquel
    Hernando, Javier
    INTERSPEECH 2020, 2020, : 941 - 945
  • [3] SUPERVISED ATTENTION FOR SPEAKER RECOGNITION
    Kye, Seong Min
    Chung, Joon Son
    Kim, Hoirin
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 286 - 293
  • [4] Self-Attention Networks for Text-Independent Speaker Verification
    Bian, Tengyue
    Chen, Fangzhou
    Xu, Li
    PROCEEDINGS OF THE 2019 31ST CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2019), 2019, : 3955 - 3960
  • [5] MULTI-VIEW SELF-ATTENTION BASED TRANSFORMER FOR SPEAKER RECOGNITION
    Wang, Rui
    Ao, Junyi
    Zhou, Long
    Liu, Shujie
    Wei, Zhihua
    Ko, Tom
    Li, Qing
    Zhang, Yu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6732 - 6736
  • [6] Self-attention is What You Need to Fool a Speaker Recognition System
    Wang, Fangwei
    Song, Ruixin
    Tan, Zhiyuan
    Li, Qingru
    Wang, Changguang
    Yang, Yong
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 929 - 936
  • [7] Emotion embedding framework with emotional self-attention mechanism for speaker recognition
    Li, Dongdong
    Yang, Zhuo
    Liu, Jinlin
    Yang, Hai
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [8] GRAPH ATTENTION NETWORKS FOR SPEAKER VERIFICATION
    Jung, Jee-weon
    Heo, Hee-Soo
    Yu, Ha-Jin
    Chung, Joon Son
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6149 - 6153
  • [9] Self-attention based speaker recognition using Cluster-Range Loss
    Bian, Tengyue
    Chen, Fangzhou
    Xu, Li
    NEUROCOMPUTING, 2019, 368 : 59 - 68
  • [10] Application of Channel Attention for Speaker Recognition in the Wild
    Chen, Zhi
    Wang, Lei
    PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21), 2021,