Self Attention Networks in Speaker Recognition

Cited by: 2
Authors
Safari, Pooyan [1]
India, Miquel [1]
Hernando, Javier [1]
Affiliations
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona 08034, Spain
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, No. 11
Keywords
speaker recognition; self-attention networks; transformer; speaker embeddings; SPEECH; REPRESENTATION;
DOI
10.3390/app13116410
Chinese Library Classification
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Recently, there has been a significant surge of interest in Self-Attention Networks (SANs) based on the Transformer architecture. This can be attributed to their notable capacity for parallelization and their impressive performance across various Natural Language Processing applications. At the same time, large-scale, multi-purpose language models trained through self-supervision are becoming increasingly prevalent for tasks like speech recognition. In this context, a pre-trained model that has been trained on extensive speech data can be fine-tuned for particular downstream tasks like speaker verification. These massive models typically rely on SANs as their foundational architecture. Therefore, studying the potential capabilities and training challenges of such models is of utmost importance for the future generation of speaker verification systems. In this direction, we propose a speaker embedding extractor based on SANs to obtain a discriminative speaker representation given non-fixed length speech utterances. With the advancements suggested in this work, we achieve up to 41% relative performance improvement in terms of EER compared to the naive SAN proposed in our previous work. Moreover, we empirically show the training instability in such architectures in terms of rank collapse and further investigate potential solutions to alleviate this shortcoming.
Pages: 14