MULTI-VIEW SELF-ATTENTION BASED TRANSFORMER FOR SPEAKER RECOGNITION

Cited by: 20
Authors
Wang, Rui [1 ,4 ]
Ao, Junyi [2 ,3 ,4 ]
Zhou, Long [4 ]
Liu, Shujie [4 ]
Wei, Zhihua [1 ]
Ko, Tom [2 ]
Li, Qing [3 ]
Zhang, Yu [2 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Guangdong, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Keywords
speaker recognition; Transformer; speaker identification; speaker verification;
DOI
10.1109/ICASSP43922.2022.9746639
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences, without considering the characteristics of speech and speaker modeling. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition. Specifically, to balance the capability of capturing global dependencies against that of modeling locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
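The idea that different attention heads attend to different ranges of the receptive field can be illustrated with a small sketch. The abstract does not say how the ranges are imposed, so the following is only one plausible reading and not the authors' implementation: each head is given a band mask of a fixed half-width, and one head is left unrestricted (global). The function names `band_mask` and `multi_view_attention`, the `windows` parameter, and the window sizes are all hypothetical.

```python
import numpy as np

def band_mask(seq_len, window):
    """Boolean mask, True where |i - j| <= window. window=None means global."""
    if window is None:
        return np.ones((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def multi_view_attention(q, k, v, windows):
    """Scaled dot-product attention with a per-head receptive field.

    q, k, v: arrays of shape (heads, seq_len, d_head).
    windows: one half-width per head (assumed values, not from the paper).
    """
    heads, seq_len, d_head = q.shape
    assert len(windows) == heads
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Positions outside a head's window get -inf, i.e. zero attention weight.
    mask = np.stack([band_mask(seq_len, w) for w in windows])
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax; the diagonal is never masked, so max is finite.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 10, 8)) for _ in range(3))
# Three local heads with growing windows plus one global head.
out, weights = multi_view_attention(q, k, v, windows=[1, 2, 4, None])
print(out.shape)  # (4, 10, 8)
```

In this sketch the narrow heads model locality while the unrestricted head captures global dependencies, which is the balance the abstract describes. How the ranges are actually chosen or learned is left open here.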
Pages: 6732-6736
Number of pages: 5