MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Cited: 4
Authors
Zheng, Qiuyu [1]
Chen, Zengzhao [1]
Wang, Zhifeng [1]
Liu, Hai [1,2]
Lin, Mengting [1]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
[2] Cent China Normal Univ, Natl Engn Res Ctr Educ Big Data, Wuhan 430079, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Transformer; Speaker embeddings; Selective kernel; Frame-level feature; Utterance-level feature; IDENTIFICATION; NETWORK;
DOI
10.1016/j.eswa.2023.123004
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Code(s)
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. The success of these models can be attributed to their robust parallel capacity and high computation speed, primarily reliant on the attention layer. In the domain of speaker recognition, state-of-the-art results have been achieved using convolutional neural network (CNN) architectures, particularly with speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependence of voiceprint features, resulting in the loss of crucial information. Moreover, the presence of noise in audio data is an influential factor that cannot be disregarded, as it significantly impacts the extraction of discriminative speaker embeddings. To address these challenges, we propose a novel model called the Multi-Scale Expand Convolution Transformer (MEConformer). This model aims to convert variable-length audio into a fixed low-dimensional representation. The MEConformer leverages a CNN framework with expanded receptive fields to capture frame-level features effectively. Additionally, we introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and utterance-level feature representations. Furthermore, we present a multi-scale residual aggregation strategy, which facilitates the efficient transmission of voiceprint information across the model. By combining these innovative components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set. It also demonstrates EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb-E datasets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
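To make the pipeline described in the abstract more concrete, the following is a minimal illustrative sketch in Python (PyTorch), not the authors' released implementation (that code is at the Code Ocean link above): dilated 1-D convolutions stand in for the expanded-receptive-field front-end, a standard Transformer encoder supplies global context across frames, and statistics pooling maps a variable number of frames to a fixed-dimensional speaker embedding. The SpeakerEmbeddingSketch class, all layer sizes, and the equal_error_rate helper are assumptions made for illustration only.

import torch
import torch.nn as nn
import numpy as np
from sklearn.metrics import roc_curve

class SpeakerEmbeddingSketch(nn.Module):
    # Illustrative only: dilated 1-D convolutions (expanded receptive field),
    # a Transformer encoder for global frame dependencies, and statistics
    # pooling to a fixed-dimensional utterance embedding.
    def __init__(self, feat_dim=80, channels=256, emb_dim=192, n_heads=4, n_layers=2):
        super().__init__()
        # Front-end: stacked convolutions with growing dilation so the
        # receptive field covers progressively wider frame contexts.
        self.frontend = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, dilation=1, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
        )
        # Transformer encoder adds the global (long-range) dependencies that
        # purely convolutional front-ends tend to miss.
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=n_heads,
                                           dim_feedforward=4 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Mean + standard-deviation pooling turns a variable number of frames
        # into one fixed-length vector; a linear layer gives the final embedding.
        self.proj = nn.Linear(2 * channels, emb_dim)

    def forward(self, x):                      # x: (batch, time, feat_dim) acoustic features
        h = self.frontend(x.transpose(1, 2))   # (batch, channels, time)
        h = self.encoder(h.transpose(1, 2))    # (batch, time, channels)
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        return self.proj(stats)                # (batch, emb_dim) fixed-size speaker embedding

def equal_error_rate(scores, labels):
    # EER: the operating point where false-acceptance rate equals false-rejection rate.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return 0.5 * (fpr[idx] + fnr[idx])

Given verification-trial scores (for example, cosine similarities between two such embeddings) and 0/1 target labels, equal_error_rate computes the metric reported in the abstract, such as the 3.72% EER on the VoxCeleb1 test set.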
Pages: 14
Related Papers
17 records in total
  • [1] Deep Speaker Embedding with Frame-Constrained Training Strategy for Speaker Verification
    Gu, Bin
    INTERSPEECH 2022, 2022, : 1451 - 1455
  • [2] An Effective Deep Embedding Learning Architecture for Speaker Verification
    Jiang, Yiheng
    Song, Yan
    McLoughlin, Ian
    Gao, Zhifu
    Dai, Lirong
    INTERSPEECH 2019, 2019, : 4040 - 4044
  • [3] Deep Segment Attentive Embedding for Duration Robust Speaker Verification
    Liu, Bin
    Nie, Shuai
    Liu, Wenju
    Zhang, Hui
    Li, Xiangang
    Li, Changliang
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 822 - 826
  • [4] Deep Embedding Learning for Text-Dependent Speaker Verification
    Zhang, Peng
    Hu, Peng
    Zhang, Xueliang
    INTERSPEECH 2020, 2020, : 3461 - 3465
  • [5] Speaker-discriminative Embedding Learning via Affinity Matrix for Short Utterance Speaker Verification
    Peng, Junyi
    Gu, Rongzhi
    Zou, Yuexian
    Wang, Wenwu
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 314 - 319
  • [6] Deep Speaker Embedding with Long Short Term Centroid Learning for Text-independent Speaker Verification
    Peng, Junyi
    Gu, Rongzhi
    Zou, Yuexian
    INTERSPEECH 2020, 2020, : 3246 - 3250
  • [7] DEEP SPEAKER EMBEDDING LEARNING WITH MULTI-LEVEL POOLING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
    Tang, Yun
    Ding, Guohong
    Huang, Jing
    He, Xiaodong
    Zhou, Bowen
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6116 - 6120
  • [8] An Improved Deep Embedding Learning Method for Short Duration Speaker Verification
    Gao, Zhifu
    Song, Yan
    McLoughlin, Ian
    Guo, Wu
    Dai, Lirong
    INTERSPEECH 2018, 2018, : 3578 - 3582
  • [9] Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech
    Singh, Vishwanath Pratap
    Sahidullah, Md
    Kinnunen, Tomi
    INTERSPEECH 2023, 2023, : 1948 - 1952
  • [10] Investigation of Different Calibration Methods for Deep Speaker Embedding Based Verification Systems
    Novoselov, Sergey
    Lavrentyeva, Galina
    Volokhov, Vladimir
    Volkova, Marina
    Khmelev, Nikita
    Akulov, Artem
    SPEECH AND COMPUTER, SPECOM 2023, PT I, 2023, 14338 : 159 - 168