On the Use of Cross-module Attention Statistics Pooling for Speaker Verification

Authors
Alam, Jahangir [1]
Fathan, Abderrahim [1]
Affiliation
[1] CRIM, Montreal, PQ, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Speaker verification; neural speaker embeddings; hybrid neural network; cross-module attention
DOI
10.1109/IWBF57495.2023.10157564
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In deep learning-based speaker verification frameworks, the extraction of a speaker embedding vector plays a key role. In this contribution, we propose a hybrid neural network that employs a cross-module attention pooling mechanism to extract speaker-discriminative utterance-level embeddings. Specifically, the proposed system cascades a 2D Convolutional Neural Network (CNN)-based feature extraction module with a frame-level network composed of a fully Time Delay Neural Network (TDNN)-based network and a hybrid TDNN-Long Short-Term Memory (TDNN-LSTM) network connected in parallel. The system also employs cross-module attention statistics pooling, which aggregates speaker information at the utterance level by capturing the complementarity between the two parallel modules. We conduct a set of experiments on the VoxCeleb corpus to evaluate the proposed system, and the proposed hybrid network outperforms conventional approaches trained on the same dataset.
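The abstract does not give the pooling equations, so the following is only a minimal sketch of one way "cross-module" attention statistics pooling could work, not the authors' implementation. It assumes that the attention weights used to pool each branch's frame-level features are computed from the *other* branch's features (TDNN weights from the TDNN-LSTM branch and vice versa), that each frame is scored with a simplified scorer `tanh(w · h_t)`, and that each branch contributes its attention-weighted mean and standard deviation. All names (`cross_module_attentive_stats_pooling`, `h_tdnn`, `h_lstm`, `w_tdnn`, `w_lstm`) are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # T frames x D features, one row per frame


def softmax(scores: Sequence[float]) -> List[float]:
    # Numerically stable softmax over frames.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_scores(feats: Matrix, w: Sequence[float]) -> List[float]:
    # Hypothetical simplified scorer: one scalar tanh(w . h_t) per frame.
    return [math.tanh(sum(f * wi for f, wi in zip(frame, w))) for frame in feats]


def weighted_stats(feats: Matrix, alpha: Sequence[float], eps: float = 1e-8) -> List[float]:
    # Attention-weighted mean and standard deviation per feature dimension.
    dim = len(feats[0])
    mean = [sum(a * frame[d] for a, frame in zip(alpha, feats)) for d in range(dim)]
    var = [sum(a * frame[d] ** 2 for a, frame in zip(alpha, feats)) - mean[d] ** 2
           for d in range(dim)]
    std = [math.sqrt(max(v, eps)) for v in var]  # clamp to keep sqrt well defined
    return mean + std


def cross_module_attentive_stats_pooling(h_tdnn: Matrix, h_lstm: Matrix,
                                         w_tdnn: Sequence[float],
                                         w_lstm: Sequence[float]) -> List[float]:
    if not h_tdnn or len(h_tdnn) != len(h_lstm):
        raise ValueError("both branches need the same, non-zero number of frames")
    # Assumption: each branch is pooled with weights derived from the other branch.
    alpha_tdnn = softmax(frame_scores(h_lstm, w_lstm))
    alpha_lstm = softmax(frame_scores(h_tdnn, w_tdnn))
    # Utterance-level vector: [mean_tdnn, std_tdnn, mean_lstm, std_lstm].
    return weighted_stats(h_tdnn, alpha_tdnn) + weighted_stats(h_lstm, alpha_lstm)


if __name__ == "__main__":
    h_a = [[1.0, 2.0], [3.0, 4.0]]  # toy TDNN-branch output, 2 frames x 2 dims
    h_b = [[0.0, 0.0], [5.0, 5.0]]  # toy TDNN-LSTM-branch output
    print(cross_module_attentive_stats_pooling(h_a, h_b, [1.0, 1.0], [1.0, 1.0]))
```

In this toy run the second TDNN-LSTM frame scores higher, so the TDNN branch's weighted mean shifts toward its second frame. In a trained network the scoring parameters would be learned, and the pooled vector would typically be passed to a fully connected layer that produces the speaker embedding.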
Pages: 6