MULTI-STREAM CONVOLUTIONAL NEURAL NETWORK WITH FREQUENCY SELECTION FOR ROBUST SPEAKER VERIFICATION

Cited by: 0
Authors
Yao, Wei [1]
Chen, Shen [2]
Cui, Jiamin [1]
Lou, Yaolin [1]
Affiliations
[1] Zhejiang Univ Water Resources & Elect Power, Coll Elect Engn, Key Lab Technol Rural Water Management Zhejiang Pr, Hangzhou, Peoples R China
[2] Wanbang Digital Energy Co Ltd China, Hangzhou, Peoples R China
Keywords
Deep learning; speaker verification; convolutional neural network; multi-stream; frequency selection
DOI
10.31577/cai20244819
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speaker verification aims to determine whether an input utterance belongs to the claimed speaker. Conventionally, such systems are deployed in a single-stream configuration in which the feature extractor operates over the full frequency range. In this paper, we hypothesize that a machine can learn enough to perform the classification task when listening to only part of the frequency range rather than all of it, a technique we call frequency selection, and we propose a novel multi-stream Convolutional Neural Network (CNN) framework that uses this technique for speaker verification. The proposed framework combines diverse temporal embeddings generated by multiple streams to make acoustic modeling more robust. To diversify the temporal embeddings, we apply feature augmentation with frequency selection: the full frequency band is manually segmented into several sub-bands, and the feature extractor of each stream selects which sub-bands to use as its target frequency domain. Unlike the conventional single-stream solution, in which each utterance is processed only once, this framework processes each utterance with multiple streams in parallel. The input utterance for each stream is pre-processed by a frequency selector restricted to a specified frequency range, and the stream's output is post-processed by mean normalization. The normalized temporal embeddings of all streams are fed into a pooling layer to generate fused embeddings. We conduct extensive experiments on the VoxCeleb dataset, and the results show that the multi-stream CNN significantly outperforms the single-stream baseline, with a relative improvement of 20.53% in minimum Decision Cost Function (minDCF) and 15.28% in Equal Error Rate (EER).
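The pipeline described in the abstract (sub-band segmentation, per-stream frequency selection, mean normalization, pooled fusion) can be illustrated with a minimal NumPy sketch. Nothing here is the authors' implementation: the function names, the four-way sub-band split, the random-projection `toy_extractor` standing in for each stream's CNN, and the choice to concatenate streams along the channel axis are all assumptions made for illustration. The abstract also does not say which pooling the paper uses; because each stream is mean-normalized over time, a plain temporal mean would be close to zero, so this sketch pools with the temporal standard deviation instead.

```python
import numpy as np


def split_into_subbands(n_bins, n_subbands):
    """Partition frequency-bin indices 0..n_bins-1 into contiguous sub-bands."""
    return np.array_split(np.arange(n_bins), n_subbands)


def select_frequencies(spec, subbands, chosen):
    """Keep only the rows (frequency bins) that belong to the chosen sub-bands."""
    rows = np.concatenate([subbands[i] for i in chosen])
    return spec[rows, :]


def toy_extractor(x, emb_dim, rng):
    """Hypothetical stand-in for a per-stream CNN: random projection plus ReLU."""
    w = rng.standard_normal((emb_dim, x.shape[0]))
    return np.maximum(w @ x, 0.0)  # shape: (emb_dim, n_frames)


def mean_normalize(emb):
    """Subtract each channel's mean over time (assumed form of mean normalization)."""
    return emb - emb.mean(axis=1, keepdims=True)


def multi_stream_embedding(spec, stream_choices, n_subbands=4, emb_dim=8, seed=0):
    """Run every stream on its own sub-band selection and fuse the results."""
    rng = np.random.default_rng(seed)
    subbands = split_into_subbands(spec.shape[0], n_subbands)
    streams = []
    for chosen in stream_choices:
        x = select_frequencies(spec, subbands, chosen)
        streams.append(mean_normalize(toy_extractor(x, emb_dim, rng)))
    # Assumed fusion: stack streams along the channel axis, then pool over time.
    stacked = np.concatenate(streams, axis=0)
    return stacked.std(axis=1)


# Synthetic "spectrogram": 40 frequency bins by 100 frames of random values.
spec = np.random.default_rng(42).random((40, 100))
# Three streams, each listening to a different pair of the four sub-bands.
stream_choices = [(0, 1), (1, 2), (2, 3)]
fused = multi_stream_embedding(spec, stream_choices)
print(fused.shape)  # (24,) = 3 streams * 8 embedding channels
```

Each stream sees only 20 of the 40 frequency bins here, yet the fused vector has a fixed size regardless of utterance length, which is what a downstream verification back end would need.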
Pages: 819-848
Number of pages: 30