A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients

Cited by: 10
Authors
Salvati, Daniele [1 ]
Drioli, Carlo [1 ]
Foresti, Gian Luca [1 ]
Affiliations
[1] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, I-33100 Udine, Italy
Keywords
Speaker identification; Deep neural network; Convolutional neural network; Late fusion; Raw waveform; Gammatone cepstral coefficient; DATA AUGMENTATION; RECOGNITION; FILTERBANK; FEATURES; SIGNAL; NOISY; MODEL; CNNS;
DOI
10.1016/j.eswa.2023.119750
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Speaker identification aims to determine a speaker's identity by analyzing the characteristics of their voice, and typically relies on statistical models or machine learning techniques. Frequency-domain features are by far the most common choice for encoding the audio input in sound recognition. Recently, some studies have also analyzed the use of the time-domain raw waveform (RW) with deep neural network (DNN) architectures. In this paper, we hypothesize that time-domain and frequency-domain features can be used jointly to increase the robustness of the speaker identification task in adverse noise and reverberation conditions, and we present a method based on a late fusion DNN using RWs and gammatone cepstral coefficients (GTCCs). We analyze the characteristics of RW and spectrum-based short-time features, reporting their advantages and limitations, and we show that their joint use can increase identification accuracy. The proposed late fusion DNN model consists of two independent DNN branches built primarily from convolutional neural network (CNN) and fully connected neural network (NN) layers. The two branches take as input short-time RW audio fragments and GTCCs, respectively. The late fusion is computed on the predicted scores of the two branches. Since the method operates on short segments, it has the advantage of being independent of the length of the input audio signal, and the identification decision can be obtained by summing the predicted scores over several short-time frames. An analysis of speaker identification performance computed with simulations shows that the late fusion DNN model improves the accuracy rate in adverse noise and reverberation conditions in comparison with RW, GTCC, and mel-frequency cepstral coefficient (MFCC) features. Experiments with real-world speech datasets confirm the effectiveness of the proposed method, especially with small-size audio samples.
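The two-branch architecture and score-level fusion described in the abstract can be summarized in a few lines of code. Below is a minimal PyTorch sketch, not the authors' implementation: the layer sizes, the 400-sample raw-waveform frame, the 20-dimensional GTCC vectors, and the equal-weight fusion are illustrative assumptions.

```python
# Minimal sketch of a two-branch late-fusion speaker-identification model.
# All hyperparameters (frame length, channel counts, GTCC dimension, fusion
# weight) are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RawWaveformBranch(nn.Module):
    """1D CNN over short raw-waveform fragments."""
    def __init__(self, n_speakers: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, n_speakers)

    def forward(self, x):                      # x: (batch, frame_len)
        h = self.conv(x.unsqueeze(1)).squeeze(-1)
        return self.fc(h)                      # unnormalized speaker scores


class GTCCBranch(nn.Module):
    """Fully connected network over per-frame GTCC vectors."""
    def __init__(self, n_speakers: int, n_gtcc: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_gtcc, 128), nn.ReLU(),
            nn.Linear(128, n_speakers),
        )

    def forward(self, c):                      # c: (batch, n_gtcc)
        return self.net(c)


def late_fusion_scores(raw_logits, gtcc_logits, w: float = 0.5):
    """Fuse the branches at the score level: convex combination of softmax scores."""
    return w * F.softmax(raw_logits, dim=-1) + (1.0 - w) * F.softmax(gtcc_logits, dim=-1)


if __name__ == "__main__":
    n_speakers, n_frames = 10, 30              # toy utterance split into 30 short frames
    raw_branch, gtcc_branch = RawWaveformBranch(n_speakers), GTCCBranch(n_speakers)
    frames = torch.randn(n_frames, 400)        # raw-waveform fragments (placeholder data)
    gtccs = torch.randn(n_frames, 20)          # matching GTCC vectors (placeholder data)

    fused = late_fusion_scores(raw_branch(frames), gtcc_branch(gtccs))
    speaker_id = fused.sum(dim=0).argmax()     # sum frame scores -> utterance-level decision
    print(int(speaker_id))
```

Because fusion happens on normalized scores rather than hidden features, the two branches remain independent and the utterance-level decision reduces to summing per-frame fused scores, which is why the approach does not depend on the input signal length.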
Pages: 9
Related papers (50 total)
  • [41] Channel-robust speaker identification using Modified-Mean Cepstral Mean Normalization with Frequency Warping
    Garcia, AA
    Mammone, RJ
    ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 325 - 328
  • [42] Robust Neural Network for Wavefront Reconstruction Using Zernike Coefficients
    Ambrose, Adrian
    Dillon, Keith
    APPLICATIONS OF MACHINE LEARNING 2020, 2020, 11511
  • [43] Visual-Textual Late Semantic Fusion Using Deep Neural Network for Document Categorization
    Wang, Cheng
    Yang, Haojin
    Meinel, Christoph
    NEURAL INFORMATION PROCESSING, PT I, 2015, 9489 : 662 - 670
  • [44] Speaker Identification using Wavelet Shannon Entropy and Probabilistic Neural Network
    Lei, Lei
    She, Kun
    2016 12TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), 2016, : 566 - 571
  • [45] PPG-based human identification using Mel-frequency cepstral coefficients and neural networks
    Siam, Ali I.
    Elazm, Atef Abou
    El-Bahnasawy, Nirmeen A.
    El Banby, Ghada M.
    Abd El-Samie, Fathi E.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (17) : 26001 - 26019
  • [47] Decision Level Fusion based Approach for Indian Languages Identification using Deep Neural Network
    Gupta, Kanika
    Gour, Kartikeya Singh
    Arya, Sompal
    Gangashetty, Suryakanth V.
    PROCEEDINGS OF TENCON 2018 - 2018 IEEE REGION 10 CONFERENCE, 2018, : 2056 - 2059
  • [48] DEEP NEURAL NETWORK DRIVEN MIXTURE OF PLDA FOR ROBUST I-VECTOR SPEAKER VERIFICATION
    Li, Na
    Mak, Man-Wai
    Chien, Jen-Tzung
    2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 186 - 191
  • [50] Identification of robust deep neural network models of longitudinal clinical measurements
    Javidi, Hamed
    Mariam, Arshiya
    Khademi, Gholamreza
    Zabor, Emily C.
    Zhao, Ran
    Radivoyevitch, Tomas
    Rotroff, Daniel M.
    NPJ DIGITAL MEDICINE, 2022, 5 (01)