A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients

Cited: 10
Authors
Salvati, Daniele [1 ]
Drioli, Carlo [1 ]
Foresti, Gian Luca [1 ]
Affiliations
[1] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, I-33100 Udine, Italy
Keywords
Speaker identification; Deep neural network; Convolutional neural network; Late fusion; Raw waveform; Gammatone cepstral coefficient; DATA AUGMENTATION; RECOGNITION; FILTERBANK; FEATURES; SIGNAL; NOISY; MODEL; CNNS;
DOI
10.1016/j.eswa.2023.119750
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speaker identification aims at determining a speaker's identity by analyzing their voice characteristics, and typically relies on statistical models or machine learning techniques. Frequency-domain features are by far the most common choice for encoding the audio input in sound recognition. Recently, some studies have also analyzed the use of the time-domain raw waveform (RW) with deep neural network (DNN) architectures. In this paper, we hypothesize that both time-domain and frequency-domain features can be used to increase the robustness of the speaker identification task in adverse noise and reverberation conditions, and we present a method based on a late fusion DNN using RWs and gammatone cepstral coefficients (GTCCs). We analyze the characteristics of RW and spectrum-based short-time features, reporting their advantages and limitations, and we show that their joint use can increase identification accuracy. The proposed late fusion DNN model consists of two independent DNN branches composed primarily of convolutional neural network (CNN) and fully connected neural network (NN) layers. The two DNN branches take as input short-time RW audio fragments and GTCCs, respectively. The late fusion is computed on the predicted scores of the DNN branches. Since the method is based on short segments, it has the advantage of being independent of the size of the input audio signal, and the identification task can be computed by summing the predicted scores over several short-time frames. Analysis of speaker identification performance computed with simulations shows that the late fusion DNN model improves the accuracy rate in adverse noise and reverberation conditions in comparison to the RW, GTCC, and mel-frequency cepstral coefficient (MFCC) features. Experiments with real-world speech datasets confirm the effectiveness of the proposed method, especially with small-size audio samples.
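The late fusion scheme described in the abstract (summing the two branches' predicted scores over several short-time frames, then taking the best-scoring speaker) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the list-of-lists score layout, and the toy values are all assumptions, and in practice each branch's per-frame scores would come from the trained CNN branches.

```python
def late_fusion_identify(rw_scores, gtcc_scores):
    """Fuse per-frame scores from the two DNN branches and pick a speaker.

    rw_scores, gtcc_scores: lists of per-frame score vectors, one vector of
    length n_speakers per short-time frame (e.g. softmax outputs). Both names
    and the layout are illustrative, not the paper's API.
    """
    n_speakers = len(rw_scores[0])
    totals = [0.0] * n_speakers
    for rw_frame, gtcc_frame in zip(rw_scores, gtcc_scores):
        for k in range(n_speakers):
            # Late fusion: add the two branches' scores per frame,
            # then accumulate over all short-time frames.
            totals[k] += rw_frame[k] + gtcc_frame[k]
    return max(range(n_speakers), key=totals.__getitem__)

# Toy example: 3 frames, 4 candidate speakers (values are made up).
rw = [[0.1, 0.6, 0.2, 0.1],
      [0.2, 0.5, 0.2, 0.1],
      [0.3, 0.3, 0.3, 0.1]]
gtcc = [[0.2, 0.5, 0.2, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.2, 0.4, 0.3, 0.1]]
print(late_fusion_identify(rw, gtcc))  # prints 1: speaker 1 has the highest fused total
```

Because the decision is made by accumulating frame-level scores, the same routine works for input signals of any length, which matches the abstract's point that the method is independent of the input audio size.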
Pages: 9
Related Papers
50 records in total
  • [1] Speaker Identification Based On Gammatone Cepstral Coefficients And General Regression Neural Network
    Li, Penghua
    Hu, Fangchao
    Li, Yinguo
    Qiu, Baomei
    26TH CHINESE CONTROL AND DECISION CONFERENCE (2014 CCDC), 2014, : 751 - 756
  • [2] Gammatone Frequency Cepstral Coefficients for Speaker Identification over VoIP Networks
    Bouziane, Ayoub
    Kharroubi, Jamal
    Zarghili, Arsalane
    2016 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY FOR ORGANIZATIONS DEVELOPMENT (IT4OD), 2016,
  • [3] Speaker Identification Using Linear Predictive Cepstral Coefficients And General Regression Neural Network
    Li, Penghua
    Hu, Fangchao
    Li, Yinguo
    Xu, Yang
    2014 33RD CHINESE CONTROL CONFERENCE (CCC), 2014, : 4952 - 4956
  • [4] Speaker identification using Kalman cepstral coefficients
    Svenda, Z
    Radová, V
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2000, 1902 : 295 - 300
  • [6] Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN
    Kumaran, U.
    Radha Rammohan, S.
    Nagarajan, Senthil Murugan
    Prathik, A.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (02) : 303 - 314
  • [7] Speech Emotion Recognition Using Gammatone Cepstral Coefficients and Deep Learning Features
    Sharan, Roneel V.
    2023 IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLIED NETWORK TECHNOLOGIES, ICMLANT, 2023, : 139 - 142
  • [9] A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients
    Abraham, J. V. Thomas
    Khan, A. Nayeemulla
    Shahina, A.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 26 (3) : 579 - 587
  • [10] Stationary wavelet Filtering Cepstral coefficients (SWFCC) for robust speaker identification
    Missaoui, Ibrahim
    Lachiri, Zied
    Applied Acoustics, 2025, 231