A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients

Cited by: 10
Authors
Salvati, Daniele [1 ]
Drioli, Carlo [1 ]
Foresti, Gian Luca [1 ]
Institution
[1] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, I-33100 Udine, Italy
Keywords
Speaker identification; Deep neural network; Convolutional neural network; Late fusion; Raw waveform; Gammatone cepstral coefficient; DATA AUGMENTATION; RECOGNITION; FILTERBANK; FEATURES; SIGNAL; NOISY; MODEL; CNNS;
DOI
10.1016/j.eswa.2023.119750
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speaker identification aims at determining a speaker's identity by analyzing their voice characteristics, and typically relies on statistical models or machine learning techniques. Frequency-domain features are by far the most common choice for encoding the audio input in sound recognition. Recently, some studies have also analyzed the use of the time-domain raw waveform (RW) with deep neural network (DNN) architectures. In this paper, we hypothesize that both time-domain and frequency-domain features can be used to increase the robustness of the speaker identification task in adverse noisy and reverberant conditions, and we present a method based on a late fusion DNN using RWs and gammatone cepstral coefficients (GTCCs). We analyze the characteristics of RW and spectrum-based short-time features, reporting their advantages and limitations, and we show that their joint use can increase identification accuracy. The proposed late fusion DNN model consists of two independent DNN branches made primarily of convolutional neural network (CNN) and fully connected neural network (NN) layers. The two DNN branches take as input short-time RW audio fragments and GTCCs, respectively. The late fusion is computed on the predicted scores of the DNN branches. Since the method is based on short segments, it has the advantage of being independent of the length of the input audio signal, and the identification task can be computed by summing the predicted scores over several short-time frames. Analysis of speaker identification performance computed with simulations shows that the late fusion DNN model improves the accuracy rate in adverse noise and reverberation conditions in comparison with the RW, GTCC, and mel-frequency cepstral coefficient (MFCC) features. Experiments with real-world speech datasets confirm the efficiency of the proposed method, especially with small-size audio samples.
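The abstract describes fusing the per-frame predicted scores of the two DNN branches and summing them over short-time frames to obtain an utterance-level decision. The following is a minimal NumPy sketch of that score-aggregation step only (not the CNN branches themselves); the random logits and the equal-weight averaging fusion rule are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_speakers = 12, 5

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-frame logits from the two branches:
# one fed with raw-waveform (RW) fragments, one with GTCCs.
rw_logits = rng.normal(size=(n_frames, n_speakers))
gtcc_logits = rng.normal(size=(n_frames, n_speakers))

# Late fusion: combine the branches' predicted score vectors per frame
# (equal-weight averaging here, as a simple illustrative choice).
fused = 0.5 * softmax(rw_logits) + 0.5 * softmax(gtcc_logits)

# Utterance-level identification: sum the fused scores over all
# short-time frames, then pick the highest-scoring speaker.
utterance_scores = fused.sum(axis=0)
speaker_id = int(np.argmax(utterance_scores))
```

Because the decision is a sum over frames, the same procedure applies to an input signal of any length: longer recordings simply contribute more frame-level score vectors.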
Pages: 9
Related Papers
50 records
  • [31] Mel Frequency Cepstral Coefficients (MFCC) Based Speaker Identification in Noisy Environment Using Wiener Filter
    Chauhan, Paresh M.
    Desai, Nikita P.
    2014 INTERNATIONAL CONFERENCE ON GREEN COMPUTING COMMUNICATION AND ELECTRICAL ENGINEERING (ICGCCEE), 2014,
  • [32] A Study on Speaker Identification Approach by Feature Matching Algorithm using Pitch and Mel Frequency Cepstral Coefficients
    Prasetio, Barlian Henryranu
    Sakurai, Keiko
    Tamura, Hiroki
    Tanno, Koichi
    ICAROB 2019: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ARTIFICIAL LIFE AND ROBOTICS, 2019, : 475 - 478
  • [33] MODELLING SPEAKER AND CHANNEL VARIABILITY USING DEEP NEURAL NETWORKS FOR ROBUST SPEAKER VERIFICATION
    Bhattacharya, Gautam
    Alam, Jahangir
    Kenny, Patrick
    Gupta, Vishwa
    2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 192 - 198
  • [34] Ensemble Speaker Modeling using Speaker Adaptive Training Deep Neural Network for Speaker Adaptation
    Li, Sheng
    Lu, Xugang
    Akita, Yuya
    Kawahara, Tatsuya
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2892 - 2896
  • [35] PerceptionNet: A Deep Convolutional Neural Network for Late Sensor Fusion
    Kasnesis, Panagiotis
    Patrikakis, Charalampos Z.
    Venieris, Iakovos S.
    INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 1, 2019, 868 : 101 - 119
  • [36] Speaker diarization system using HXLPS and deep neural network
    Ramaiah, V. Subba
    Rao, R. Rajeswara
    ALEXANDRIA ENGINEERING JOURNAL, 2018, 57 (01) : 255 - 266
  • [37] Speaker identification using a hybrid neural network and conformity approach
    Ouzounov, A
    SIGNAL ANALYSIS & PREDICTION I, 1997, : 455 - 458
  • [38] A real time speaker identification using artificial neural network
    Hossain, Md. Murad
    Ahmed, Boshir
    Asrafi, Mahrnuda
    PROCEEDINGS OF 10TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2007), 2007, : 325 - 329
  • [39] Speaker Identification System Using Wavelet Transform and Neural Network
    Daqrouq, K.
    Abu Hilal, T.
    Sherif, M.
    El-Hajar, S.
    Al-Qawasmi, A.
    2009 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTATIONAL TOOLS FOR ENGINEERING APPLICATIONS, 2009, : 560 - +
  • [40] Deep Neural Network for Speaker Identification Using Static and Dynamic Prosodic Feature for Spontaneous and Dictated Data
    Rahman, Arifan
    Wibowo, Wahyu Catur
    13TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS 2021), 2021, : 145 - +