A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients

Cited by: 10
Authors
Salvati, Daniele [1 ]
Drioli, Carlo [1 ]
Foresti, Gian Luca [1 ]
Affiliations
[1] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, I-33100 Udine, Italy
Keywords
Speaker identification; Deep neural network; Convolutional neural network; Late fusion; Raw waveform; Gammatone cepstral coefficient; DATA AUGMENTATION; RECOGNITION; FILTERBANK; FEATURES; SIGNAL; NOISY; MODEL; CNNS
DOI
10.1016/j.eswa.2023.119750
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Speaker identification aims to determine a speaker's identity by analyzing their voice characteristics, and typically relies on statistical models or machine learning techniques. Frequency-domain features are by far the most common choice for encoding the audio input in sound recognition. Recently, some studies have also analyzed the use of the time-domain raw waveform (RW) with deep neural network (DNN) architectures. In this paper, we hypothesize that time-domain and frequency-domain features can be used jointly to increase the robustness of the speaker identification task in adverse noise and reverberation conditions, and we present a method based on a late fusion DNN using RWs and gammatone cepstral coefficients (GTCCs). We analyze the characteristics of RW and spectrum-based short-time features, reporting their advantages and limitations, and we show that their joint use can increase identification accuracy. The proposed late fusion DNN model consists of two independent DNN branches composed primarily of convolutional neural network (CNN) and fully connected neural network (NN) layers. The two DNN branches take as input short-time RW audio fragments and GTCCs, respectively. The late fusion is computed on the predicted scores of the DNN branches. Since the method is based on short segments, it has the advantage of being independent of the size of the input audio signal, and the identification task can be computed by summing the predicted scores over several short-time frames. Analysis of speaker identification performance computed with simulations shows that the late fusion DNN model improves the accuracy rate in adverse noise and reverberation conditions in comparison to the RW, GTCC, and mel-frequency cepstral coefficient (MFCC) features. Experiments with real-world speech datasets confirm the effectiveness of the proposed method, especially with small-size audio samples.
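The score-level late fusion and frame-wise accumulation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-frame branch outputs (`rw_logits`, `gtcc_logits`) stand in for the paper's CNN branches, and the fusion rule (sum of per-frame softmax scores) is one plausible reading of score-level fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def late_fusion_identify(rw_logits, gtcc_logits):
    """Fuse per-frame scores from two branches and sum over frames.

    rw_logits, gtcc_logits: arrays of shape (n_frames, n_speakers),
    hypothetical per-frame outputs of the RW and GTCC branches.
    Returns the index of the predicted speaker.
    """
    # Late fusion: combine the branches' predicted scores frame by frame.
    frame_scores = softmax(rw_logits) + softmax(gtcc_logits)
    # Frame independence: accumulate scores over short-time frames,
    # so any number of frames (i.e., any input length) is handled.
    total_scores = frame_scores.sum(axis=0)
    return int(np.argmax(total_scores))
```

Because the fusion and accumulation operate per frame, the decision rule works for any number of short-time frames, which mirrors the abstract's claim that the method is independent of the input signal's length.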
Pages: 9
Related papers
50 total
  • [21] Identification of Pathogenic Viruses Using Genomic Cepstral Coefficients with Radial Basis Function Neural Network
    Adetiba, Emmanuel
    Olugbara, Oludayo O.
    Taiwo, Tunmike B.
    ADVANCES IN NATURE AND BIOLOGICALLY INSPIRED COMPUTING, 2016, 419 : 281 - 291
  • [22] Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
    Jahangir, Rashid
    Teh, Ying Wah
    Memon, Nisar Ahmed
    Mujtaba, Ghulam
    Zareei, Mahdi
    Ishtiaq, Uzair
    Akhtar, Muhammad Zaheer
    Ali, Ihsan
    IEEE ACCESS, 2020, 8 : 32187 - 32202
  • [23] Further Results on Speaker Identification Using Robust Speech Detection and a Neural Network
    Ouzounov, Atanas
    CYBERNETICS AND INFORMATION TECHNOLOGIES, 2009, 9 (01) : 37 - 45
  • [24] Bionic Cepstral coefficients (BCC): A new auditory feature extraction to noise-robust speaker identification
    Zouhir, Youssef
    Zarka, Mohamed
    Ouni, Kais
    APPLIED ACOUSTICS, 2024, 221
  • [25] Modified Mel-frequency Cepstral Coefficients (MMFCC) in Robust Text-dependent Speaker Identification
    Islam, Md. Atiqul
    2017 4TH INTERNATIONAL CONFERENCE ON ADVANCES IN ELECTRICAL ENGINEERING (ICAEE), 2017, : 505 - 509
  • [26] Speaker recognition method based on deep residual network and improved Power Normalized Cepstral Coefficients features
    He, Runhua
    Li, Pan
    Li, Xuemei
    Chen, Shuhang
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VIRTUAL REALITY, AND VISUALIZATION (AIVRV 2021), 2021, 12153
  • [27] Native Language Identification from Raw Waveforms Using Deep Convolutional Neural Networks with Attentive Pooling
    Ubale, Rutuja
    Ramanarayanan, Vikram
    Qian, Yao
    Evanini, Keelan
    Leong, Chee Wee
    Lee, Chong Min
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 403 - 410
  • [28] Speaker Identification System based on PLP Coefficients and Artificial Neural Network
    Chelali, Fatma Zohra
    Djeradi, Amar
    Djeradi, Rachida
    WORLD CONGRESS ON ENGINEERING, WCE 2011, VOL II, 2011, : 1641 - 1646
  • [29] Speaker Diarization Using Deep Neural Network Embeddings
    Garcia-Romero, Daniel
    Snyder, David
    Sell, Gregory
    Povey, Daniel
    McCree, Alan
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4930 - 4934
  • [30] Enhancement in Speaker Identification through Feature Fusion using Advanced Dilated Convolution Neural Network
    Pentapati, Hema Kumar
    Sridevi, K.
    INTERNATIONAL JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING SYSTEMS, 2023, 14 (03) : 301 - 310