A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients

Cited: 10
Authors
Salvati, Daniele [1 ]
Drioli, Carlo [1 ]
Foresti, Gian Luca [1 ]
Affiliations
[1] Univ Udine, Dept Math Comp Sci & Phys, Via Sci 206, I-33100 Udine, Italy
Keywords
Speaker identification; Deep neural network; Convolutional neural network; Late fusion; Raw waveform; Gammatone cepstral coefficient; DATA AUGMENTATION; RECOGNITION; FILTERBANK; FEATURES; SIGNAL; NOISY; MODEL; CNNS;
DOI
10.1016/j.eswa.2023.119750
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speaker identification aims at determining a speaker's identity by analyzing their voice characteristics, and typically relies on statistical models or machine learning techniques. Frequency-domain features are by far the most common choice for encoding the audio input in sound recognition. Recently, some studies have also analyzed the use of the time-domain raw waveform (RW) with deep neural network (DNN) architectures. In this paper, we hypothesize that both time-domain and frequency-domain features can be used to increase the robustness of the speaker identification task in adverse noise and reverberation conditions, and we present a method based on a late fusion DNN using RWs and gammatone cepstral coefficients (GTCCs). We analyze the characteristics of RW and spectrum-based short-time features, reporting their advantages and limitations, and we show that their joint use can increase identification accuracy. The proposed late fusion DNN model consists of two independent DNN branches composed primarily of convolutional neural network (CNN) and fully connected neural network (NN) layers. The two DNN branches take as input short-time RW audio fragments and GTCCs, respectively. The late fusion is computed on the predicted scores of the DNN branches. Since the method is based on short segments, it has the advantage of being independent of the size of the input audio signal, and the identification task can be computed by summing the predicted scores over several short-time frames. Analysis of speaker identification performance computed with simulations shows that the late fusion DNN model improves the accuracy rate in adverse noise and reverberation conditions in comparison to the RW, GTCC, and mel-frequency cepstral coefficient (MFCC) features. Experiments with real-world speech datasets confirm the effectiveness of the proposed method, especially with small-size audio samples.
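The late fusion scheme described in the abstract (summing the two branches' predicted scores over several short-time frames, then taking the best-scoring speaker) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the list-of-lists score layout, and the toy values are all assumptions, and in practice each branch's per-frame scores would come from the trained CNN branches.

```python
def late_fusion_identify(rw_scores, gtcc_scores):
    """Fuse per-frame scores from the two DNN branches and pick a speaker.

    rw_scores, gtcc_scores: lists of per-frame score vectors, one vector of
    length n_speakers per short-time frame (e.g. softmax outputs). Both names
    and the layout are illustrative, not the paper's API.
    """
    n_speakers = len(rw_scores[0])
    totals = [0.0] * n_speakers
    for rw_frame, gtcc_frame in zip(rw_scores, gtcc_scores):
        for k in range(n_speakers):
            # Late fusion: add the two branches' scores per frame,
            # then accumulate over all short-time frames.
            totals[k] += rw_frame[k] + gtcc_frame[k]
    return max(range(n_speakers), key=totals.__getitem__)

# Toy example: 3 frames, 4 candidate speakers (values are made up).
rw = [[0.1, 0.6, 0.2, 0.1],
      [0.2, 0.5, 0.2, 0.1],
      [0.3, 0.3, 0.3, 0.1]]
gtcc = [[0.2, 0.5, 0.2, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.2, 0.4, 0.3, 0.1]]
print(late_fusion_identify(rw, gtcc))  # prints 1: speaker 1 has the highest fused total
```

Because the decision is made by accumulating frame-level scores, the same routine works for input signals of any length, which matches the abstract's point that the method is independent of the input audio size.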
Pages: 9
Related Papers
50 records in total
  • [1] Speaker Identification Based On Gammatone Cepstral Coefficients And General Regression Neural Network
    Li, Penghua
    Hu, Fangchao
    Li, Yinguo
    Qiu, Baomei
    26TH CHINESE CONTROL AND DECISION CONFERENCE (2014 CCDC), 2014, : 751 - 756
  • [2] Gammatone Frequency Cepstral Coefficients for Speaker Identification over VoIP Networks
    Bouziane, Ayoub
    Kharroubi, Jamal
    Zarghili, Arsalane
    2016 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY FOR ORGANIZATIONS DEVELOPMENT (IT4OD), 2016,
  • [3] Speaker Identification Using Linear Predictive Cepstral Coefficients And General Regression Neural Network
    Li, Penghua
    Hu, Fangchao
    Li, Yinguo
    Xu, Yang
    2014 33RD CHINESE CONTROL CONFERENCE (CCC), 2014, : 4952 - 4956
  • [4] Speaker identification using Kalman cepstral coefficients
    Svenda, Z
    Radová, V
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2000, 1902 : 295 - 300
  • [6] Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN
    Kumaran, U.
    Radha Rammohan, S.
    Nagarajan, Senthil Murugan
    Prathik, A.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (02) : 303 - 314
  • [7] Speech Emotion Recognition Using Gammatone Cepstral Coefficients and Deep Learning Features
    Sharan, Roneel V.
    2023 IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLIED NETWORK TECHNOLOGIES, ICMLANT, 2023, : 139 - 142
  • [9] A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients
    Abraham, J. V. Thomas
    Khan, A. Nayeemulla
    Shahina, A.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 26 (3) : 579 - 587
  • [10] Stationary wavelet Filtering Cepstral coefficients (SWFCC) for robust speaker identification
    Missaoui, Ibrahim
    Lachiri, Zied
    Applied Acoustics, 2025, 231