An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 39
Authors
Ahmed, Md. Rayhan [1 ]
Islam, Salekul [1 ]
Islam, A. K. M. Muzahidul [1 ]
Shatabda, Swakkhar [1 ]
Affiliations
[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
Keywords
Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; FEATURE-SELECTION; 2D CNN; FEATURES; CLASSIFICATION; NETWORK;
DOI
10.1016/j.eswa.2023.119633
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance, mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), this article proposes an ensemble utilizing the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by Fully Connected Networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers follow the CNN layer, respectively. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal level to obtain better model generalization. Five categories of features were extracted from the speech samples: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value from each audio file in those datasets. All four models perform exceptionally well in the SER task; notably, the ensemble model accomplishes the state-of-the-art (SOTA) weighted average accuracy of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 90.47% for CREMA-D, and thus significantly outperforms the SOTA models using the same datasets.
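The noise-injection augmentation, frame-level features (zero-crossing rate, root mean square), and weighted-average ensembling described in the abstract can be sketched in plain NumPy. This is an illustrative sketch, not the authors' published code: the function names, the SNR parameter, and the frame/hop sizes are assumptions made here for clarity.

```python
import numpy as np

def add_white_noise(signal, snr_db=20.0, rng=None):
    # Additive white Gaussian noise at a target signal-to-noise ratio (dB).
    # snr_db is an assumed knob; the paper does not specify its noise level here.
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def frame_zcr_rms(signal, frame_len=2048, hop=512):
    # Per-frame zero-crossing rate and root-mean-square energy,
    # two of the five feature categories named in the abstract.
    zcr, rms = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        rms.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(zcr), np.array(rms)

def weighted_average_ensemble(probs_list, weights):
    # Combine per-model class-probability arrays with a weighted average,
    # as the ensemble of the 1D CNN-FCN, CNN-LSTM-FCN, and CNN-GRU-FCN models does.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(probs_list)  # shape: (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)
```

For real spectral features (MFCC, log mel spectrogram, chromagram) a library such as librosa would normally be used; the sketch above covers only the parts that reduce to a few lines of array arithmetic.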
Pages: 21
Related papers
50 items in total
  • [11] Speech emotion recognition using data augmentation
    Praseetha, V. M.
    Joby, P. P.
    [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 25 (4) : 783 - 792
  • [12] Speech Emotion Recognition Model Based on Attention CNN Bi-GRU Fusing Visual Information
    Hu, Zhangfang
    Wang, Lan
    Luo, Yuan
    Xia, Yanling
    Xiao, Hang
    [J]. ENGINEERING LETTERS, 2022, 30 (02)
  • [13] Wavelet Multiresolution Analysis Based Speech Emotion Recognition System Using 1D CNN LSTM Networks
    Dutt, Aditya
    Gader, Paul
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2043 - 2054
  • [14] An emotion recognition method based on EWT-3D-CNN-BiLSTM-GRU-AT model
    Celebi, Muharrem
    Ozturk, Sitki
    Kaplan, Kaplan
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2024, 169
  • [15] 1D-CNN-LSTM Hybrid-Model-Based Pet Behavior Recognition through Wearable Sensor Data Augmentation
    Kim, Hyungju
    Moon, Nammee
    [J]. JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2024, 20 (02): : 159 - 172
  • [16] Data Augmentation using GANs for Speech Emotion Recognition
    Chatziagapi, Aggelina
    Paraskevopoulos, Georgios
    Sgouropoulos, Dimitris
    Pantazopoulos, Georgios
    Nikandrou, Malvina
    Giannakopoulos, Theodoros
    Katsamanis, Athanasios
    Potamianos, Alexandros
    Narayanan, Shrikanth
    [J]. INTERSPEECH 2019, 2019, : 171 - 175
  • [17] Adversarial Data Augmentation Network for Speech Emotion Recognition
    Yi, Lu
    Mak, Man-Wai
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 529 - 534
  • [18] CNN and LSTM based ensemble learning for human emotion recognition using EEG recordings
    Iyer, Abhishek
    Das, Srimit Sritik
    Teotia, Reva
    Maheshwari, Shishir
    Sharma, Rishi Raj
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (04) : 4883 - 4896
  • [20] Ensemble softmax regression model for speech emotion recognition
    Yaxin Sun
    Guihua Wen
    [J]. Multimedia Tools and Applications, 2017, 76 : 8305 - 8328