An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 39

Authors:

Ahmed, Md. Rayhan [1]

Islam, Salekul [1]

Islam, A. K. M. Muzahidul [1]

Shatabda, Swakkhar [1]

Affiliations:

[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2023 / Vol. 218

Keywords:

Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; FEATURE-SELECTION; 2D CNN; FEATURES; CLASSIFICATION; NETWORK;

DOI:

10.1016/j.eswa.2023.119633

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance, mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of the convolutional neural network (CNN), long short-term memory (LSTM), and gated recurrent unit (GRU), this article proposes an ensemble utilizing the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by Fully Connected Networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers, respectively, follow the CNN layer. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal to obtain better model generalization. Five categories of features were extracted from each audio file in those datasets: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value. All four models perform exceptionally well in the SER task; notably, the ensemble model achieves state-of-the-art (SOTA) weighted average accuracy of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 90.47% for CREMA-D, and thus significantly outperforms the SOTA models using the same datasets.
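The abstract states that the ensemble uses a weighted average of the individual models but does not give the weights or the exact combination rule. The following is a minimal sketch of one common way to do this, assuming each model outputs per-class probabilities. The function name `weighted_average_ensemble`, the example probabilities, and the weights are all hypothetical and are not taken from the paper.

```python
import numpy as np


def weighted_average_ensemble(probabilities, weights):
    """Combine per-model class probabilities with a weighted average.

    probabilities: array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,); normalized here to sum to 1
    """
    probabilities = np.asarray(probabilities, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Contract the model axis; the result has shape (n_samples, n_classes).
    return np.tensordot(weights, probabilities, axes=1)


# Hypothetical outputs of three models for two samples and two classes.
cnn_probs = [[0.6, 0.4], [0.2, 0.8]]
lstm_probs = [[0.3, 0.7], [0.1, 0.9]]
gru_probs = [[0.7, 0.3], [0.6, 0.4]]

# Illustrative weights only; the paper's actual weights are not stated here.
combined = weighted_average_ensemble(
    [cnn_probs, lstm_probs, gru_probs], weights=[0.5, 0.25, 0.25]
)
predicted = combined.argmax(axis=1)
```

In practice the weights would typically be chosen on a validation set, for example in proportion to each model's validation accuracy; that choice is an assumption of this sketch rather than something the abstract specifies.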


Pages: 21

