An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 39
Authors
Ahmed, Md. Rayhan [1 ]
Islam, Salekul [1 ]
Islam, A. K. M. Muzahidul [1 ]
Shatabda, Swakkhar [1 ]
Affiliations
[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
Keywords
Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; FEATURE-SELECTION; 2D CNN; FEATURES; CLASSIFICATION; NETWORK;
DOI
10.1016/j.eswa.2023.119633
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance, mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), this article proposes an ensemble utilizing the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by Fully Connected Networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers follow the CNN layer, respectively. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal level to obtain better model generalization. Five categories of features were extracted from the speech samples: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value from each audio file in those datasets. All four models perform exceptionally well in the SER task; notably, the ensemble model accomplishes the state-of-the-art (SOTA) weighted average accuracy of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 90.47% for CREMA-D, and thus significantly outperforms the SOTA models using the same datasets.
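The noise-injection augmentation, frame-level features (zero-crossing rate, root mean square), and weighted-average ensembling described in the abstract can be sketched in plain NumPy. This is an illustrative sketch, not the authors' published code: the function names, the SNR parameter, and the frame/hop sizes are assumptions made here for clarity.

```python
import numpy as np

def add_white_noise(signal, snr_db=20.0, rng=None):
    # Additive white Gaussian noise at a target signal-to-noise ratio (dB).
    # snr_db is an assumed knob; the paper does not specify its noise level here.
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def frame_zcr_rms(signal, frame_len=2048, hop=512):
    # Per-frame zero-crossing rate and root-mean-square energy,
    # two of the five feature categories named in the abstract.
    zcr, rms = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        rms.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(zcr), np.array(rms)

def weighted_average_ensemble(probs_list, weights):
    # Combine per-model class-probability arrays with a weighted average,
    # as the ensemble of the 1D CNN-FCN, CNN-LSTM-FCN, and CNN-GRU-FCN models does.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(probs_list)  # shape: (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)
```

For real spectral features (MFCC, log mel spectrogram, chromagram) a library such as librosa would normally be used; the sketch above covers only the parts that reduce to a few lines of array arithmetic.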
Pages: 21
Related papers
50 items in total
  • [11] Speech emotion recognition using data augmentation
    Praseetha, V. M.
    Joby, P. P.
    [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 25 (4) : 783 - 792
  • [12] Speech Emotion Recognition Model Based on Attention CNN Bi-GRU Fusing Visual Information
    Hu, Zhangfang
    Wang, Lan
    Luo, Yuan
    Xia, Yanling
    Xiao, Hang
    [J]. ENGINEERING LETTERS, 2022, 30 (02)
  • [13] Wavelet Multiresolution Analysis Based Speech Emotion Recognition System Using 1D CNN LSTM Networks
    Dutt, Aditya
    Gader, Paul
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2043 - 2054
  • [14] An emotion recognition method based on EWT-3D-CNN-BiLSTM-GRU-AT model
    Celebi, Muharrem
    Ozturk, Sitki
    Kaplan, Kaplan
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2024, 169
  • [15] 1D-CNN-LSTM Hybrid-Model-Based Pet Behavior Recognition through Wearable Sensor Data Augmentation
    Kim, Hyungju
    Moon, Nammee
    [J]. JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2024, 20 (02): : 159 - 172
  • [16] Data Augmentation using GANs for Speech Emotion Recognition
    Chatziagapi, Aggelina
    Paraskevopoulos, Georgios
    Sgouropoulos, Dimitris
    Pantazopoulos, Georgios
    Nikandrou, Malvina
    Giannakopoulos, Theodoros
    Katsamanis, Athanasios
    Potamianos, Alexandros
    Narayanan, Shrikanth
    [J]. INTERSPEECH 2019, 2019, : 171 - 175
  • [17] Adversarial Data Augmentation Network for Speech Emotion Recognition
    Yi, Lu
    Mak, Man-Wai
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 529 - 534
  • [18] CNN and LSTM based ensemble learning for human emotion recognition using EEG recordings
    Iyer, Abhishek
    Das, Srimit Sritik
    Teotia, Reva
    Maheshwari, Shishir
    Sharma, Rishi Raj
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (04) : 4883 - 4896
  • [20] Ensemble softmax regression model for speech emotion recognition
    Yaxin Sun
    Guihua Wen
    [J]. Multimedia Tools and Applications, 2017, 76 : 8305 - 8328