Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Cited by: 5
Authors
Pan, Shing-Tai [1]
Wu, Han-Jui [1]
Affiliations
[1] National University of Kaohsiung, Department of Computer Science and Information Engineering, Kaohsiung 811, Taiwan
Keywords
speech emotion recognition; one-dimensional neural network; LSTM; CNN; MFCCs
DOI
10.3390/electronics12112436
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In recent years, the increasing popularity of smart mobile devices has made interaction between devices and users, particularly voice interaction, more important. If smart devices can better understand users' emotional states from voice data, they can provide more personalized services. This paper proposes a machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN). To design a system that closely resembles the human auditory system in recognizing audio signals, the Mel-frequency cepstral coefficients (MFCCs) of the voice signal are extracted and used as the input of the model. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs then compute feature values from the MFCCs. Because audio signals are time-series data, the feature values produced by the LFLBs are fed into an LSTM layer to enhance learning at the time-series level. Finally, fully connected layers perform classification and prediction. The proposed model is evaluated on three databases: RAVDESS, EMO-DB, and IEMOCAP. The results demonstrate that, owing to the time-series characteristics of speech signals, the LSTM effectively models the features extracted by the 1D CNN. Additionally, the data augmentation method applied in this paper improves the recognition accuracy and stability of the system across the different databases. According to the experimental results, the proposed system achieves higher recognition rates than related research in speech emotion recognition.
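The pipeline described in the abstract (MFCC frames, then 1D-CNN local feature learning blocks, then an LSTM, then fully connected layers) can be made concrete by tracing tensor shapes through each stage. The sketch below is only an illustration using the standard library: the names `LFLB` and `trace_cldnn`, the choice of one convolution plus one max-pooling step per block, and every hyperparameter (frame count, number of MFCCs, filter counts, kernel and pool sizes, LSTM units, class count) are assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def pool1d_out_len(length, pool, stride=None):
    """Output length of 1D max pooling; stride defaults to the pool size."""
    stride = stride or pool
    return (length - pool) // stride + 1


@dataclass
class LFLB:
    """Hypothetical local feature learning block: Conv1D followed by MaxPool1D."""
    filters: int
    kernel: int
    pool: int

    def out_shape(self, shape):
        frames, _channels = shape
        # Padding of kernel // 2 keeps the frame count unchanged for odd kernels.
        frames = conv1d_out_len(frames, self.kernel, padding=self.kernel // 2)
        frames = pool1d_out_len(frames, self.pool)
        return (frames, self.filters)


def trace_cldnn(n_frames, n_mfcc, blocks, lstm_units, n_classes):
    """Return (stage name, output shape) pairs for one utterance."""
    shape = (n_frames, n_mfcc)
    shapes = [("mfcc", shape)]
    for i, block in enumerate(blocks, 1):
        shape = block.out_shape(shape)
        shapes.append((f"lflb{i}", shape))
    # The LSTM is assumed to return only its final hidden state.
    shapes.append(("lstm", (lstm_units,)))
    # The last fully connected layer emits one score per emotion class.
    shapes.append(("dense", (n_classes,)))
    return shapes


if __name__ == "__main__":
    # Assumed sizes: 300 frames, 40 MFCCs, two blocks, 64 LSTM units, 8 classes.
    blocks = [LFLB(filters=64, kernel=3, pool=2), LFLB(filters=128, kernel=3, pool=2)]
    for name, shape in trace_cldnn(300, 40, blocks, lstm_units=64, n_classes=8):
        print(f"{name:6s} {shape}")
```

With these assumed sizes, each block halves the number of time steps while widening the feature dimension, so the LSTM receives a shorter sequence of richer feature vectors than the raw MFCC frames.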
Pages: 21