Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation

Cited by: 5

Authors:

Pan, Shing-Tai [1]

Wu, Han-Jui [1]

Affiliations:

[1] Natl Univ Kaohsiung, Dept Comp Sci & Informat Engn, Kaohsiung 811, Taiwan

Source:

ELECTRONICS | 2023 / Vol. 12 / Issue 11

Keywords:

speech emotion recognition; one-dimensional neural network; LSTM; CNN; MFCCs;

DOI:

10.3390/electronics12112436

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

In recent years, the increasing popularity of smart mobile devices has made the interaction between devices and users, particularly through voice interaction, more crucial. By enabling smart devices to better understand users' emotional states through voice data, it becomes possible to provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN). To design a system that closely resembles the human auditory system in recognizing audio signals, this article uses the Mel-frequency cepstral coefficients (MFCCs) of audio data as the input of the machine learning model. First, the MFCCs of the voice signal are extracted as the input of the model. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs are employed to calculate the feature values of the data. As audio signals are time-series data, the resulting feature values from LFLBs are then fed into the LSTM layer to enhance learning on the time-series level. Finally, fully connected layers are used for classification and prediction. The experimental evaluation of the proposed model utilizes three databases: RAVDESS, EMO-DB, and IEMOCAP. The results demonstrate that the LSTM model effectively models the features extracted from the 1D CNN due to the time-series characteristics of speech signals. Additionally, the data augmentation method applied in this paper proves beneficial in improving the recognition accuracy and stability of the systems for different databases. Furthermore, according to the experimental results, the proposed system achieves superior recognition rates compared to related research in speech emotion recognition.
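
The local feature learning step described in the abstract can be illustrated with a minimal pure-Python sketch of one block applied to a single-channel feature sequence. This is not the authors' implementation: the abstract only states that LFLBs are composed of one-dimensional CNNs, so the ReLU activation, the max pooling step, the kernel values, and all function names here are assumptions chosen for illustration. The LSTM and fully connected layers that follow are omitted.

```python
from typing import List


def conv1d_valid(x: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """Slide one kernel over x with stride 1 and no padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(x) - k + 1)
    ]


def relu(x: List[float]) -> List[float]:
    """Element-wise rectifier (assumed activation, not taken from the paper)."""
    return [max(0.0, v) for v in x]


def max_pool1d(x: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def lflb(x: List[float], kernel: List[float], pool_size: int) -> List[float]:
    """One hypothetical local feature learning block: conv -> ReLU -> max pool."""
    return max_pool1d(relu(conv1d_valid(x, kernel)), pool_size)


def output_length(n: int, kernel_size: int, pool_size: int) -> int:
    """Sequence length after one block, for planning how many blocks to stack."""
    return (n - kernel_size + 1) // pool_size


# Toy stand-in for one MFCC coefficient tracked over 8 frames.
frames = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 0.0, 2.0]
first = lflb(frames, kernel=[1.0, -1.0], pool_size=2)
second = lflb(first, kernel=[1.0, -1.0], pool_size=2)
print(first, second)
```

Each block shortens the time axis, so the sequence passed on to a recurrent layer is a compressed summary of local patterns rather than the raw frames. In a real system the kernels would be learned and applied across all MFCC channels at once.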


Pages: 21
