Recognition of Emotions in Speech Using Convolutional Neural Networks on Different Datasets

Cited by: 8
Authors
Zielonka, Marta [1 ]
Piastowski, Artur [1 ]
Czyzewski, Andrzej [1 ]
Nadachowski, Pawel [1 ]
Operlejn, Maksymilian [1 ]
Kaczor, Kamil [1 ]
Affiliations
[1] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Gabriela Narutowicza 11-12, PL-80233 Gdansk, Poland
Keywords
speech emotion recognition; SER; machine learning; artificial intelligence; classification; convolutional neural networks
DOI
10.3390/electronics11223831
CLC Number
TP [Automation and computer technology]
Subject Classification Code
0812
Abstract
Artificial Neural Network (ANN) models, specifically Convolutional Neural Networks (CNNs), were applied to recognize emotions from spectrograms and mel-spectrograms. This study uses both representations to investigate which feature extraction method represents emotions better and how large the resulting differences in classification efficiency are. The experiments demonstrated that mel-spectrograms are the better-suited input for training CNN-based speech emotion recognition (SER) models. Five popular datasets were employed: the Crowdsourced Emotional Multimodal Actors Dataset (CREMA-D), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Surrey Audio-Visual Expressed Emotion (SAVEE), the Toronto Emotional Speech Set (TESS), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Six emotion classes were used: happiness, anger, sadness, fear, disgust, and neutral; however, some experiments recognized only four emotions because of the characteristics of the IEMOCAP dataset. Classification efficiency was compared across datasets, and an attempt was made to develop a universal model trained on all of them, which yielded an accuracy of 55.89% when recognizing four emotions. The most accurate six-emotion model, trained on a combination of four datasets (CREMA-D, RAVDESS, SAVEE, TESS), achieved 57.42% accuracy. Moreover, a further study demonstrated that improper division of data into training and test sets significantly influences the test accuracy of CNNs; this problem of inappropriate data division, which has affected results reported in the literature, was therefore addressed extensively. The experiments also employed the popular ResNet18 architecture to demonstrate the reliability of the results and to show that these problems are not unique to the custom CNN architecture proposed here. Finally, the label correctness of the CREMA-D dataset was assessed with a prepared questionnaire.
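The record does not include the authors' code. The following is a minimal sketch of the two technical points raised in the abstract, assuming librosa and scikit-learn (neither toolkit is named in the source): computing linear-frequency vs. mel-spectrogram features for a CNN, and splitting data so that no speaker appears in both the training and test sets, which is one common form of the improper data division the abstract warns about. All parameter values (sampling rate, FFT size, number of mel bands) are illustrative, not taken from the paper.

# Illustrative sketch only; not the authors' implementation.
import numpy as np
import librosa
from sklearn.model_selection import GroupShuffleSplit

SR = 16000      # assumed sampling rate
N_FFT = 1024    # illustrative STFT window size
HOP = 256       # illustrative hop length
N_MELS = 64     # illustrative number of mel bands

def linear_spectrogram_db(y):
    """Log-power linear-frequency spectrogram (dB)."""
    S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)) ** 2
    return librosa.power_to_db(S, ref=np.max)

def mel_spectrogram_db(y, sr=SR):
    """Log-power mel-spectrogram (dB), the representation found to work better."""
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    return librosa.power_to_db(S, ref=np.max)

def speaker_independent_split(features, labels, speaker_ids, test_size=0.2):
    """Split so that no speaker appears in both training and test sets."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=0)
    train_idx, test_idx = next(gss.split(features, labels, groups=speaker_ids))
    return train_idx, test_idx

if __name__ == "__main__":
    # Synthetic 1-second tone as a stand-in for a real utterance.
    t = np.linspace(0, 1, SR, endpoint=False)
    y = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
    print("linear spectrogram:", linear_spectrogram_db(y).shape)
    print("mel-spectrogram:   ", mel_spectrogram_db(y, SR).shape)

    # Toy metadata: 10 utterances from 5 speakers, 6 emotion classes.
    X = np.random.rand(10, N_MELS, 63)
    labels = np.random.randint(0, 6, size=10)
    speakers = np.repeat(np.arange(5), 2)
    tr, te = speaker_independent_split(X, labels, speakers)
    assert not set(speakers[tr]) & set(speakers[te])  # no speaker overlap

Grouping utterances by speaker ID before splitting means the reported test accuracy reflects generalization to unseen speakers rather than memorization of speaker identity, which is the kind of leakage a naive random split over utterances can introduce.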
Pages: 12