When Old Meets New: Emotion Recognition from Speech Signals

Times Cited: 11
Authors
Arano, Keith April [1 ]
Gloor, Peter [2 ]
Orsenigo, Carlotta [1 ]
Vercellis, Carlo [1 ]
Affiliations
[1] Politecn Milan, Dept Management Econ & Ind Engn, I-20156 Milan, Italy
[2] MIT, Ctr Collect Intelligence, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
Speech emotion recognition; Machine learning; Deep learning; Sentiment analysis; Model
DOI
10.1007/s12559-021-09865-2
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications that can be found in several domains, such as biomedical diagnostics in healthcare and human-machine interactions. Recent works in SER have been focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. By using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 in a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than the previously published results. Interestingly, MFCCs taken as unique input into a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results reveal that the proposed approaches lead to an improvement in prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for the task of emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
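The pipeline described above begins with MFCC extraction before any classifier (SVM or LSTM) is applied. As a rough illustration only, not the authors' code, a minimal NumPy-only MFCC extractor is sketched below: frame the signal, take the power spectrum, apply a triangular mel filterbank, take the log, and decorrelate with a DCT-II. The frame length, hop size, and filter counts are illustrative defaults; practical work would typically use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # Frame the signal with a Hamming window
    window = np.hamming(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.array([signal[s:s + n_fft] * window for s in starts])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the mel axis; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

# One second of a 440 Hz tone at 16 kHz yields a (frames, n_ceps) feature matrix
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (61, 13)
```

In the hybrid variant of the paper, features of this kind would be concatenated with spectrogram embeddings from a pretrained CNN and passed to an SVM; in the MFCC-LSTM variant, the per-frame coefficient sequence itself is the network input.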
Pages: 771 - 783
Page Count: 13
Related Papers
50 records in total
  • [31] A Mobile Emotion Recognition System Based on Speech Signals and Facial Images
    Wu, Yu-Hao
    Lin, Shu-Jing
    Yang, Don-Lin
    [J]. 2013 INTERNATIONAL COMPUTER SCIENCE AND ENGINEERING CONFERENCE (ICSEC), 2013, : 212 - 217
  • [32] Cat swarm optimized ensemble technique for emotion recognition in speech signals
    Butta, Rajasekhar
    Maddu, Kamaraju
    Vangala, Sumalatha
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (27):
  • [33] Multi-modal emotion recognition using EEG and speech signals
    Wang, Qian
    Wang, Mou
    Yang, Yan
    Zhang, Xiaolei
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2022, 149
  • [34] Emotion Recognition Based on EMD-Wavelet Analysis of Speech Signals
    Shahnaz, C.
    Sultanas, S.
    Fattah, S. A.
    Rafi, R. H. M.
    Ahmmed, I.
    Zhu, W. -P.
    Ahmad, M. O.
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2015, : 307 - 310
  • [35] Adjuvant! When the new world meets the old world
    Huober, J.
    Thuerlimann, B.
    [J]. LANCET ONCOLOGY, 2009, 10 (11): : 1028 - 1029
  • [36] Emotion recognition from speech signals using digital features optimization by diversity measure fusion
    Konduru, Ashok Kumar
    Iqbal, J. L. Mazher
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (01) : 2547 - 2572
  • [37] Autologous transplant for myeloma: when the old meets the new
    Gay, Francesca
    Genuardi, Mariella
    Boccadoro, Mario
    [J]. ONCOTARGET, 2017, 8 (53) : 90618 - 90619
  • [38] Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals
    Pravena D.
    Govind D.
    [J]. Springer Science and Business Media, LLC, 20: 787 - 797
  • [39] WHEN FACE RECOGNITION MEETS OCCLUSION: A NEW BENCHMARK
    Huang, Baojin
    Wang, Zhongyuan
    Wang, Guangcheng
    Jiang, Kui
    Zeng, Kangli
    Han, Zhen
    Tian, Xin
    Yang, Yuhong
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 4240 - 4244
  • [40] Speech Emotion Recognition
    Lalitha, S.
    Madhavan, Abhishek
    Bhushan, Bharath
    Saketh, Srinivas
    [J]. 2014 INTERNATIONAL CONFERENCE ON ADVANCES IN ELECTRONICS, COMPUTERS AND COMMUNICATIONS (ICAECC), 2014,