When Old Meets New: Emotion Recognition from Speech Signals

Cited by: 11

Authors:

Arano, Keith April [1]

Gloor, Peter [2]

Orsenigo, Carlotta [1]

Vercellis, Carlo [1]

Affiliations:

[1] Politecn Milan, Dept Management Econ & Ind Engn, I-20156 Milan, Italy

[2] MIT, Ctr Collect Intelligence, 77 Massachusetts Ave, Cambridge, MA 02139 USA

Source:

Cognitive Computation | 2021 / Vol. 13 / Issue 03

Keywords:

Speech emotion recognition; Machine learning; Deep learning; SENTIMENT ANALYSIS; MODEL;

DOI:

10.1007/s12559-021-09865-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications that can be found in several domains, such as biomedical diagnostics in healthcare and human-machine interactions. Recent works in SER have focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. By using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 in a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, which is higher than the previously published results. Interestingly, MFCCs used as the sole input to a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results reveal that the proposed approaches lead to an improvement in prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for the task of emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
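
The hybrid feature set mentioned in the abstract combines MFCCs with CNN-derived spectrogram features before classification. The following is only a minimal sketch of that kind of feature-level fusion, not the authors' implementation. The abstract does not say how the two feature blocks are scaled or joined, so the per-column z-scoring and plain concatenation here are assumptions, as are the function names `zscore_columns` and `fuse_features` and the toy numbers standing in for real MFCC and CNN vectors. The fused rows would then be passed to a classifier such as an SVM, which is not shown.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list of rows to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def fuse_features(mfcc_rows, cnn_rows):
    """Scale each feature block separately, then concatenate them per utterance.

    Assumption: one MFCC vector and one CNN vector per utterance, in the same order.
    """
    if len(mfcc_rows) != len(cnn_rows):
        raise ValueError("both feature blocks must describe the same utterances")
    mfcc_std = zscore_columns(mfcc_rows)
    cnn_std = zscore_columns(cnn_rows)
    return [m + c for m, c in zip(mfcc_std, cnn_std)]


# Toy placeholder values: 3 utterances, 2 "MFCC" dims and 3 "CNN" dims each.
mfcc = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
cnn = [[0.0, 2.0, 4.0], [1.0, 2.0, 0.0], [2.0, 2.0, 2.0]]
hybrid = fuse_features(mfcc, cnn)
```

Scaling the blocks separately keeps a high-dimensional or large-magnitude block from dominating the distance computations of a kernel classifier; whether the paper does this is not stated in the abstract.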

Pages: 771-783

Page count: 13
