Speech Emotion Recognition (SER) identifies human emotion from short speech signals, enabling natural Human-Computer Interaction (HCI). Accurate emotion prediction is essential for real-time interactive applications, where inaccurate predictions can create frustrating user experiences. In addition to linguistic cues, human speech carries numerous hidden features, such as cepstral, prosodic, and spectrogram features, that help determine emotion. Relying on only one set of features, whether cepstral, prosodic, or spectrogram, does not classify emotion accurately. Moreover, the machine learning models and artificial neural network architectures used in existing SER work can individually handle either temporal cues or spatial cues, but not both. This research proposes a Neural Network-based Blended Ensemble Learning (NNBEL) model, which stacks the predictions of individual neural networks that together cover both temporal and spatial cues. The proposed model ensembles state-of-the-art neural network architectures, namely a 1-Dimensional Convolutional Neural Network (1D-CNN), Long Short-Term Memory (LSTM), and CapsuleNets. The first two architectures are especially suitable for time-series data such as speech, while CapsuleNets are well suited to capturing spatial speech cues. The log mel-spectrogram is fed to the LSTM, and Mel-Frequency Cepstral Coefficients (MFCCs) are fed to the 1D-CNN and CapsuleNets. The emotions predicted by each of these networks are then fed to a Multi-Layer Perceptron (MLP) at the next level to predict the final emotion. The rationale for blended ensemble learning is that the first layer of NNBEL, comprising the 1D-CNN, LSTM, and CapsuleNets, performs coarse classification, while the second layer, the MLP meta-classifier, fine-tunes the classification. The NNBEL model and the individual base models are evaluated on the RAVDESS and IEMOCAP datasets.
The proposed model achieves a classification accuracy of 95.3% on RAVDESS and 94% on IEMOCAP, outperforming both the base models and existing models in the literature. The confusion matrices also show a clear improvement in distinguishing between emotions.
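The two-level stacking scheme described above can be sketched as follows. This is a minimal NumPy illustration of the data flow only, not the authors' implementation: random weights stand in for the trained base learners and the MLP meta-classifier, and four emotion classes and eight utterances are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES, N_CLASSES = 8, 4  # hypothetical: 8 utterances, 4 emotion classes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for the three base learners' class-probability outputs
# (in the paper: 1D-CNN and CapsuleNets on MFCCs, LSTM on log mel-spectrograms).
p_cnn = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))
p_lstm = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))
p_caps = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))

# Level 1 (coarse classification): the base predictions are stacked
# into a meta-feature vector per utterance.
meta_features = np.concatenate([p_cnn, p_lstm, p_caps], axis=1)  # (8, 12)

# Level 2 (fine-tuning): a small MLP meta-classifier maps the stacked
# predictions to the final emotion (random weights stand in for training).
W1 = rng.normal(size=(3 * N_CLASSES, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, N_CLASSES)); b2 = np.zeros(N_CLASSES)
hidden = np.maximum(meta_features @ W1 + b1, 0.0)  # ReLU hidden layer
final_probs = softmax(hidden @ W2 + b2)
final_emotion = final_probs.argmax(axis=1)
```

In a trained system, the base networks would first be fit on the speech features and the MLP would then be fit on their held-out predictions; the sketch shows only how the level-1 outputs become the level-2 inputs.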