Neural network-based blended ensemble learning for speech emotion recognition

Cited by: 6
Authors
Yalamanchili, Bhanusree [1 ]
Samayamantula, Srinivas Kumar [2 ]
Anne, Koteswara Rao [3 ]
Affiliations
[1] VNRVJIET, Dept CSE, Hyderabad, India
[2] Jawaharlal Nehru Technol Univ, Dept ECE, Kakinada, India
[3] Kalasalingam Acad Res & Educ Tamilnadu, Dept CSE, Srivilliputhur, India
Keywords
Blended ensemble learning; Log mel-spectrogram; MFCC; Speech emotion recognition; FEATURE-SELECTION; RECURRENT; FRAMEWORK; FEATURES; SYSTEM;
DOI
10.1007/s11045-022-00845-9
Chinese Library Classification
TP301 [Theory and Methods];
Subject classification code
081202 ;
Abstract
Speech Emotion Recognition (SER) identifies human emotion from short speech signals, enabling natural Human-Computer Interaction (HCI). Accurate emotion prediction is required to build real-time interactive applications, since wrong predictions can create annoying situations for users. In addition to linguistic cues, human speech signals carry numerous hidden features, such as cepstral, prosodic, and spectrogram features, that can be used to determine emotion. Relying on only one set of features (cepstral, prosodic, or spectrogram alone) does not classify emotion accurately. The machine learning models and artificial neural network architectures used individually in existing SER work can handle either temporal cues or spatial cues, but not both. This research proposes a Neural Network-based Blended Ensemble Learning (NNBEL) model that stacks the predictions made by individual neural networks capable of handling both temporal and spatial cues. The proposed model ensembles state-of-the-art neural network architectures, viz. a 1-Dimensional Convolutional Neural Network (1D-CNN), Long Short-Term Memory (LSTM), and CapsuleNets. The first two architectures are especially suited to handling speech-like time-series data, while CapsuleNets are ideal for capturing spatial speech cues. The log mel-spectrogram is fed to the LSTM, and Mel Frequency Cepstral Coefficients (MFCCs) are fed to the 1D-CNN and CapsuleNets. The emotions predicted by each of these networks are then fed to a Multi-Layer Perceptron (MLP), which predicts the final emotion. The rationale for blended ensemble learning is to perform coarse classification in the first layer of NNBEL with the 1D-CNN, LSTM, and CapsuleNet models, and fine-grained classification in the second layer with the meta-classifier, an MLP. The NNBEL model and the individual base models are evaluated on the RAVDESS and IEMOCAP datasets.
The proposed model achieves a classification accuracy of 95.3% on RAVDESS and 94% on IEMOCAP, outperforming both the base models and existing models in the literature. The confusion matrix also shows a clear improvement in distinguishing the emotions.
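The two-level stacking scheme described in the abstract can be sketched in a few lines. This is an illustrative sketch only: the paper's base learners are a 1D-CNN, an LSTM, and CapsuleNets trained on MFCC and log mel-spectrogram features, but here simple scikit-learn classifiers and synthetic data stand in for them, so the code shows the blended-ensemble structure (base learners whose class probabilities feed an MLP meta-classifier), not the paper's actual networks.

```python
# Minimal blended/stacked ensemble sketch: level-1 base classifiers feed
# their class probabilities to a level-2 MLP meta-classifier, mirroring the
# coarse-then-fine classification idea of NNBEL.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for per-utterance emotion features (8 emotion classes,
# matching RAVDESS's eight emotion labels).
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level 1: coarse base classifiers (placeholders for 1D-CNN/LSTM/CapsuleNets).
base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Level 2: an MLP meta-classifier makes the final ("fine") prediction from
# the base learners' stacked class-probability outputs.
blended = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                  random_state=0),
    stack_method="predict_proba",
)
blended.fit(X_tr, y_tr)
print("held-out accuracy:", round(blended.score(X_te, y_te), 2))
```

Internally, `StackingClassifier` uses cross-validated predictions of the base learners to train the meta-classifier, which avoids the meta-learner simply memorizing base-learner overfitting on the training set.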
Pages: 1323-1348
Page count: 26