Speech emotion recognition based on improved masking EMD and convolutional recurrent neural network

Cited by: 7
Authors
Sun, Congshan [1 ]
Li, Haifeng [1 ]
Ma, Lin [1 ]
Affiliation
[1] Harbin Inst Technol, Fac Comp, Harbin, Peoples R China
Source
FRONTIERS IN PSYCHOLOGY | 2023 / Volume 13
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; empirical mode decomposition; mode mixing; convolutional neural networks; bidirectional gated recurrent units; EMPIRICAL MODE DECOMPOSITION; HILBERT SPECTRUM; SIGNAL; FEATURES;
DOI
10.3389/fpsyg.2022.1075624
Chinese Library Classification number
B84 [Psychology];
Discipline classification code
04 ; 0402 ;
Abstract
Speech emotion recognition (SER) is the key to human-computer emotion interaction. However, the nonlinear characteristics of speech emotion are variable, complex, and subtly changing. Therefore, accurate recognition of emotions from speech remains a challenge. Empirical mode decomposition (EMD), as an effective decomposition method for nonlinear non-stationary signals, has been successfully used to analyze emotional speech signals. However, the mode mixing problem of EMD affects the performance of EMD-based methods for SER. Various improved methods for EMD have been proposed to alleviate the mode mixing problem. These improved methods still suffer from the problems of mode mixing, residual noise, and long computation time, and their main parameters cannot be set adaptively. To overcome these problems, we propose a novel SER framework, named IMEMD-CRNN, based on the combination of an improved version of the masking signal-based EMD (IMEMD) and a convolutional recurrent neural network (CRNN). First, IMEMD is proposed to decompose speech. IMEMD is a novel disturbance-assisted EMD method that can determine the parameters of the masking signals according to the nature of the signals. Second, we extract 43-dimensional time-frequency features that can characterize the emotion from the intrinsic mode functions (IMFs) obtained by IMEMD. Finally, we input these features into a CRNN to recognize emotions. In the CRNN, 2D convolutional neural network (CNN) layers are used to capture nonlinear local temporal and frequency information of the emotional speech. Bidirectional gated recurrent unit (BiGRU) layers are used to further learn the temporal context information. Experiments on the publicly available TESS dataset and Emo-DB dataset demonstrate the effectiveness of our proposed IMEMD-CRNN framework. The TESS dataset consists of 2,800 utterances containing seven emotions recorded by two native English speakers. The Emo-DB dataset consists of 535 utterances containing seven emotions recorded by ten native German speakers. The proposed IMEMD-CRNN framework achieves a state-of-the-art overall accuracy of 100% for the TESS dataset over seven emotions and 93.54% for the Emo-DB dataset over seven emotions. The IMEMD alleviates the mode mixing and obtains IMFs with less noise and more physical meaning with significantly improved efficiency. Our IMEMD-CRNN framework significantly improves the performance of emotion recognition.
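The abstract's last modelling stage, in which BiGRU layers learn temporal context, can be illustrated with a small sketch. This is not the authors' implementation. It is a minimal pure-Python bidirectional GRU that assumes the standard GRU gate equations, random weights, a hidden size of 8, and 5 frames, and it feeds random 43-dimensional frames directly into the recurrent layer, skipping the CNN stage that precedes it in the paper. All function names (init_gru, gru_step, run_gru, bigru) are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_gru(input_dim, hidden_dim, rng):
    """Random weights for one GRU direction (illustrative initialisation only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    # For each gate: input weights W, recurrent weights U, bias b.
    return {g: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), [0.0] * hidden_dim)
            for g in ("z", "r", "n")}

def gru_step(params, x, h):
    """One GRU update with update gate z, reset gate r, and candidate state n."""
    def affine(gate, hidden):
        W, U, b = params[gate]
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, hidden), b)]
    z = [sigmoid(v) for v in affine("z", h)]
    r = [sigmoid(v) for v in affine("r", h)]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in affine("n", reset_h)]
    # New state interpolates between the candidate and the previous state.
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run_gru(params, frames, hidden_dim):
    """Run one direction over the sequence and return the state at every frame."""
    h = [0.0] * hidden_dim
    states = []
    for x in frames:
        h = gru_step(params, x, h)
        states.append(h)
    return states

def bigru(frames, fwd, bwd, hidden_dim):
    """Concatenate forward states with time-aligned backward states."""
    forward = run_gru(fwd, frames, hidden_dim)
    backward = run_gru(bwd, frames[::-1], hidden_dim)[::-1]
    return [f + b for f, b in zip(forward, backward)]

rng = random.Random(0)
FEATURE_DIM = 43  # feature dimension stated in the abstract
HIDDEN_DIM = 8    # assumption, not from the source
NUM_FRAMES = 5    # assumption, not from the source

# Random stand-in for per-frame time-frequency features (not real speech data).
frames = [[rng.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_FRAMES)]
fwd = init_gru(FEATURE_DIM, HIDDEN_DIM, rng)
bwd = init_gru(FEATURE_DIM, HIDDEN_DIM, rng)

context = bigru(frames, fwd, bwd, HIDDEN_DIM)
print(len(context), len(context[0]))  # 5 16
```

Each output vector has twice the hidden size because the forward half summarises the frames up to that point and the backward half summarises the frames from that point onward, which is what lets a classifier see context on both sides of every frame.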
Pages: 14