Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Cited by: 169
Authors
Meng, Hao [1 ]
Yan, Tianhao [1 ]
Yuan, Fei [1 ]
Wei, Hongwei [1 ]
Affiliations
[1] Harbin Engn Univ, Inst Robot & Intelligent Control, Coll Automat, Harbin 150001, Heilongjiang, Peoples R China
Source
IEEE ACCESS | 2019, Vol. 7
Keywords
3-D Log-Mel; dilated CNN; residual block; center loss; BiLSTM; attention mechanism; FEATURES;
DOI
10.1109/ACCESS.2019.2938007
CLC number
TP [Automation and Computer Technology];
Subject classification code
0812 ;
Abstract
Speech emotion recognition (SER) is a vital and challenging task in which feature extraction plays a significant role in performance. Motivated by advances in deep learning, we focus on end-to-end structures and validate that the proposed algorithm is highly effective. In this paper, we introduce a novel architecture, ADRNN (a dilated CNN with residual blocks and BiLSTM based on the attention mechanism), for speech emotion recognition. It leverages the strengths of these diverse networks while overcoming the shortcomings of using each alone, and is evaluated on the popular IEMOCAP database and the Berlin EMODB corpus. The dilated CNN gives the model larger receptive fields than pooling layers would. Skip connections retain more information from the shallow layers, and BiLSTM layers are adopted to learn long-term dependencies from the learned local features. We further utilize the attention mechanism to enhance the extraction of speech features. In addition, we improve the loss function by combining softmax with the center loss, which achieves better classification performance. Since the emotional dialogues are transformed into spectrograms, we extract 3-D Log-Mel spectrum values from the raw signals, feed them into the proposed algorithm, and obtain notable performance: 74.96% unweighted accuracy in the speaker-dependent experiment and 69.32% unweighted accuracy in the speaker-independent experiment, surpassing the 64.74% of previous state-of-the-art methods on the spontaneous emotional speech of the IEMOCAP database. Moreover, the proposed networks achieve recognition accuracies of 90.78% and 85.39% on Berlin EMODB in the speaker-dependent and speaker-independent experiments respectively, better than the accuracies of 88.30% and 82.82% obtained in previous work.
To validate robustness and generalization, we also perform a cross-corpus experiment between the above databases and obtain a favorable final recognition accuracy of 63.84%.
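The 3-D Log-Mel input described in the abstract stacks the static log-Mel spectrogram with its first- and second-order temporal derivatives (delta and delta-delta) as three channels. The sketch below illustrates that feature construction with NumPy only; the parameter choices (40 Mel bands, 512-point FFT, 256-sample hop, Hann window, HTK-style Mel scale, regression-based deltas) are common defaults assumed for illustration, not values taken from the paper.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels=40, fmin=0.0, fmax=None):
    """Triangular Mel filterbank using the HTK-style Mel scale (assumed)."""
    fmax = fmax or sr / 2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter center frequencies, equally spaced on the Mel scale.
    hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):            # rising slope
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):            # falling slope
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def deltas(x, width=2):
    """Regression-based deltas along the time axis (frames x mels)."""
    pad = np.pad(x, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    n = len(x)
    return sum(i * (pad[width + i:width + i + n] - pad[width - i:width - i + n])
               for i in range(1, width + 1)) / denom

def log_mel_3d(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Static log-Mel plus delta and delta-delta, stacked as 3 channels."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # power spectrum
    static = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.stack([static, d1, d2])   # shape: (3, frames, n_mels)

# One second of a 440 Hz tone as a stand-in for an emotional utterance.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feat = log_mel_3d(x)
print(feat.shape)  # (3, 61, 40)
```

The resulting `(channels, time, frequency)` tensor is the natural input for a 2-D CNN front end such as the dilated residual CNN the paper describes.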
Pages: 125868 - 125881
Page count: 14
Related papers
50 records in total
  • [1] On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition
    Mukhamediya, Azamat
    Fazli, Siamac
    Zollanvari, Amin
    [J]. IEEE ACCESS, 2023, 11 : 61950 - 61957
  • [2] Speech-Based Emotion Analysis Using Log-Mel Spectrograms and MFCC Features
    Yetkin, Ahmet Kemal
    Kose, Hatice
    [J]. 2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU, 2023,
  • [3] Emotion recognition from speech using deep learning on spectrograms
    Li, Xingguang
    Song, Wenjun
    Liang, Zonglin
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 39 (03) : 2791 - 2796
  • [4] Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network
    Badshah, Abdul Malik
    Ahmad, Jamil
    Rahim, Nasir
    Baik, Sung Wook
    [J]. 2017 INTERNATIONAL CONFERENCE ON PLATFORM TECHNOLOGY AND SERVICE (PLATCON), 2017, : 125 - 129
  • [5] Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms
    Satt, Aharon
    Rozenberg, Shai
    Hoory, Ron
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1089 - 1093
  • [6] Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms
    Ma, Xi
    Wu, Zhiyong
    Jia, Jia
    Xu, Mingxing
    Meng, Helen
    Cai, Lianhong
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 3683 - 3687
  • [7] AN EXPLORATION OF LOG-MEL SPECTROGRAM AND MFCC FEATURES FOR ALZHEIMER'S DEMENTIA RECOGNITION FROM SPONTANEOUS SPEECH
    Meghanani, Amit
    Anoop, C. S.
    Ramakrishnan, A. G.
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 670 - 677
  • [8] Leveraged Mel Spectrograms Using Harmonic and Percussive Components in Speech Emotion Recognition
    Rudd, David Hason
    Huo, Huan
    Xu, Guandong
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2022, PT II, 2022, 13281 : 392 - 404
  • [9] Recognition of Emotion with Intensity from Speech Signal Using 3D Transformed Feature and Deep Learning
    Islam, Md Riadul
    Akhand, M. A. H.
    Kamal, Md Abdus Samad
    Yamada, Kou
    [J]. ELECTRONICS, 2022, 11 (15)
  • [10] Heart Sound Classification Using Deep Learning Techniques Based on Log-mel Spectrogram
    Nguyen, Minh Tuan
    Lin, Wei Wen
    Huang, Jin H.
    [J]. Circuits, Systems, and Signal Processing, 2023, 42 : 344 - 360