Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition

Cited by: 59
Authors
Atila, Orhan [1]
Sengur, Abdulkadir [1]
Affiliations
[1] Firat Univ, Dept Elect & Elect Engn, Fac Technol, TR-23119 Elazig, Turkey
Keywords
Speech emotion recognition; Attention; 3D CNN-LSTM model; Fractal dimension; Feature selection
DOI
10.1016/j.apacoust.2021.108260
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes a novel approach to speech-based emotion recognition built on an attention-guided 3D convolutional neural network (CNN)-long short-term memory (LSTM) model, which is trained end to end. The input speech signals are first resampled and pre-processed to remove noise and emphasize the high frequencies. Spectrogram, Mel-frequency cepstral coefficient (MFCC), cochleagram and fractal dimension methods are then used to convert the speech signals into speech images. The resulting images are concatenated into four-dimensional volumes and fed to the 28-layer attention-integrated 3D CNN-LSTM model. The model comprises six 3D convolutional layers, two batch normalization (BN) layers, five Rectified Linear Unit (ReLU) layers, three 3D max pooling layers, one attention layer, one LSTM layer, one flatten layer, one dropout layer and two fully connected layers. The attention layer is connected to the 3D convolutional layers. Three datasets, namely the Ryerson Audio-Visual Database of Emotional Speech (RAVDESS), RML and SAVEE, are used in the experiments, together with a mixture of the three. The method is evaluated with classification accuracy, sensitivity, specificity and F1-score. A comparison with recently published results shows that the proposed method outperforms the compared methods. © 2021 Elsevier Ltd. All rights reserved.
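The abstract mentions emphasizing the high frequencies and stacking several feature images into one input volume. The following is a minimal sketch of those two steps under stated assumptions, not the authors' implementation. The first-order pre-emphasis filter, the coefficient 0.97, and the function names `pre_emphasis` and `stack_feature_images` are illustrative choices that do not come from the paper. The feature extractors themselves are not shown, and the exact volume layout used by the authors is not given in the abstract.

```python
import numpy as np


def pre_emphasis(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    The first sample is passed through unchanged. The value 0.97 is a
    common default and an assumption here, not a value from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal
    return np.concatenate(([signal[0]], signal[1:] - coeff * signal[:-1]))


def stack_feature_images(images):
    """Stack equally shaped 2D feature images along a new leading axis.

    Hypothetical helper: the four inputs would be, for example, the
    spectrogram, MFCC, cochleagram and fractal-dimension images after
    they have been resized to a common shape.
    """
    arrays = [np.asarray(img) for img in images]
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        raise ValueError(f"all feature images must share one shape, got {shapes}")
    return np.stack(arrays, axis=0)
```

A constant signal loses most of its energy under this filter, which is the expected behaviour of a high-frequency emphasis step, and four images of shape (H, W) stack into an array of shape (4, H, W).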
Pages: 11
Related papers
50 in total
  • [1] Ensemble Learning with CNN-LSTM Combination for Speech Emotion Recognition
    Tanberk, Senem
    Tukel, Dilek Bilgin
    [J]. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION NETWORKS (ICCCN 2021), 2022, 394 : 39 - 47
  • [2] Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model
    Atmaja, Bagus Tris
    Akagi, Masato
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON SIGNALS AND SYSTEMS (ICSIGSYS), 2019, : 40 - 44
  • [3] AUV 3D Trajectory Prediction Based on CNN-LSTM
    Li, Juan
    Li, Wenbo
    [J]. PROCEEDINGS OF 2022 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION (IEEE ICMA 2022), 2022, : 1227 - 1232
  • [4] Non-diacritized Arabic speech recognition based on CNN-LSTM and attention-based models
    Alsayadi, Hamzah A.
    Abdelhamid, Abdelaziz A.
    Hegazy, Islam
    Fayed, Zaki T.
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 41 (06) : 6207 - 6219
  • [5] 3D Gait Recognition Based on a CNN-LSTM Network with the Fusion of SkeGEI and DA Features
    Liu, Yu
    Jiang, Xinghao
    Sun, Tanfeng
    Xu, Ke
    [J]. 2019 16TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2019,
  • [6] Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion
    Liu, Yanlin
    Chen, Aibin
    Zhou, Guoxiong
    Yi, Jizheng
    Xiang, Jin
    Wang, Yaru
    [J]. Multimedia Tools and Applications, 2024, 83 (21) : 59839 - 59859
  • [7] ACCURATE 3D RECONSTRUCTION FROM CIRCULAR LIGHT FIELD USING CNN-LSTM
    Song, Zhengxi
    Zhu, Hao
    Wu, Qi
    Wang, Xue
    Li, Hongdong
    Wang, Qing
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [8] Vehicle Position Prediction Using Particle Filtering Based on 3D CNN-LSTM Model
    Wang, Jiaqin
    Liu, Kai
    Gong, Yi
    [J]. IEEE TRANSACTIONS ON MOBILE COMPUTING, 2024, 23 (04) : 2992 - 3004
  • [9] Attention-Based Dense LSTM for Speech Emotion Recognition
    Xie, Yue
    Liang, Ruiyu
    Liang, Zhenlin
    Zhao, Li
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2019, E102D (07) : 1426 - 1429
  • [10] Siamese Attention-Based LSTM for Speech Emotion Recognition
    Nizamidin, Tashpolat
    Zhao, Li
    Liang, Ruiyu
    Xie, Yue
    Hamdulla, Askar
    [J]. IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2020, E103A (07) : 937 - 941