A BiLSTM-Transformer and 2D CNN Architecture for Emotion Recognition from Speech

Cited by: 4
Authors
Kim, Sera [1]
Lee, Seok-Pil [2]
Affiliations
[1] Sangmyung Univ, Grad Sch, Dept Comp Sci, Seoul 03016, South Korea
[2] Sangmyung Univ, Dept Intelligent IoT, Seoul 03016, South Korea
Keywords
emotion recognition from speech; transformer; attention mechanism; bidirectional LSTM; convolutional neural network; audio feature extraction
DOI
10.3390/electronics12194034
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The significance of emotion recognition technology is continuing to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new model architecture that combines the bidirectional long short-term memory (BiLSTM)-Transformer and a 2D convolutional neural network (CNN). The BiLSTM-Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-Spectrograms to capture the spatial details of audio. To validate the proficiency of the model, the 10-fold cross-validation method is used. The methodology proposed in this study was applied to Emo-DB and RAVDESS, two major emotion recognition from speech databases, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the use of the proposed transformer-based deep learning model with appropriate feature selection can enhance performance in emotion recognition from speech.
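The evaluation protocol named in the abstract (10-fold cross-validation scored by unweighted accuracy) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that "unweighted accuracy" means the mean of per-class recalls, which is the usual convention in speech emotion recognition but is not defined in the abstract, and that folds are formed by a plain shuffled split, since the abstract does not say whether folds were stratified or speaker-independent. The function names `k_fold_indices` and `unweighted_accuracy` are hypothetical.

```python
import random
from collections import defaultdict


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation.

    Assumption: a plain shuffled split, not stratified or speaker-independent.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally.

    Assumption: this is what the abstract means by "unweighted accuracy".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Toy labels only, not data from Emo-DB or RAVDESS.
    y_true = ["angry", "angry", "angry", "sad"]
    y_pred = ["angry", "angry", "angry", "happy"]
    # Recall is 1.0 for "angry" and 0.0 for "sad".
    print(unweighted_accuracy(y_true, y_pred))
```

On the toy labels, plain accuracy would be 3/4, while the per-class average is lower because the single "sad" sample is missed. That gap is why this metric is often preferred on corpora with imbalanced emotion classes.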
Pages: 14