A Combined CNN Architecture for Speech Emotion Recognition

Cited by: 0
Authors
Begazo, Rolinson [1]
Aguilera, Ana [2, 3]
Dongo, Irvin [1, 4]
Cardinale, Yudith [5]
Affiliations
[1] Univ Catolica San Pablo, Elect & Elect Engn Dept, Arequipa 04001, Peru
[2] Univ Valparaiso, Fac Ingn, Escuela Ingn Informat, Valparaiso 2340000, Chile
[3] Univ Valparaiso, Interdisciplinary Ctr Biomed Res & Hlth Engn MEDIN, Valparaiso 2340000, Chile
[4] Univ Bordeaux, ESTIA Inst Technol, F-64210 Bidart, France
[5] Univ Int Valencia, Grp Invest Ciencia Datos, Valencia 46002, Spain
Keywords
speech emotion recognition; deep learning; spectral features; spectrogram imaging; feature fusion; convolutional neural network; NEURAL-NETWORKS; FEATURES; CORPUS
DOI
10.3390/s24175797
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, most notably in the quantity and diversity of data when deep learning techniques are used. The lack of a standard in feature selection leads to continuous development and experimentation. Choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach and developing preprocessing and feature selection stages, while constructing a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images, the weighted accuracy reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each input type when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, achieves an accuracy of 96%.
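The fusion idea in the abstract (a CNN1D branch for spectral features, a CNN2D branch for spectrogram images, and an MLP on the combined representation) can be illustrated with a minimal, untrained sketch in plain Python. This is not the authors' architecture: the filter count, kernel sizes, hidden width, number of classes, pooling choice, and input shapes below are all hypothetical, the weights are random, and concatenation is only one possible way to fuse the two branches.

```python
import math
import random

random.seed(0)  # reproducible toy weights

def conv1d(x, kernel):
    """Valid 1D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def conv2d(img, kernel):
    """Valid 2D convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            s = sum(img[r + a][c + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out

def dense(x, weights, biases):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

N_FILTERS = 4   # hypothetical number of filters per branch
N_CLASSES = 3   # hypothetical number of emotion classes
N_HIDDEN = 8    # hypothetical MLP hidden width

k1d = rand_matrix(N_FILTERS, 3)                      # 1D kernels of width 3
k2d = [rand_matrix(3, 3) for _ in range(N_FILTERS)]  # 3x3 2D kernels
w_hidden = rand_matrix(N_HIDDEN, 2 * N_FILTERS)
b_hidden = [0.0] * N_HIDDEN
w_out = rand_matrix(N_CLASSES, N_HIDDEN)
b_out = [0.0] * N_CLASSES

def predict(spectral, spectrogram):
    # Branch 1: 1D convolution over the spectral feature vector,
    # then global max pooling per filter.
    f1 = [max(conv1d(spectral, k)) for k in k1d]
    # Branch 2: 2D convolution over the spectrogram image,
    # then global max pooling per filter.
    f2 = [max(max(row) for row in conv2d(spectrogram, k)) for k in k2d]
    # Fusion: concatenate both embeddings and classify with a small MLP.
    fused = f1 + f2
    hidden = [max(0.0, h) for h in dense(fused, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))

# Toy inputs standing in for one utterance (random values, not real audio).
spectral = [random.random() for _ in range(20)]
spectrogram = [[random.random() for _ in range(16)] for _ in range(16)]
probs = predict(spectral, spectrogram)
```

Here `probs` holds one probability per hypothetical emotion class. In a trained model the two branches would be optimized jointly, so that the MLP can weigh evidence from both input types rather than relying on either one in isolation.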
Pages: 39
Related papers
50 in total
  • [1] BLSTM and CNN Stacking Architecture for Speech Emotion Recognition
    Dongdong Li
    Linyu Sun
    Xinlei Xu
    Zhe Wang
    Jing Zhang
    Wenli Du
    [J]. Neural Processing Letters, 2021, 53 : 4097 - 4115
  • [3] Speech Emotion Recognition Using CNN
    Huang, Zhengwei
    Dong, Ming
    Mao, Qirong
    Zhan, Yongzhao
    [J]. PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014 : 801 - 804
  • [4] Experimental Evaluation of CNN Architecture for Speech Recognition
    Haque, Md Amaan
    Verma, Abhishek
    Alex, John Sahaya Rani
    Venkatesan, Nithya
    [J]. FIRST INTERNATIONAL CONFERENCE ON SUSTAINABLE TECHNOLOGIES FOR COMPUTATIONAL INTELLIGENCE, 2020, 1045 : 507 - 514
  • [5] A BiLSTM-Transformer and 2D CNN Architecture for Emotion Recognition from Speech
    Kim, Sera
    Lee, Seok-Pil
    [J]. ELECTRONICS, 2023, 12 (19)
  • [7] Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion
    Liu, Yanlin
    Chen, Aibin
    Zhou, Guoxiong
    Yi, Jizheng
    Xiang, Jin
    Wang, Yaru
    [J]. Multimedia Tools and Applications, 2024, 83 (21) : 59839 - 59859
  • [8] COMPACT GRAPH ARCHITECTURE FOR SPEECH EMOTION RECOGNITION
    Shirian, Amir
    Guha, Tanaya
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021 : 6284 - 6288
  • [9] NEURAL ARCHITECTURE SEARCH FOR SPEECH EMOTION RECOGNITION
    Wu, Xixin
    Hu, Shoukang
    Wu, Zhiyong
    Liu, Xunying
    Meng, Helen
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022 : 6902 - 6906
  • [10] Learning Salient Features for Speech Emotion Recognition Using CNN
    Liu, Jiamu
    Han, Wenjing
    Ruan, Huabin
    Chen, Xiaomin
    Jiang, Dongmei
    Li, Haifeng
    [J]. 2018 FIRST ASIAN CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII ASIA), 2018