Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution

Cited by: 16
Authors:
Kakuba, Samuel [1 ,2 ]
Poulose, Alwin [3 ]
Han, Dong Seog [4 ]
Affiliations:
[1] Kyungpook Natl Univ, Grad Sch Elect & Elect Engn, Daegu 41566, South Korea
[2] Kabale Univ, Fac Engn Technol Appl Design & Fine Art, Kabale, Uganda
[3] Univ Michigan, Dept Elect & Comp Engn, Dearborn, MI 48128 USA
[4] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea
Keywords:
Computational modeling; Convolution; Feature extraction; Emotion recognition; Speech recognition; Deep learning; Task analysis; multi-head attention; residual dilated causal convolution; LSTM;
DOI: 10.1109/ACCESS.2022.3223705
Chinese Library Classification (CLC): TP [Automation Technology; Computer Technology]
Discipline Code: 0812
Abstract:
The success of deep learning in speech emotion recognition has led to its application in resource-constrained devices. It has been applied in human-to-machine interaction applications such as social living assistance, authentication, health monitoring, and alertness systems. To ensure a good user experience, robust, accurate, and computationally efficient deep learning models are necessary. Recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent units (GRU), and their variants, which operate sequentially, are often used to learn the time-series structure of the signal, analyze long-term dependencies, and capture the context of utterances in the speech signal. However, because of their sequential operation, they converge slowly, train sluggishly, consume substantial memory, and suffer from the vanishing gradient problem. In addition, they do not consider spatial cues that may exist in the speech signal. Therefore, we propose an attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model achieves comparable performance while capturing globally contextualized long-term dependencies between features in parallel, using a large receptive field with only a modest increase in the number of parameters relative to the number of layers, and it considers spatial cues among the speech features. Spectral and voice quality features extracted from the raw speech signals are used as inputs. The proposed ABMD model obtained a recognition accuracy and F1 score of 93.75% and 92.50% on the SAVEE dataset, 85.89% and 85.34% on the RAVDESS dataset, and 95.93% and 95.83% on the EMODB dataset.
The model's robustness, in terms of the confusion ratio of the individual discrete emotions, especially happiness, which is often confused with emotions that lie in the same dimensional plane, also improved when validated on the same datasets.
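To illustrate the dilated causal convolution idea the abstract builds on, the following minimal NumPy sketch shows why stacked dilated layers cover a long context in parallel: each layer's taps are spaced by a dilation factor, and the receptive field grows with the sum of the dilations while parameters grow only linearly in depth. The function names are illustrative and are not taken from the paper.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal 1-D convolution: y[t] = sum_j w[j] * x[t - j*dilation].

    Outputs depend only on current and past samples (no future leakage),
    which is the 'causal' property of the RDCC-style blocks.
    """
    k = len(w)
    pad = (k - 1) * dilation  # left-pad so output length equals input length
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated layers: 1 + sum((k - 1) * d)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# An impulse input reveals the taps: with kernel [1, 1] and dilation 2,
# the response appears at t = 0 and t = 2 only.
impulse = np.zeros(8)
impulse[0] = 1.0
response = dilated_causal_conv1d(impulse, np.array([1.0, 1.0]), dilation=2)

# Four stacked layers with kernel size 2 and dilations 1, 2, 4, 8
# already see 16 past samples.
rf = receptive_field(2, [1, 2, 4, 8])
```

With doubling dilations, doubling the context length requires only one extra layer, which is why such stacks reach long-term dependencies far more cheaply than sequential recurrence.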
Pages: 122302-122313 (12 pages)
Related Papers (50 in total; entries [41]-[50] shown):
  • [41] Gjoreski, Martin; Gjoreski, Hristijan; Kulakov, Andrea. Machine Learning Approach for Emotion Recognition in Speech. Informatica-Journal of Computing and Informatics, 2014, 38(4): 377-383.
  • [42] Wang, Xiaofei; Li, Ruizhi; Mallidi, Sri Harish; Hori, Takaaki; Watanabe, Shinji; Hermansky, Hynek. Stream Attention-Based Multi-Array End-to-End Speech Recognition. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 7105-7109.
  • [43] Liu, Yunxiang; Zhang, Kexin. Design of Efficient Speech Emotion Recognition Based on Multi Task Learning. IEEE Access, 2023, 11: 5528-5537.
  • [44] Meng, Hao; Yan, Tianhao; Wei, Hongwei; Ji, Xun. Speech Emotion Recognition Using Wavelet Packet Reconstruction with Attention-Based Deep Recurrent Neural Networks. Bulletin of the Polish Academy of Sciences-Technical Sciences, 2021, 69(1).
  • [45] Zhong, Ying; Hu, Ying; Huang, Hao; Silamu, Wushour. A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition. INTERSPEECH 2020, 2020: 3331-3335.
  • [46] Zhu, Xiaoliang; Liu, Chen; Zhao, Liang; Wang, Shengming. EEG Emotion Recognition Network Based on Attention and Spatiotemporal Convolution. Sensors, 2024, 24(11).
  • [47] Zheng, Yin; Zemel, Richard S.; Zhang, Yu-Jin; Larochelle, Hugo. A Neural Autoregressive Approach to Attention-based Recognition. International Journal of Computer Vision, 2015, 113(1): 67-79.
  • [49] Xie, Zhuyang; Chen, Junzhou; Peng, Bo. Point Clouds Learning with Attention-Based Graph Convolution Networks. Neurocomputing, 2020, 402: 245-255.
  • [50] Liu, Yang; Xia, Yuqi; Sun, Haoqin; Meng, Xiaolei; Bai, Jianxiong; Guan, Wenbo; Zhao, Zhen; Li, Yongwei. A Multitask Learning Approach Based on Cascaded Attention Network and Self-Adaption Loss for Speech Emotion Recognition. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2023, E106A(6): 876-885.