Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution

Cited by: 16
Authors
Kakuba, Samuel [1 ,2 ]
Poulose, Alwin [3 ]
Han, Dong Seog [4 ]
Affiliations
[1] Kyungpook Natl Univ, Grad Sch Elect & Elect Engn, Daegu 41566, South Korea
[2] Kabale Univ, Fac Engn Technol Appl Design & Fine Art, Kabale, Uganda
[3] Univ Michigan, Dept Elect & Comp Engn, Dearborn, MI 48128 USA
[4] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea
Keywords
Computational modeling; Convolution; Feature extraction; Emotion recognition; Speech recognition; Deep learning; Task analysis; multi-head attention; residual dilated causal convolution; LSTM
DOI
10.1109/ACCESS.2022.3223705
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The success of deep learning in speech emotion recognition has led to its application on resource-constrained devices, in human-to-machine interaction applications such as social living assistance, authentication, health monitoring, and alertness systems. To ensure a good user experience, deep learning models must be robust, accurate, and computationally efficient. Recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent units (GRU), and their variants, which operate sequentially, are often used to learn the time-series structure of the signal and to analyze long-term dependencies and utterance context in speech. However, their sequential operation makes training slow to converge, memory-intensive, and prone to the vanishing gradient problem. In addition, they do not consider spatial cues that may exist in the speech signal. We therefore propose an attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model achieves comparable performance by capturing globally contextualized long-term dependencies between features in parallel through a large receptive field, with only a small increase in parameters relative to the number of layers, while also considering spatial cues among the speech features. Spectral and voice quality features extracted from the raw speech signals are used as inputs. The ABMD model obtained recognition accuracies and F1 scores of 93.75% and 92.50% on the SAVEE dataset, 85.89% and 85.34% on the RAVDESS dataset, and 95.93% and 95.83% on the EMODB dataset. Validated on the same datasets, the model also improved robustness in terms of the confusion ratios of individual discrete emotions, especially happiness, which is often confused with emotions that lie in the same dimensional plane.
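The abstract's core architectural idea, residual dilated causal convolution blocks whose outputs are related globally by multi-head attention, can be illustrated with a short sketch. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the channel count, kernel width, dilation schedule, and number of attention heads are illustrative guesses, and the actual ABMD model combines these pieces with further DC layers and other components.

```python
import torch
import torch.nn as nn


class ResidualDilatedCausalConv(nn.Module):
    """One RDCC-style block: a dilated causal 1-D convolution with a residual skip.

    Causality is enforced by left-padding the sequence and convolving without
    right padding, so each output frame depends only on current and past
    frames. Stacking blocks with dilations 1, 2, 4, ... grows the receptive
    field exponentially with depth while parameters grow only linearly.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # Left padding of (kernel_size - 1) * dilation keeps the conv causal.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        y = nn.functional.pad(x, (self.pad, 0))  # pad the past side only
        y = self.act(self.norm(self.conv(y)))
        return x + y  # residual skip eases gradient flow in deep stacks


if __name__ == "__main__":
    # Hypothetical input: 40 feature channels (e.g., spectral and
    # voice-quality features) over 300 frames, batch of 8.
    x = torch.randn(8, 40, 300)

    # Dilations 1, 2, 4 with kernel size 3 give a receptive field of
    # 1 + 2*(1 + 2 + 4) = 15 frames from just three small convolutions.
    rdcc = nn.Sequential(*[ResidualDilatedCausalConv(40, 3, d) for d in (1, 2, 4)])
    feats = rdcc(x)  # (8, 40, 300)

    # Multi-head self-attention then relates all frames to each other in
    # parallel, providing the globally contextualized dependencies the
    # abstract describes.
    attn = nn.MultiheadAttention(embed_dim=40, num_heads=4, batch_first=True)
    seq = feats.transpose(1, 2)  # (batch, time, channels)
    out, _ = attn(seq, seq, seq)
    print(out.shape)  # torch.Size([8, 300, 40])
```

The sketch mirrors the abstract's efficiency claim: the dilated stack widens the receptive field exponentially at linear parameter cost, and attention processes all frames in parallel rather than sequentially as an RNN would.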
Pages: 122302-122313
Page count: 12