Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 0
Authors
Liu, Yang [1 ]
Sun, Haoqin [1 ]
Guan, Wenbo [1 ]
Xia, Yuqi [1 ]
Zhao, Zhen [1 ]
Affiliations
[1] School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
Source
Speech Communication | 2022, Vol. 139
Keywords
Emotion recognition; Information use; Speech recognition
DOI
Not available
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to emotion. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, comprising feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark IEMOCAP dataset demonstrate that our method gains absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively. © 2022 Elsevier B.V.
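The abstract gives only an architectural description; as a rough illustration, the sketch below shows one way the described pipeline could be wired up in PyTorch. The class names (SpeechBranch, TextBranch, MultiScaleFusionSER), all layer sizes, the frozen/trainable embedding split, and the equal-weight averaging in the decision-level fusion are assumptions made for illustration, not details taken from the paper.

```python
# Minimal illustrative sketch (not the authors' code) of the pipeline described
# in the abstract: a BLSTM + multi-head self-attention speech branch, a
# multi-channel CNN text branch, and feature-level plus decision-level fusion.
# All dimensions, layer names, and hyper-parameters below are assumptions.
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """bc-LSTM-style speech encoder: a BLSTM captures utterance-level context,
    then multi-head self-attention emphasizes emotion-relevant frames."""
    def __init__(self, feat_dim=40, hidden=128, heads=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)                   # (batch, frames, 2 * hidden)
        a, _ = self.attn(h, h, h)              # self-attention over frames
        return a.mean(dim=1)                   # utterance-level embedding

class TextBranch(nn.Module):
    """MCNN-style text encoder: a static (frozen) and a dynamic (trainable)
    embedding channel, convolved with several kernel widths."""
    def __init__(self, vocab=10000, emb=300, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.static = nn.Embedding(vocab, emb)
        self.static.weight.requires_grad = False    # "general" features
        self.dynamic = nn.Embedding(vocab, emb)     # "thematic" features
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * emb, n_filters, w) for w in widths])

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        e = torch.cat([self.static(tokens), self.dynamic(tokens)], dim=-1)
        e = e.transpose(1, 2)                  # (batch, 2 * emb, seq_len)
        pooled = [torch.relu(c(e)).max(dim=-1).values for c in self.convs]
        return torch.cat(pooled, dim=-1)       # (batch, n_filters * len(widths))

class MultiScaleFusionSER(nn.Module):
    """Feature-level fusion (concatenated embeddings -> joint classifier)
    combined with decision-level fusion (averaged class posteriors)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.speech, self.text = SpeechBranch(), TextBranch()
        self.speech_head = nn.Linear(256, n_classes)
        self.text_head = nn.Linear(300, n_classes)
        self.joint_head = nn.Linear(256 + 300, n_classes)

    def forward(self, audio_feats, tokens):
        s, t = self.speech(audio_feats), self.text(tokens)
        feature_level = self.joint_head(torch.cat([s, t], dim=-1))
        decision_level = (self.speech_head(s) + self.text_head(t)) / 2
        return (feature_level.softmax(-1) + decision_level.softmax(-1)) / 2
```

In use, audio_feats would be a batch of frame-level acoustic features (e.g., MFCCs or log-Mel filterbanks) and tokens a batch of transcript token IDs; the WA/UA gains quoted in the abstract refer to the authors' full system on IEMOCAP, not to this sketch.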
Pages: 1-9
Related papers
50 items in total
  • [21] Multi-modal feature fusion with multi-head self-attention for epileptic EEG signals
    Huang, Ning
    Xi, Zhengtao
    Jiao, Yingying
    Zhang, Yudong
    Jiao, Zhuqing
    Li, Xiaona
    [J]. Mathematical Biosciences and Engineering, 2024, 21 (08) : 6918 - 6935
  • [22] Multi-modal Emotion Recognition Based on Speech and Image
    Li, Yongqiang
    He, Qi
    Zhao, Yongping
    Yao, Hongxun
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT I, 2018, 10735 : 844 - 853
  • [23] Multi-modal Correlated Network for emotion recognition in speech
    Ren, Minjie
    Nie, Weizhi
    Liu, Anan
    Su, Yuting
    [J]. VISUAL INFORMATICS, 2019, 3 (03) : 150 - 155
  • [24] Multi-modal Emotion Recognition using Speech Features and Text Embedding
    Kim J.-H.
    Lee S.-P.
    [J]. Transactions of the Korean Institute of Electrical Engineers, 2021, 70 (01): : 108 - 113
  • [25] Multi-modal multi-head self-attention for medical VQA
    Joshi, Vasudha
    Mitra, Pabitra
    Bose, Supratik
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (14) : 42585 - 42608
  • [27] A multi-modal and multi-scale emotion-enhanced inference model based on fuzzy recognition
    Yan Yu
    Dong Qiu
    Ruiteng Yan
    [J]. Complex & Intelligent Systems, 2022, 8 : 1071 - 1084
  • [29] MMSNet: Multi-modal scene recognition using multi-scale encoded features
    Caglayan, Ali
    Imamoglu, Nevrez
    Nakamura, Ryosuke
    [J]. IMAGE AND VISION COMPUTING, 2022, 122
  • [30] Self-attention for Speech Emotion Recognition
    Tarantino, Lorenzo
    Garner, Philip N.
    Lazaridis, Alexandros
    [J]. INTERSPEECH 2019, 2019, : 2578 - 2582