Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework

Cited by: 0
Authors
Liu, Yang [1]
Sun, Haoqin [1]
Guan, Wenbo [1]
Xia, Yuqi [1]
Zhao, Zhen [1]
Affiliation
[1] School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
Source
Speech Communication | 2022 / Volume 139
Keywords
Emotion recognition; Information use; Speech recognition
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task because of the variability in speech and emotion. In this paper, a novel method that combines a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, and the multi-head self-attention layer makes the model focus on the features most related to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve the overall performance. Experimental results on the benchmark dataset IEMOCAP demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively. © 2022 Elsevier B.V.
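The two metrics named in the abstract, and the idea of decision-level fusion, can be illustrated with a minimal Python sketch. WA is taken here as overall accuracy and UA as the mean of per-class recalls, which is the usual convention in the SER literature. The `late_fusion` function is only one possible form of decision-level fusion (a weighted average of class posteriors); the abstract does not specify the fusion rule, so the function name, the weight `w_speech`, and the toy labels are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so each emotion class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


def late_fusion(p_speech, p_text, w_speech=0.5):
    """Assumed decision-level fusion: weighted average of the two
    modalities' class posteriors, returning the index of the top class."""
    fused = [w_speech * a + (1.0 - w_speech) * b
             for a, b in zip(p_speech, p_text)]
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical toy labels (not taken from IEMOCAP).
y_true = ["ang", "ang", "ang", "hap", "neu", "sad"]
y_pred = ["ang", "ang", "ang", "neu", "neu", "hap"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
fused_class = late_fusion([0.6, 0.3, 0.1], [0.1, 0.7, 0.2])
```

On imbalanced data the two metrics diverge: the majority class inflates WA, while UA penalizes classes that are never recognized. This is why SER results are commonly reported with both.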
Pages: 1-9
Related papers
50 in total
  • [1] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Zhao, Zhen
    [J]. SPEECH COMMUNICATION, 2022, 139 : 1 - 9
  • [2] Multi-modal Attention for Speech Emotion Recognition
    Pan, Zexu
    Luo, Zhaojie
    Yang, Jichen
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 364 - 368
  • [3] A multi-modal fusion framework for continuous sign language recognition based on multi-layer self-attention mechanism
    Xue, Cuihong
    Yu, Ming
    Yan, Gang
    Qin, Mengxian
    Liu, Yuehao
    Jia, Jingli
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 43 (04) : 4303 - 4316
  • [4] IS CROSS-ATTENTION PREFERABLE TO SELF-ATTENTION FOR MULTI-MODAL EMOTION RECOGNITION?
    Rajan, Vandana
    Brutti, Alessio
    Cavallaro, Andrea
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4693 - 4697
  • [5] Multi-head attention fusion networks for multi-modal speech emotion recognition
    Zhang, Junfeng
    Xing, Lining
    Tan, Zhen
    Wang, Hongsen
    Wang, Kesheng
    [J]. COMPUTERS & INDUSTRIAL ENGINEERING, 2022, 168
  • [6] A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition
    Chen, Ming
    Zhao, Xudong
    [J]. INTERSPEECH 2020, 2020, : 374 - 378
  • [7] ATTENTION DRIVEN FUSION FOR MULTI-MODAL EMOTION RECOGNITION
    Priyasad, Darshana
    Fernando, Tharindu
    Denman, Simon
    Sridharan, Sridha
    Fookes, Clinton
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3227 - 3231
  • [8] Multi-modal Scene Recognition Based on Global Self-attention Mechanism
    Li, Xiang
    Sun, Ning
    Liu, Jixin
    Chai, Lei
    Sun, Haian
    [J]. ADVANCES IN NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, ICNC-FSKD 2022, 2023, 153 : 109 - 121
  • [9] A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition
    Hu, Dongni
    Chen, Chengxin
    Zhang, Pengyuan
    Li, Junfeng
    Yan, Yonghong
    Zhao, Qingwei
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (08) : 1391 - 1394
  • [10] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024