A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition

Cited by: 40
Authors
Chen, Ming [1 ,2 ]
Zhao, Xudong [2 ]
Affiliations
[1] Zhejiang Univ, 38 Zheda Rd, Hangzhou, Peoples R China
[2] Hithink RoyalFlush Informat Network Co Ltd, Hangzhou, Peoples R China
Source
INTERSPEECH 2020
Keywords
speech emotion recognition; bimodal; multi-scale fusion strategy; feature fusion; ensemble learning; features
DOI
10.21437/Interspeech.2020-3156
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Speech emotion recognition (SER) is a challenging task that requires learning suitable features to achieve good performance. The development of deep learning techniques makes it possible to extract features automatically rather than construct hand-crafted ones. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER using speech and text information. A speech model (smodel), which combines a convolutional neural network (CNN), a bi-directional long short-term memory (Bi-LSTM) network and an attention mechanism, is proposed to learn a speech representation from the log-mel spectrogram extracted from the speech data. Specifically, the CNN layers learn local correlations, the Bi-LSTM layer then captures long-term dependencies and contextual information, and a multi-head self-attention layer makes the model focus on the features most related to emotion. A text model (tmodel) based on a pre-trained ALBERT model learns a text representation from the text data. Finally, a multi-scale fusion strategy combining feature fusion and ensemble learning improves the overall performance. Experiments conducted on the public emotion dataset IEMOCAP show that the proposed STSER achieves comparable recognition accuracy with fewer feature inputs.
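To make the described pipeline concrete, below is a minimal PyTorch sketch of the speech branch (smodel) and the fusion step. All layer sizes, kernel sizes, the mean-pooling step, the four-class output, and the text embedding dimension (768, the hidden size of albert-base) are illustrative assumptions rather than the authors' published configuration; the ensemble function shows one plausible reading of "ensemble learning" as posterior averaging.

import torch
import torch.nn as nn

class SModel(nn.Module):
    """CNN -> Bi-LSTM -> multi-head self-attention over a log-mel spectrogram."""
    def __init__(self, n_mels=40, hidden=128, heads=4, n_classes=4):
        super().__init__()
        # CNN layers learn local time-frequency correlations.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Bi-LSTM captures long-term dependencies and contextual information.
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        # Multi-head self-attention focuses on emotion-relevant frames.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, logmel):                 # logmel: (batch, 1, n_mels, frames)
        x = self.cnn(logmel)                   # (batch, 64, n_mels//4, frames//4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frame-major sequence
        x, _ = self.lstm(x)                    # (batch, t, 2*hidden)
        x, _ = self.attn(x, x, x)
        emb = x.mean(dim=1)                    # utterance-level speech embedding
        return emb, self.head(emb)             # embedding and smodel logits

class FusionHead(nn.Module):
    """Feature fusion: concatenate speech and text embeddings, then classify."""
    def __init__(self, speech_dim=256, text_dim=768, n_classes=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, speech_emb, text_emb):
        return self.fc(torch.cat([speech_emb, text_emb], dim=-1))

def ensemble(logits_list):
    # Decision-level step: average the class posteriors of the smodel, tmodel
    # and fused classifiers (one plausible reading of "ensemble learning").
    return torch.stack([torch.softmax(l, dim=-1) for l in logits_list]).mean(dim=0)

In use, text_emb would come from a pre-trained ALBERT encoder (for example its pooled output in the Hugging Face transformers library), and the logits of the smodel head, a tmodel head and the FusionHead would be averaged by ensemble at inference time.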
Pages: 374-378 (5 pages)
Related Papers (50 in total)
  • [1] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Zhao, Zhen
    [J]. SPEECH COMMUNICATION, 2022, 139 : 1 - 9
  • [2] A Lightweight Multi-Scale Model for Speech Emotion Recognition
    Li, Haoming
    Zhao, Daqi
    Wang, Jingwen
    Wang, Deqiang
    [J]. IEEE ACCESS, 2024, 12 : 130228 - 130240
  • [3] SPEECH EMOTION RECOGNITION WITH GLOBAL-AWARE FUSION ON MULTI-SCALE FEATURE REPRESENTATION
    Zhu, Wenjing
    Li, Xiang
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6437 - 6441
  • [4] EEG Emotion Recognition by Fusion of Multi-Scale Features
    Du, Xiuli
    Meng, Yifei
    Qiu, Shaoming
    Lv, Yana
    Liu, Qingli
    [J]. BRAIN SCIENCES, 2023, 13 (09)
  • [5] Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion
    Yu, Lingli
    Xu, Fengjun
    Qu, Yundong
    Zhou, Kaijun
    [J]. APPLIED ACOUSTICS, 2024, 216
  • [6] Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition
    Zheng, Wanlu
    Zheng, Wenming
    Zong, Yuan
    [J]. VIRTUAL REALITY & INTELLIGENT HARDWARE, 2021, 3 (01): 65 - 75
  • [7] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024
  • [8] Learning multi-scale features for speech emotion recognition with connection attention mechanism
    Chen, Zengzhao
    Li, Jiawen
    Liu, Hai
    Wang, Xuyang
    Wang, Hu
    Zheng, Qiuyu
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 214
  • [9] EEG emotion recognition approach using multi-scale convolution and feature fusion
    Zhang, Yong
    Shan, Qingguo
    Chen, Wenyun
    Liu, Wenzhe
    [J]. VISUAL COMPUTER, 2024