A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition

Cited by: 40
Authors
Chen, Ming [1 ,2 ]
Zhao, Xudong [2 ]
Affiliations
[1] Zhejiang Univ, 38 Zheda Rd, Hangzhou, Peoples R China
[2] Hithink RoyalFlush Informat Network Co Ltd, Hangzhou, Peoples R China
Source
Keywords
speech emotion recognition; bimodal; multi-scale fusion strategy; feature fusion; ensemble learning; FEATURES;
DOI
10.21437/Interspeech.2020-3156
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
Speech emotion recognition (SER) is a challenging task that requires learning suitable features to achieve good performance. Deep learning techniques make it possible to extract features automatically rather than construct them by hand. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER using speech and text information. A model named smodel, which combines a convolutional neural network (CNN), bi-directional long short-term memory (Bi-LSTM) and the attention mechanism, is proposed to learn the speech representation from the log-mel spectrogram extracted from speech data. Specifically, the CNN layers learn local correlations, the Bi-LSTM layer then learns long-term dependencies and contextual information, and the multi-head self-attention layer finally makes the model focus on the features most related to the emotions. A model named tmodel, built on a pre-trained ALBERT model, learns the text representation from text data. Finally, a multi-scale fusion strategy, including feature fusion and ensemble learning, is applied to improve the overall performance. Experiments on the public emotion dataset IEMOCAP show that the proposed STSER achieves comparable recognition accuracy with fewer feature inputs.
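The abstract names two fusion levels, feature fusion and ensemble learning, without giving their details. The sketch below is only an illustration under stated assumptions, not the authors' implementation: feature fusion is assumed to be concatenation of per-utterance speech and text embeddings, and ensemble learning is assumed to be a weighted average of class probabilities from three classifiers. All names, dimensions, and the random linear heads are hypothetical stand-ins for trained models.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def feature_fusion(speech_feat, text_feat):
    """Feature-level fusion (assumed): concatenate the two embeddings."""
    return np.concatenate([speech_feat, text_feat], axis=-1)

def ensemble_average(prob_list, weights=None):
    """Decision-level fusion (assumed): weighted mean of class probabilities."""
    probs = np.stack(prob_list, axis=0)          # (n_models, batch, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalise to sum to 1
    return np.tensordot(weights, probs, axes=1)  # (batch, n_classes)

# Hypothetical sizes: 4 utterances, 8-dim speech and 6-dim text embeddings,
# 4 emotion classes.
rng = np.random.default_rng(0)
batch, d_speech, d_text, n_classes = 4, 8, 6, 4
speech_feat = rng.normal(size=(batch, d_speech))  # stand-in for smodel output
text_feat = rng.normal(size=(batch, d_text))      # stand-in for tmodel output
fused = feature_fusion(speech_feat, text_feat)

# Random linear heads stand in for trained classifiers.
p_speech = softmax(speech_feat @ rng.normal(size=(d_speech, n_classes)))
p_text = softmax(text_feat @ rng.normal(size=(d_text, n_classes)))
p_fused = softmax(fused @ rng.normal(size=(d_speech + d_text, n_classes)))

final = ensemble_average([p_speech, p_text, p_fused])
pred = final.argmax(axis=-1)                      # predicted class per utterance
```

Because each input to `ensemble_average` is a probability distribution and the weights are normalised, every row of `final` still sums to 1, so it can be read directly as a fused class distribution.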
Pages: 374-378
Number of pages: 5
Related papers
50 in total
  • [21] GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition*
    Ye, Jia-Xin
    Wen, Xin-Cheng
    Wang, Xuan-Ze
    Xu, Yong
    Luo, Yan
    Wu, Chang-Li
    Chen, Li-Yan
    Liu, Kun-Hong
    [J]. SPEECH COMMUNICATION, 2022, 145 : 21 - 35
  • [22] Hybrid Feature and Decision Level Fusion of Face and Speech Information for Bimodal Emotion Recognition
    Mansoorizadeh, Muharram
    Charkari, Nasrollah Moghaddam
    [J]. 2009 14TH INTERNATIONAL COMPUTER CONFERENCE, 2009, : 651 - 656
  • [23] A FUSION FRAMEWORK FOR FACE RECOGNITION UNDER VARYING ILLUMINATION BASED ON MULTI-SCALE ANALYSIS
    Chen, Hengxin
    Tang, Yuanyan
    Fang, Bin
    Zhou, Lifang
    [J]. INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2012, 10 (03)
  • [24] MULTI-SCALE OCTAVE CONVOLUTIONS FOR ROBUST SPEECH RECOGNITION
    Rownicka, Joanna
    Bell, Peter
    Renals, Steve
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7019 - 7023
  • [25] Unconscious Emotion Recognition based on Multi-scale Sample Entropy
    Shi, Yanjing
    Zheng, Xiangwei
    Li, Tiantian
    [J]. PROCEEDINGS 2018 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2018, : 1221 - 1226
  • [26] Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition
    Zhao, Jingyu
    Li, Ruwei
    Tian, Maocun
    An, Weidong
    [J]. NEURAL PROCESSING LETTERS, 2024, 56 (04)
  • [27] Multi-feature Fusion Speech Emotion Recognition Based on SVM
    Zeng, Xiaoping
    Dong, Li
    Chen, Guanghui
    Dong, Qi
    [J]. PROCEEDINGS OF 2020 IEEE 10TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2020), 2020, : 77 - 80
  • [28] A Method of Multi-Scale Forward Attention Model for Speech Recognition
    Tang, Hai-Tao
    Xue, Jia-Bin
    Han, Ji-Qing
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2020, 48 (07): : 1255 - 1260
  • [29] CM-TCN: Channel-Aware Multi-scale Temporal Convolutional Networks for Speech Emotion Recognition
    Wu, Tianqi
    Wang, Liejun
    Zhang, Jiang
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2023, PT III, 2024, 14449 : 459 - 476
  • [30] Hierarchical framework for speech emotion recognition
    You, Mingyu
    Chen, Chun
    Bu, Jiajun
    Liu, Jia
    Tao, Jianhua
    [J]. 2006 IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL ELECTRONICS, VOLS 1-7, 2006, : 515 - +