Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Cited by: 6
Authors
Feng, Lin [1 ]
Liu, Lu-Yao [1 ]
Liu, Sheng-Lan [1 ]
Zhou, Jian [1 ]
Yang, Han-Qing [2 ]
Yang, Jie [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
[2] Washington Univ, Comp Sci & Engn, St Louis, MO USA
[3] Tsinghua Univ, Res Inst Informat Technol, Beijing 100084, Peoples R China
Keywords
Speech emotion recognition; Multi-view attention; Multi-scale MFCCs; AUDIO; INFORMATION;
DOI
10.1007/s11042-023-14600-0
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention, since it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many studies in SER focus on multimodal systems, which utilize other modalities such as text and facial expression to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four views (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly utilizing all four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve the accuracy of emotion recognition, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and obtains competitive performance on the MSP-IMPROV dataset.
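The abstract names two concrete mechanisms: four attention views over the audio and text streams, and a joint objective combining softmax cross-entropy with center loss. The PyTorch sketch below is a minimal, hedged reconstruction of both, assuming standard multi-head attention; the shared attention module, the projector head, embed_dim, and lambda_c are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the four attention views and the joint loss described
# in the abstract. Dimensions, the shared attention module, and lambda_c
# are illustrative assumptions.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: pulls each embedding toward its learnable class center."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Mean squared Euclidean distance to each sample's class center.
        diff = features - self.centers[labels]
        return (diff ** 2).sum(dim=1).mean()

embed_dim, num_heads, num_classes = 128, 4, 4  # assumed sizes; 4 emotion classes as on IEMOCAP
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

audio = torch.randn(2, 100, embed_dim)  # (batch, audio frames, dim), e.g. MFCC-derived features
text = torch.randn(2, 20, embed_dim)    # (batch, text tokens, dim)

# The four views named in the abstract, as (query, key, value) triples:
audio_self, _ = attn(audio, audio, audio)    # audio self-attention
text_self, _ = attn(text, text, text)        # textual self-attention
audio_by_text, _ = attn(audio, text, text)   # audio queries attend to textual content
text_by_audio, _ = attn(text, audio, audio)  # text queries attend to audio content

# Pool each view over time and fuse by concatenation (one plausible choice).
fused = torch.cat([audio_self.mean(1), text_self.mean(1),
                   audio_by_text.mean(1), text_by_audio.mean(1)], dim=-1)

classifier = nn.Linear(4 * embed_dim, num_classes)  # emotion logits
projector = nn.Linear(4 * embed_dim, embed_dim)     # embedding fed to center loss
logits, feats = classifier(fused), projector(fused)

labels = torch.tensor([0, 2])  # toy emotion labels
lambda_c = 0.1                 # assumed weight; the paper's value may differ
center_loss = CenterLoss(num_classes, embed_dim)
loss = nn.CrossEntropyLoss()(logits, labels) + lambda_c * center_loss(feats, labels)
```

In the paper, each view would typically use its own attention parameters, and the gender-recognition auxiliary task would add a second cross-entropy term; the single shared module and single-task loss here only illustrate the data flow.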
Pages: 28917-28935
Number of pages: 19
Related Papers
50 records
  • [11] A Lightweight Multi-Scale Model for Speech Emotion Recognition
    Li, Haoming
    Zhao, Daqi
    Wang, Jingwen
    Wang, Deqiang
    IEEE ACCESS, 2024, 12 : 130228 - 130240
  • [12] Multi-Scale Temporal Transformer For Speech Emotion Recognition
    Li, Zhipeng
    Xing, Xiaofen
    Fang, Yuanbo
    Zhang, Weibin
    Fan, Hengsheng
    Xu, Xiangmin
    INTERSPEECH 2023, 2023, : 3652 - 3656
  • [13] Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition
    Zhao, Jingyu
    Li, Ruwei
    Tian, Maocun
    An, Weidong
    NEURAL PROCESSING LETTERS, 2024, 56 (04)
  • [14] Multi-Scale Multi-View Deep Feature Aggregation for Food Recognition
    Jiang, Shuqiang
    Min, Weiqing
    Liu, Linhu
    Luo, Zhengdong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 265 - 276
  • [15] Discriminative feature learning based on multi-view attention network with diffusion joint loss for speech emotion recognition
    Liu, Yang
    Chen, Xin
    Song, Yuan
    Li, Yarong
    Wang, Shengbei
    Yuan, Weitao
    Li, Yongwei
    Zhao, Zhen
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137
  • [16] EEG Emotion Recognition Method Using Multi-Scale and Multi-Path Hybrid Attention Mechanism
    Gu, Xuejing
    Liu, Jia
    Guo, Yucheng
    Yang, Zhaohui
COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (19) : 130 - 138
  • [17] EMOTION RECOGNITION BASED ON MULTI-VIEW BODY GESTURES
    Shen, Zhijuan
    Cheng, Jun
    Hu, Xiping
    Dong, Qian
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3317 - 3321
  • [18] Multi-View Speech Emotion Recognition Via Collective Relation Construction
    Hou, Mixiao
    Zhang, Zheng
    Cao, Qi
    Zhang, David
    Lu, Guangming
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 218 - 229
  • [19] A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition
    Chen, Ming
    Zhao, Xudong
    INTERSPEECH 2020, 2020, : 374 - 378
  • [20] MULTI-VIEW VISUAL SPEECH RECOGNITION BASED ON MULTI TASK LEARNING
    Han, HouJeung
    Kang, Sunghun
    Yoo, Chang D.
    2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017, : 3983 - 3987