Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Cited by: 6
Authors
Feng, Lin [1 ]
Liu, Lu-Yao [1 ]
Liu, Sheng-Lan [1 ]
Zhou, Jian [1 ]
Yang, Han-Qing [2 ]
Yang, Jie [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
[2] Washington Univ, Comp Sci & Engn, St Louis, MO USA
[3] Tsinghua Univ, Res Inst Informat Technol, Beijing 100084, Peoples R China
Keywords
Speech emotion recognition; Multi-view attention; Multi-scale MFCCs; AUDIO; INFORMATION
DOI
10.1007/s11042-023-14600-0
CLC classification
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention as a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies focus on multimodal systems that use other modalities, such as text and facial expressions, to assist speech emotion recognition. However, it is difficult to construct a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which is able to extract rich audio emotional features and comprehensively fuse emotion-related features from four aspects (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly utilizing the four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve the accuracy of emotion recognition, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and obtains competitive performance on MSP-IMPROV.
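The four attention views named in the abstract map naturally onto standard attention modules. The PyTorch sketch below shows one plausible wiring, reconstructed from the abstract alone rather than from the authors' code; the query/key-value assignment per view, the use of nn.MultiheadAttention, and the mean-pool-and-concatenate fusion are all assumptions.

```python
import torch
import torch.nn as nn

class FourViewAttention(nn.Module):
    """Sketch of the four attention views named in the abstract: audio
    self-attention, textual self-attention, audio attention based on
    textual content, and textual attention based on audio content.
    The query/key-value roles below are one plausible reading."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.views = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, heads, batch_first=True)
            for name in ("aa", "tt", "at", "ta")
        })

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (B, La, dim) frame-level features; text: (B, Lt, dim) token features
        aa, _ = self.views["aa"](audio, audio, audio)  # audio self-attention
        tt, _ = self.views["tt"](text, text, text)     # textual self-attention
        at, _ = self.views["at"](text, audio, audio)   # audio attended via text queries
        ta, _ = self.views["ta"](audio, text, text)    # text attended via audio queries
        # Pool each view over time and concatenate into one fused embedding.
        return torch.cat([v.mean(dim=1) for v in (aa, tt, at, ta)], dim=-1)

# e.g. FourViewAttention(128)(torch.randn(2, 200, 128), torch.randn(2, 30, 128))
# yields a (2, 512) fused multimodal embedding.
```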
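Likewise, the multi-scale MFCC front end and the joint objective (softmax cross-entropy plus center loss) can be sketched as follows. This is a minimal illustration, not the paper's configuration: the 25/50/100 ms window lengths, the weight lambda_c, and the 4-class, 128-dimensional setup are assumed values.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

# --- Multi-scale MFCCs: the same signal analysed at three window sizes ---
sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)  # stand-in for a real utterance
mfcc_scales = [
    librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                         n_fft=int(sr * w), hop_length=int(sr * w / 2))
    for w in (0.025, 0.050, 0.100)  # assumed 25/50/100 ms windows
]

# --- Joint loss: softmax cross-entropy on logits + center loss on embeddings ---
class CenterLoss(nn.Module):
    """Pulls each utterance embedding toward a learnable class center."""

    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        diff = feats - self.centers[labels]         # offset from own class center
        return 0.5 * (diff ** 2).sum(dim=1).mean()  # mean squared distance

ce = nn.CrossEntropyLoss()
center = CenterLoss(num_classes=4, feat_dim=128)  # 4 emotion classes assumed
lambda_c = 0.01                                   # assumed weighting factor

def joint_loss(logits, embeddings, labels):
    return ce(logits, labels) + lambda_c * center(embeddings, labels)
```

In the paper this objective is additionally trained alongside a gender-recognition head under multi-task learning; that auxiliary branch is omitted here for brevity.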
Pages: 28917-28935 (19 pages)
Related papers
50 records in total
  • [21] Multi-view domain adaption based multi-scale convolutional conditional invertible discriminator for cross-subject electroencephalogram emotion recognition
    Babu, S. Sivasaravana
    Venkatesan, Prabhu
    Velusamy, Parthasarathy
    Ganesan, Saravana Kumar
    COGNITIVE NEURODYNAMICS, 2025, 19 (01)
  • [22] Multi-view Remote Sensing Image Scene Classification by Fusing Multi-scale Attention
    Shi Y.
    Zhou W.
    Shao Z.
    Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University, 2024, 49 (03): : 366 - 375
  • [23] A Method of Multi-Scale Forward Attention Model for Speech Recognition
    Tang H.-T.
    Xue J.-B.
    Han J.-Q.
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2020, 48 (07): : 1255 - 1260
  • [24] Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion
    Yu, Lingli
    Xu, Fengjun
    Qu, Yundong
    Zhou, Kaijun
    APPLIED ACOUSTICS, 2024, 216
  • [25] A Multi-Scale Detector Based on Attention Mechanism
    Zhou, Lukuan
    Wang, Wei
    Wang, Qiang
    Sheng, Biyun
    Yang, Wankou
    2020 35TH YOUTH ACADEMIC ANNUAL CONFERENCE OF CHINESE ASSOCIATION OF AUTOMATION (YAC), 2020, : 110 - 115
  • [26] MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction
    Wang, Zhongyu
    Deng, Zhaohong
    Zhang, Wei
    Lou, Qiongdan
    Choi, Kup-Sze
    Wei, Zhisheng
    Wang, Lei
    Wu, Jing
    BRIEFINGS IN BIOINFORMATICS, 2023, 24 (04)
  • [27] SPEECH EMOTION RECOGNITION USING MULTI-HOP ATTENTION MECHANISM
    Yoon, Seunghyun
    Byun, Seokhyun
    Dey, Subhadeep
    Jung, Kyomin
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2822 - 2826
  • [28] SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
    Kang, Zuheng
    Peng, Junqing
    Wang, Jianzong
    Xiao, Jing
    INTERSPEECH 2022, 2022, : 4745 - 4749
  • [29] Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition
    Zheng W.
    Zheng W.
    Zong Y.
    Virtual Reality & Intelligent Hardware (KeAi), 2021, 3 (01): : 65 - 75
  • [30] A Multi-scale Attention-based Facial Emotion Recognition Method Based on Deep Learning
    ZHANG Ning
    ZHANG Xiufeng
    FU Xingkui
    QI Guobin
    Instrumentation, 2022, 9 (03) : 51 - 58