Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Cited by: 6
Authors
Feng, Lin [1 ]
Liu, Lu-Yao [1 ]
Liu, Sheng-Lan [1 ]
Zhou, Jian [1 ]
Yang, Han-Qing [2 ]
Yang, Jie [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
[2] Washington Univ, Comp Sci & Engn, St Louis, MO USA
[3] Tsinghua Univ, Res Inst Informat Technol, Beijing 100084, Peoples R China
Keywords
Speech emotion recognition; Multi-view attention; Multi-scale MFCCs; AUDIO; INFORMATION
DOI
10.1007/s11042-023-14600-0
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention as a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies turn to multimodal systems that use other modalities, such as text and facial expression, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four views (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly using all four attention modules and three different scales of MFCCs. In addition, within a multi-task learning framework, we treat gender recognition as an auxiliary task to learn gender information. To further improve emotion recognition accuracy, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and achieves competitive performance on the MSP-IMPROV dataset.
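To make the described fusion scheme concrete, below is a minimal PyTorch sketch of the four attention views and the joint loss from the abstract. It is an illustration under assumptions, not the authors' implementation: the dimensions (d_model, n_heads), the mean-pooling fusion, the class counts, and the weighting coefficients (0.3 for the auxiliary gender loss, 0.01 for center loss) are hypothetical choices, and the standard nn.MultiheadAttention module stands in for whatever attention variant the paper actually uses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewAttentionFusion(nn.Module):
    """Four attention views over audio and text features: audio
    self-attention, textual self-attention, audio attended by
    textual content, and text attended by audio content."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_self = mha()     # audio self-attention
        self.text_self = mha()      # textual self-attention
        self.audio_by_text = mha()  # audio queries, text keys/values
        self.text_by_audio = mha()  # text queries, audio keys/values

    def forward(self, audio, text):
        # audio: (B, T_a, d_model), e.g., projected multi-scale MFCC frames
        # text:  (B, T_t, d_model), e.g., projected word embeddings
        a_self, _ = self.audio_self(audio, audio, audio)
        t_self, _ = self.text_self(text, text, text)
        a_by_t, _ = self.audio_by_text(audio, text, text)
        t_by_a, _ = self.text_by_audio(text, audio, audio)
        # Mean-pool each view over time, then concatenate -> (B, 4 * d_model).
        views = [v.mean(dim=1) for v in (a_self, t_self, a_by_t, t_by_a)]
        return torch.cat(views, dim=-1)

class CenterLoss(nn.Module):
    """Center loss: pulls each fused embedding toward a learned
    per-class center, tightening intra-class clusters."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # 0.5 * ||x_i - c_{y_i}||^2, averaged over the batch.
        return 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).mean()

# Hypothetical training step: emotion classification as the main task,
# gender recognition as the auxiliary task, joint CE + center loss.
fusion = MultiViewAttentionFusion(d_model=128)
center_loss = CenterLoss(num_classes=4, feat_dim=4 * 128)
emotion_head = nn.Linear(4 * 128, 4)  # main task: 4 emotion classes
gender_head = nn.Linear(4 * 128, 2)   # auxiliary task: 2 gender classes

audio = torch.randn(8, 300, 128)
text = torch.randn(8, 40, 128)
emotion_labels = torch.randint(0, 4, (8,))
gender_labels = torch.randint(0, 2, (8,))

fused = fusion(audio, text)
loss = (F.cross_entropy(emotion_head(fused), emotion_labels)
        + 0.3 * F.cross_entropy(gender_head(fused), gender_labels)
        + 0.01 * center_loss(fused, emotion_labels))

The point of concatenating the two self-attention views with the two cross-modal views is that each modality contributes both its own salient features and the features highlighted by the other modality, which matches the abstract's finding that all four modules together give the best accuracy.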
Pages: 28917-28935
Page count: 19
Related Papers
50 records in total
  • [1] Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism
    Lin Feng
    Lu-Yao Liu
    Sheng-Lan Liu
    Jian Zhou
    Han-Qing Yang
    Jie Yang
    Multimedia Tools and Applications, 2023, 82 : 28917 - 28935
  • [2] Multi-view and multi-scale behavior recognition algorithm based on attention mechanism
    Zhang, Di
    Chen, Chen
    Tan, Fa
    Qian, Beibei
    Li, Wei
    He, Xuan
    Lei, Susan
    FRONTIERS IN NEUROROBOTICS, 2023, 17
  • [3] Learning multi-scale features for speech emotion recognition with connection attention mechanism
    Chen, Zengzhao
    Li, Jiawen
    Liu, Hai
    Wang, Xuyang
    Wang, Hu
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 214
  • [4] Multi-scale attention and loss penalty mechanism for multi-view clustering
    Wang, Tingyu
    Zhai, Rui
    Wang, Longge
    Yu, Junyang
    Li, Han
    Wang, Zhicheng
    Wu, Jinhu
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [5] Multimodal and Multi-view Models for Emotion Recognition
    Aguilar, Gustavo
    Rozgic, Viktor
    Wang, Weiran
    Wang, Chao
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 991 - 1002
  • [6] Removing Bias with Residual Mixture of Multi-View Attention for Speech Emotion Recognition
    Jalal, Md Asif
    Milner, Rosanna
    Hain, Thomas
    Moore, Roger K.
    INTERSPEECH 2020, 2020, : 4084 - 4088
  • [7] Multi-view and Multi-scale Recognition of Symmetric Patterns
    Teferi, Dereje
    Bigun, Josef
    IMAGE ANALYSIS, PROCEEDINGS, 2009, 5575 : 657 - 666
  • [8] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024
  • [9] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Zhao, Zhen
    SPEECH COMMUNICATION, 2022, 139 : 1 - 9