Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Cited by: 6
Authors
Feng, Lin [1 ]
Liu, Lu-Yao [1 ]
Liu, Sheng-Lan [1 ]
Zhou, Jian [1 ]
Yang, Han-Qing [2 ]
Yang, Jie [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
[2] Washington Univ, Comp Sci & Engn, St Louis, MO USA
[3] Tsinghua Univ, Res Inst Informat Technol, Beijing 100084, Peoples R China
Keywords
Speech emotion recognition; Multi-view attention; Multi-scale MFCCs; AUDIO; INFORMATION;
DOI
10.1007/s11042-023-14600-0
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention as a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies turn to multimodal systems that use other modalities, such as text and facial expressions, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts abundant audio emotional features and comprehensively fuses emotion-related features from four aspects (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly utilizing all four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve emotion recognition accuracy, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The experimental results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset, while achieving competitive performance on MSP-IMPROV.
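The two mechanisms named in the abstract can be illustrated schematically: cross-modal attention (one modality attending over another, e.g. "audio attention based on textual content") and a joint softmax cross-entropy plus center loss. The following is a minimal NumPy sketch under stated assumptions; the function names, shapes, and the center-loss weight `lam` are illustrative choices, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats):
    """Scaled dot-product attention where one modality (the query,
    e.g. audio frames) attends over the other (key/value, e.g. text
    tokens), yielding text-conditioned audio features."""
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)  # (T_q, T_k)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ key_feats                       # (T_q, d)

def joint_loss(logits, labels, feats, centers, lam=0.003):
    """Softmax cross-entropy plus center loss: the center term pulls
    each utterance embedding toward its class center, tightening
    intra-class variance. `lam` balances the two terms."""
    n = len(labels)
    probs = softmax(logits, axis=-1)
    ce = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
    center = 0.5 * np.sum((feats - centers[labels]) ** 2) / n
    return ce + lam * center
```

In practice both pieces would be differentiable modules (e.g. in PyTorch) with learned projections for queries/keys and trainable class centers; the sketch only shows the forward computations.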
Pages: 28917-28935 (19 pages)