Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Cited by: 6
|
Authors
Feng, Lin [1 ]
Liu, Lu-Yao [1 ]
Liu, Sheng-Lan [1 ]
Zhou, Jian [1 ]
Yang, Han-Qing [2 ]
Yang, Jie [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
[2] Washington Univ, Comp Sci & Engn, St Louis, MO USA
[3] Tsinghua Univ, Res Inst Informat Technol, Beijing 100084, Peoples R China
Keywords
Speech emotion recognition; Multi-view attention; Multi-scale MFCCs; AUDIO; INFORMATION
DOI
10.1007/s11042-023-14600-0
Chinese Library Classification
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention, since it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies focus on multimodal systems that use other modalities, such as text and facial expression, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts abundant audio emotional features and comprehensively fuses emotion-related features from four views (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly using all four attention modules and three different scales of MFCCs. In addition, within a multi-task learning framework, we treat gender recognition as an auxiliary task to learn gender information. To further improve emotion recognition accuracy, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and obtains competitive performance on MSP-IMPROV.
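The four attention views named in the abstract can be illustrated with a minimal sketch. The `scaled_dot_attention` and `multi_view_fusion` helpers below, the query/key/value pairings for the two cross-modal views, and the mean-pool-and-concatenate fusion are all illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

def scaled_dot_attention(q, k, v):
    """Scaled dot-product attention. q: (Tq, D); k, v: (Tk, D)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # rows sum to 1
    return w @ v                                  # (Tq, D)

def multi_view_fusion(audio, text):
    """Fuse audio (Ta, D) and text (Tt, D) features from four views."""
    views = [
        scaled_dot_attention(audio, audio, audio),  # audio self-attention
        scaled_dot_attention(text, text, text),     # textual self-attention
        scaled_dot_attention(text, audio, audio),   # audio attended via textual content
        scaled_dot_attention(audio, text, text),    # text attended via audio content
    ]
    # Pool each view over time and concatenate into one fused vector (4*D,).
    return np.concatenate([v.mean(axis=0) for v in views])
```

For feature dimension D, the fused representation is a 4D-dimensional vector that a downstream classifier (emotion head, plus the auxiliary gender head) could consume.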
Pages: 28917-28935 (19 pages)