Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Citations: 0
Authors
Shang, Yanan [1 ]
Fu, Tianqi [1 ]
Affiliations
[1] Cangzhou Normal Univ, Cangzhou 061001, Hebei, Peoples R China
Source
Keywords
Multimodal fusion; Deep learning; GloVe model; BiGRU; Emotion recognition; NEURAL-NETWORK;
DOI
10.1016/j.iswa.2024.200436
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) vector was used as the speech emotion feature, and a 300-dimensional word vector obtained with the GloVe algorithm was used as the text emotion feature. The bidirectional gated recurrent unit (BiGRU), a deep learning method, was used to extract deep features. It was then combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition, which was validated on the IEMOCAP and MELD datasets. MFCC and GloVe word vectors were found to be the superior features for recognition. Comparisons with support vector machine and convolutional neural network methods showed that the ISSA-BiGRU-MHA method achieved the highest weighted and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on the IEMOCAP, MELD, MOSI, and MOSEI datasets, respectively, outperforming unimodal recognition. These results confirm the reliability of the multimodal fusion recognition method and demonstrate its practical applicability.
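As a rough illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a speech-text fusion model: each modality (39-dimensional MFCC frames for speech, 300-dimensional GloVe word vectors for text) is encoded by a BiGRU followed by multi-head self-attention, and the pooled representations are concatenated for classification. The hidden sizes, number of attention heads, mean pooling, concatenation fusion, and four-class output are illustrative assumptions, and the ISSA hyperparameter search used in the paper is omitted; this is not the authors' implementation.

```python
# Minimal sketch of a BiGRU + multi-head self-attention fusion model for
# speech-text emotion recognition. Feature dimensions follow the abstract
# (39-dim MFCC frames, 300-dim GloVe vectors); all other design choices are
# illustrative assumptions, and the ISSA hyperparameter search is omitted.
import torch
import torch.nn as nn


class BiGRUMHABranch(nn.Module):
    """Encode one modality with a BiGRU followed by multi-head self-attention."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.mha = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        h, _ = self.bigru(x)                 # (batch, seq_len, 2 * hidden_dim)
        attn, _ = self.mha(h, h, h)          # self-attention over time steps
        return attn.mean(dim=1)              # pooled utterance-level representation


class SpeechTextFusionModel(nn.Module):
    """Late fusion of speech (MFCC) and text (GloVe) branches by concatenation."""

    def __init__(self, num_classes: int = 4, hidden_dim: int = 128):
        super().__init__()
        self.speech_branch = BiGRUMHABranch(input_dim=39, hidden_dim=hidden_dim)
        self.text_branch = BiGRUMHABranch(input_dim=300, hidden_dim=hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, mfcc: torch.Tensor, glove: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.speech_branch(mfcc), self.text_branch(glove)], dim=-1)
        return self.classifier(fused)        # emotion-class logits


if __name__ == "__main__":
    model = SpeechTextFusionModel(num_classes=4)
    mfcc = torch.randn(8, 200, 39)    # 8 utterances, 200 MFCC frames each
    glove = torch.randn(8, 50, 300)   # 8 transcripts, 50 GloVe word vectors each
    print(model(mfcc, glove).shape)   # torch.Size([8, 4])
```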
Pages: 6