Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Cited by: 1
Authors
Shang, Yanan [1 ]
Fu, Tianqi [1 ]
Affiliation
[1] Cangzhou Normal Univ, Cangzhou 061001, Hebei, Peoples R China
Source
INTELLIGENT SYSTEMS WITH APPLICATIONS | 2024, Vol. 24
Keywords
Multimodal fusion; Deep learning; GloVe model; BiGRU; Emotion recognition; Neural network
DOI
10.1016/j.iswa.2024.200436
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) vector was used as the speech emotion feature, and a 300-dimensional word vector obtained with the GloVe algorithm was used as the text emotion feature. A bidirectional gated recurrent unit (BiGRU) from deep learning was added to extract deep features. It was then combined with a multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition, which was validated on the IEMOCAP and MELD datasets. MFCC and GloVe word vectors exhibited superior recognition performance as features. Compared with support vector machine and convolutional neural network methods, the ISSA-BiGRU-MHA method achieved the highest weighted and unweighted accuracies. Multimodal fusion reached weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on the IEMOCAP, MELD, MOSI, and MOSEI datasets, outperforming unimodal recognition. These results affirm the reliability and practical applicability of the multimodal fusion recognition method.
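The abstract applies multi-head self-attention on top of BiGRU outputs over fused speech-text features. A minimal NumPy sketch of that attention step is shown below; it is an illustration under stated assumptions, not the authors' implementation: the random projection matrices stand in for learned Q/K/V/output weights, and the 50x64 input stands in for a BiGRU output sequence over fused MFCC and GloVe features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """One MHA layer over a (seq_len, d_model) feature sequence."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    # Random projections stand in for learned Q/K/V/output weights.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    def split(H):  # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return H.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    # Scaled dot-product attention, computed per head.
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    out = (attn @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo, attn

rng = np.random.default_rng(0)
# Stand-in for BiGRU output: 50 time steps of 64-d fused speech-text features.
feats = rng.standard_normal((50, 64))
out, attn = multi_head_self_attention(feats, num_heads=4, rng=rng)
print(out.shape)   # (50, 64): same shape as the input sequence
```

Each attention head attends over all 50 time steps, so each row of the per-head weight matrix is a probability distribution summing to 1; in the paper's pipeline such a layer would let emotionally salient frames or words reweight the fused sequence before classification.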
Pages: 6
Related papers (50 total)
  • [41] Multimodal Emotion Recognition using Deep Learning Architectures
    Ranganathan, Hiranmayi
    Chakraborty, Shayok
    Panchanathan, Sethuraman
    2016 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2016), 2016,
  • [42] Deep Learning-Based Emotion Recognition by Fusion of Facial Expressions and Speech Features
    Vardhan, Jasthi Vivek
    Chakravarti, Yelavarti Kalyan
    Chand, Annam Jitin
    2024 2ND WORLD CONFERENCE ON COMMUNICATION & COMPUTING, WCONF 2024, 2024,
  • [43] Annotation Efficiency in Multimodal Emotion Recognition with Deep Learning
    Zhu, Lili
    Spachos, Petros
    2022 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM 2022), 2022, : 560 - 565
  • [44] MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition
    Qi, Xin
    Wen, Yujun
    Zhang, Pengzhou
    Huang, Heyan
    NEUROCOMPUTING, 2025, 611
  • [45] Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition
    Wang, Yuhua
    Shen, Guang
    Xu, Yuezhu
    Li, Jiahang
    Zhao, Zhengdao
    INTERSPEECH 2021, 2021, : 4518 - 4522
  • [46] Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition
    Liu, Wei
    Qiu, Jie-Lin
    Zheng, Wei-Long
    Lu, Bao-Liang
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (02) : 715 - 729
  • [47] A Cross-Culture Study on Multimodal Emotion Recognition Using Deep Learning
    Gan, Lu
    Liu, Wei
    Luo, Yun
    Wu, Xun
    Lu, Bao-Liang
    NEURAL INFORMATION PROCESSING (ICONIP 2019), PT IV, 2019, 1142 : 670 - 680
  • [48] Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
    Lee, Chan Woo
    Song, Kyu Ye
    Jeong, Jihoon
    Choi, Woo Yong
    FIRST GRAND CHALLENGE AND WORKSHOP ON HUMAN MULTIMODAL LANGUAGE (CHALLENGE-HML), 2018, : 28 - 34
  • [49] Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
    Lee, Yoonhyung
    Yoon, Seunghyun
    Jung, Kyomin
    INTERSPEECH 2020, 2020, : 2717 - 2721
  • [50] Ensemble deep learning with HuBERT for speech emotion recognition
    Yang, Janghoon
    2023 IEEE 17TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, ICSC, 2023, : 153 - 154