Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Cited by: 1

Authors:

Shang, Yanan [1]

Fu, Tianqi [1]

Affiliations:

[1] Cangzhou Normal University, Cangzhou 061001, Hebei, People's Republic of China

Source:

INTELLIGENT SYSTEMS WITH APPLICATIONS | 2024 / Vol. 24

Keywords:

Multimodal fusion; Deep learning; GloVe model; BiGRU; Emotion recognition; NEURAL-NETWORK

DOI:

10.1016/j.iswa.2024.200436

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. 39-dimensional Mel-frequency cepstral coefficients (MFCCs) were used as the speech emotion features, and 300-dimensional word vectors obtained with the GloVe algorithm were used as the text emotion features. A bidirectional gated recurrent unit (BiGRU) was added to extract deep features. It was then combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition. The method was validated on the IEMOCAP and MELD datasets. MFCCs and GloVe word vectors were found to give superior recognition results as features. Comparisons with the support vector machine and convolutional neural network methods showed that the ISSA-BiGRU-MHA method achieved the highest weighted accuracy and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on the IEMOCAP, MELD, MOSI, and MOSEI datasets, respectively, suggesting better performance than unimodal recognition. These results affirm the reliability of the multimodal fusion recognition method and indicate its practical applicability.
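The abstract reports weighted accuracy and unweighted accuracy without defining them. A common convention in speech emotion recognition, assumed here and not confirmed by this record, is that weighted accuracy is the overall fraction of correctly classified samples, while unweighted accuracy is the mean of the per-class recalls. The minimal sketch below illustrates that convention with made-up labels; the function names and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples.

    Assumed definition; frequent classes dominate this score.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally.

    Assumed definition; only classes present in y_true are averaged.
    """
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (illustrative labels only, not dataset results).
y_true = ["happy", "happy", "happy", "happy", "sad", "sad", "angry", "neutral"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy", "angry", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced label sets the two scores can differ noticeably, which is why emotion recognition work often reports both.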

Pages: 6
