Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Citations: 0
Authors
Shang, Yanan [1 ]
Fu, Tianqi [1 ]
Affiliations
[1] Cangzhou Normal Univ, Cangzhou 061001, Hebei, Peoples R China
Source
Keywords
Multimodal fusion; Deep learning; GloVe model; BiGRU; Emotion recognition; NEURAL-NETWORK;
DOI
10.1016/j.iswa.2024.200436
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) vector was used as the speech emotion feature, and a 300-dimensional word vector obtained with the GloVe algorithm was used as the text emotion feature. The bidirectional gated recurrent unit (BiGRU), a deep learning method, was used to extract deep features. It was then combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition, which was validated on the IEMOCAP and MELD datasets. MFCC and GloVe word vectors were found to be the superior features for recognition. Comparisons with support vector machine and convolutional neural network methods showed that the ISSA-BiGRU-MHA method achieved the highest weighted and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on the IEMOCAP, MELD, MOSI, and MOSEI datasets, respectively, outperforming unimodal recognition. These results confirm the reliability of the multimodal fusion recognition method and demonstrate its practical applicability.
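As a rough illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a speech-text fusion model: each modality (39-dimensional MFCC frames for speech, 300-dimensional GloVe word vectors for text) is encoded by a BiGRU followed by multi-head self-attention, and the pooled representations are concatenated for classification. The hidden sizes, number of attention heads, mean pooling, concatenation fusion, and four-class output are illustrative assumptions, and the ISSA hyperparameter search used in the paper is omitted; this is not the authors' implementation.

```python
# Minimal sketch of a BiGRU + multi-head self-attention fusion model for
# speech-text emotion recognition. Feature dimensions follow the abstract
# (39-dim MFCC frames, 300-dim GloVe vectors); all other design choices are
# illustrative assumptions, and the ISSA hyperparameter search is omitted.
import torch
import torch.nn as nn


class BiGRUMHABranch(nn.Module):
    """Encode one modality with a BiGRU followed by multi-head self-attention."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.mha = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        h, _ = self.bigru(x)                 # (batch, seq_len, 2 * hidden_dim)
        attn, _ = self.mha(h, h, h)          # self-attention over time steps
        return attn.mean(dim=1)              # pooled utterance-level representation


class SpeechTextFusionModel(nn.Module):
    """Late fusion of speech (MFCC) and text (GloVe) branches by concatenation."""

    def __init__(self, num_classes: int = 4, hidden_dim: int = 128):
        super().__init__()
        self.speech_branch = BiGRUMHABranch(input_dim=39, hidden_dim=hidden_dim)
        self.text_branch = BiGRUMHABranch(input_dim=300, hidden_dim=hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, mfcc: torch.Tensor, glove: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.speech_branch(mfcc), self.text_branch(glove)], dim=-1)
        return self.classifier(fused)        # emotion-class logits


if __name__ == "__main__":
    model = SpeechTextFusionModel(num_classes=4)
    mfcc = torch.randn(8, 200, 39)    # 8 utterances, 200 MFCC frames each
    glove = torch.randn(8, 50, 300)   # 8 transcripts, 50 GloVe word vectors each
    print(model(mfcc, glove).shape)   # torch.Size([8, 4])
```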
Pages: 6