A customizable framework for multimodal emotion recognition using ensemble of deep neural network models

Cited by: 3
Authors
Dixit, Chhavi [1 ]
Satapathy, Shashank Mouli [2 ]
Affiliations
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords
CNN; Cross-dataset; Deep neural network; ELMo; Multimodal emotion recognition; RNN; Stacking; FACIAL EXPRESSION; SENTIMENT ANALYSIS;
DOI
10.1007/s00530-023-01188-6
Chinese Library Classification: TP [Automation and Computer Technology]
Discipline code: 0812
Abstract
Multimodal emotion recognition for videos of human oration, commonly called opinion videos, in which speakers express their views on various topics, has a wide range of applications across domains. Much research in this field aims to introduce accurate and efficient architectures, and this study carries the same objective while exploring novel concepts in emotion recognition. The proposed framework uses cross-dataset training and testing, so that the resulting architecture and models are not restricted to the domain of the input. It uses benchmark datasets and ensemble learning to ensure that even if the individual models are slightly biased, the bias can be countered by what the other models have learned. To achieve this objective, three benchmark datasets, ISEAR, RAVDESS, and FER-2013, are used to train independent models for the three modalities of text, audio, and images; an additional dataset is used alongside ISEAR to train the text model. The models are then combined and tested on the benchmark multimodal dataset CMU-MOSEI. The text model uses ELMo embeddings and an RNN, the audio model uses a simple DNN, and the image model uses a 2D CNN after pre-processing. Their outputs are aggregated using the stacking technique to produce the final result. The complete architecture can be used as partially pre-trained for predicting the individual modalities and partially trainable for stacking the results, yielding efficient emotion prediction based on input quality. The accuracy obtained on the CMU-MOSEI dataset is 86.60% and the corresponding F1-score is 0.84.
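The stacking step described in the abstract can be sketched as follows: each base model (text RNN, audio DNN, image CNN) emits a probability vector over emotion classes, the vectors are concatenated, and a meta-learner is trained on them. This is a minimal illustrative sketch with simulated base-model outputs, an assumed six-class label set, and a multinomial logistic regression as the meta-learner; it is not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 6   # illustrative six-emotion label set (assumption)
N_SAMPLES = 200

def fake_probs(n, k, true_labels, noise=0.5):
    """Simulate a weakly informative base model's class probabilities."""
    logits = noise * rng.standard_normal((n, k))
    logits[np.arange(n), true_labels] += 1.5   # boost the true class
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

y = rng.integers(0, N_CLASSES, N_SAMPLES)
text_p, audio_p, image_p = (fake_probs(N_SAMPLES, N_CLASSES, y) for _ in range(3))

# Stacking: concatenate the per-modality probability vectors as meta-features.
X = np.hstack([text_p, audio_p, image_p])      # shape (N_SAMPLES, 3 * N_CLASSES)

# Meta-learner: multinomial logistic regression fit by gradient descent.
W = np.zeros((X.shape[1], N_CLASSES))
for _ in range(300):
    z = X @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    grad = X.T @ (p - np.eye(N_CLASSES)[y]) / N_SAMPLES
    W -= 1.0 * grad

pred = (X @ W).argmax(axis=1)
acc = (pred == y).mean()
print(f"stacked training accuracy: {acc:.2f}")
```

Because the meta-learner sees all three modalities at once, it can weight a modality down when its base model is unreliable, which is the bias-countering effect the abstract attributes to the ensemble.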
Pages: 3151-3168 (18 pages)