Deep CNN with late fusion for real time multimodal emotion recognition

Cited by: 4
Authors
Dixit, Chhavi [1 ]
Satapathy, Shashank Mouli [2 ]
Affiliations
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords
CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking; Sentiment analysis; Model
DOI
10.1016/j.eswa.2023.122579
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition is a fundamental aspect of human communication and plays a crucial role in various domains. This project aims to develop an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used, contributing 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained for each modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting its translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, owing to its ability to capture local features and handle variations in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then applied to real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used to test individual and combined accuracies on CMU-MOSEI. The emotions finally predicted by the architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on a specific domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83 on the CMU-MOSEI dataset.
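To make the late-fusion step described in the abstract concrete, below is a minimal sketch of how bagging-style averaging and stacking could combine the three unimodal outputs. It assumes each trained model (fastText, 1-D CNN, 2-D CNN) emits per-sample probabilities over the six CMU-MOSEI emotion classes; the function names and the placeholder data are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of late fusion over three unimodal emotion models.
# Assumption: each model outputs an (n_samples x 6) probability matrix
# over the six CMU-MOSEI emotions (happiness, sadness, anger, fear,
# disgust, surprise). The unimodal models themselves are stubbed out.
import numpy as np
from sklearn.linear_model import LogisticRegression

NUM_CLASSES = 6

def late_fusion_average(text_probs, audio_probs, image_probs):
    """Bagging-style fusion: average the unimodal probability vectors
    and pick the highest-scoring class per sample."""
    stacked = np.stack([text_probs, audio_probs, image_probs], axis=0)
    return stacked.mean(axis=0).argmax(axis=1)

def fit_stacking_fusion(text_probs, audio_probs, image_probs, labels):
    """Stacking fusion: train a meta-learner on the concatenated
    unimodal probabilities (n_samples x 3*NUM_CLASSES features)."""
    meta_features = np.concatenate(
        [text_probs, audio_probs, image_probs], axis=1)
    meta = LogisticRegression(max_iter=1000)
    meta.fit(meta_features, labels)
    return meta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200

    def fake_probs():
        # Placeholder outputs; in practice these would come from the
        # trained fastText, 1-D CNN, and 2-D CNN models.
        p = rng.random((n, NUM_CLASSES))
        return p / p.sum(axis=1, keepdims=True)

    t, a, v = fake_probs(), fake_probs(), fake_probs()
    y = rng.integers(0, NUM_CLASSES, size=n)
    print("bagging predictions:", late_fusion_average(t, a, v)[:5])
    meta = fit_stacking_fusion(t, a, v, y)
    fused = np.concatenate([t, a, v], axis=1)
    print("stacking predictions:", meta.predict(fused)[:5])
```

The design trade-off the paper evaluates is visible here: averaging treats every modality as equally reliable, while the stacking meta-learner can learn per-modality and per-class weights from held-out data, which is why the two are compared on CMU-MOSEI.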
Pages: 15