Deep CNN with late fusion for real time multimodal emotion recognition

Cited by: 4
Authors
Dixit, Chhavi [1 ]
Satapathy, Shashank Mouli [2 ]
Affiliations
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords
CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking; SENTIMENT ANALYSIS; MODEL;
DOI
10.1016/j.eswa.2023.122579
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition is a fundamental aspect of human communication and plays a crucial role in various domains. This project aims to develop an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used, contributing 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained per modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, for its ability to capture local features and handle variations in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then used for real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used for testing individual and combined accuracies on CMU-MOSEI. The emotions predicted by the architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on a specific domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83 on the CMU-MOSEI dataset.
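The abstract describes a late-fusion design: each modality's classifier emits class probabilities, which are then combined before the final decision. A minimal sketch of one common combination rule, weighted soft voting, is shown below; the six class names follow CMU-MOSEI's emotion labels, but the probability vectors, weights, and function names are illustrative assumptions, not values or code from the paper (the paper itself combines models via bagging and stacking).

```python
# Hedged sketch of late fusion over three per-modality classifiers.
# The probability vectors and equal weights are illustrative only.

EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

def late_fusion(per_modality_probs, weights=None):
    """Weighted soft-voting fusion of per-modality class probabilities."""
    n = len(per_modality_probs)
    if weights is None:
        weights = [1.0 / n] * n          # equal weight per modality
    fused = [0.0] * len(EMOTIONS)
    for probs, w in zip(per_modality_probs, weights):
        for i, p in enumerate(probs):
            fused[i] += w * p
    total = sum(fused)                   # renormalise for safety
    return [f / total for f in fused]

# Hypothetical outputs from the text, audio, and image models.
text_p  = [0.60, 0.10, 0.10, 0.05, 0.05, 0.10]
audio_p = [0.40, 0.30, 0.10, 0.05, 0.05, 0.10]
image_p = [0.70, 0.05, 0.05, 0.05, 0.05, 0.10]

fused = late_fusion([text_p, audio_p, image_p])
predicted = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
print(predicted)  # "happiness" for these illustrative inputs
```

A stacking variant, as used in the paper, would instead feed the concatenated per-modality probabilities into a trained meta-classifier rather than a fixed weighted sum.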
Pages: 15
Related Papers (50 records)
  • [1] Real-time music emotion recognition based on multimodal fusion
    Hao, Xingye
    Li, Honghe
    Wen, Yonggang
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 116 : 586 - 600
  • [2] Real-time fear emotion recognition in mice based on multimodal data fusion
    Wang, Hao
    Shi, Zhanpeng
    Hu, Ruijie
    Wang, Xinyi
    Chen, Jian
    Che, Haoyuan
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [3] Deep Feature Extraction and Attention Fusion for Multimodal Emotion Recognition
    Yang, Zhiyi
    Li, Dahua
    Hou, Fazheng
    Song, Yu
    Gao, Qiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2024, 71 (03) : 1526 - 1530
  • [4] Robust CNN for facial emotion recognition and real-time GUI
    Ali I.
    Ghaffar F.
    AIMS Electronics and Electrical Engineering, 2024, 8 (02): : 217 - 236
  • [5] Emotion Recognition and Classification of Film Reviews Based on Deep Learning and Multimodal Fusion
    Na, Risu
    Sun, Ning
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022
  • [6] Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis
    Zhang, Ke
    Li, Yuanqing
    Wang, Jingyu
    Wang, Zhen
    Li, Xuelong
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 1898 - 1902
  • [7] An early fusion approach for multimodal emotion recognition using deep recurrent networks
    Bucur, Beniamin
    Somfelean, Iulia
    Ghiurutan, Alexandru
    Lemnaru, Camelia
    Dinsoreanu, Mihaela
    2018 IEEE 14TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTER COMMUNICATION AND PROCESSING (ICCP), 2018, : 71 - 78
  • [8] Data Fusion for Real-time Multimodal Emotion Recognition through Webcams and Microphones in E-Learning
    Bahreini, Kiavash
    Nadolski, Rob
    Westera, Wim
    INTERNATIONAL JOURNAL OF HUMAN-COMPUTER INTERACTION, 2016, 32 (05) : 415 - 430
  • [9] MULTIMODAL TRANSFORMER FUSION FOR CONTINUOUS EMOTION RECOGNITION
    Huang, Jian
    Tao, Jianhua
    Liu, Bin
    Lian, Zheng
    Niu, Mingyue
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3507 - 3511
  • [10] Multimodal Emotion Recognition Based on Feature Fusion
    Xu, Yurui
    Wu, Xiao
    Su, Hang
    Liu, Xiaorui
    2022 INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2022), 2022, : 7 - 11