Deep CNN with late fusion for real time multimodal emotion recognition

Cited by: 4
Authors
Dixit, Chhavi [1 ]
Satapathy, Shashank Mouli [2 ]
Affiliations
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords
CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking; Sentiment analysis; Model
DOI
10.1016/j.eswa.2023.122579
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition is a fundamental aspect of human communication and plays a crucial role in various domains. This project aims to develop an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used, contributing 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained per modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting its translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, owing to its ability to capture local features and handle variations in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then applied to real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used for testing individual and combined accuracies on CMU-MOSEI. The emotions finally predicted by the architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on a specific domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83 on the CMU-MOSEI dataset.
Pages: 15
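For illustration, the minimal sketch below shows one way the late-fusion step described in the abstract could work: each unimodal model (fastText for text, the 1-D CNN for audio, the 2-D CNN for images) emits class probabilities over the six CMU-MOSEI emotions, and those outputs are combined either by probability averaging (a bagging-style vote) or by a stacking meta-learner. The function names, the logistic-regression meta-learner, and the synthetic data are assumptions for demonstration only, not details taken from the paper.

```python
# Hypothetical late-fusion sketch; not the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

N_CLASSES = 6  # the six CMU-MOSEI emotion classes

def stacking_fusion(text_probs, audio_probs, image_probs, labels):
    # Stacking: concatenate per-modality class probabilities and
    # fit a meta-learner on them (logistic regression is an assumed choice).
    meta_features = np.hstack([text_probs, audio_probs, image_probs])  # (n, 3 * N_CLASSES)
    meta_learner = LogisticRegression(max_iter=1000)
    meta_learner.fit(meta_features, labels)
    return meta_learner

def bagging_fusion(text_probs, audio_probs, image_probs):
    # Bagging-style fusion: average the per-modality probabilities,
    # then take the argmax over the emotion classes.
    return np.mean([text_probs, audio_probs, image_probs], axis=0).argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100

    def fake_probs():
        # Stand-in for the softmax output of one unimodal model.
        p = rng.random((n, N_CLASSES))
        return p / p.sum(axis=1, keepdims=True)

    text_p, audio_p, image_p = fake_probs(), fake_probs(), fake_probs()
    labels = rng.integers(0, N_CLASSES, size=n)

    meta = stacking_fusion(text_p, audio_p, image_p, labels)
    fused = np.hstack([text_p, audio_p, image_p])
    print("stacking:", meta.predict(fused)[:5])
    print("bagging :", bagging_fusion(text_p, audio_p, image_p)[:5])
```

A design point worth noting: late fusion of this kind keeps the three unimodal models independently trainable on their own datasets, which is what enables the cross-dataset training the abstract describes.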
Related Papers (50 records in total)
  • [41] Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition
    Liu, Wei
    Qiu, Jie-Lin
    Zheng, Wei-Long
    Lu, Bao-Liang
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (02) : 715 - 729
  • [42] TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition
    Zhao, Zhengdao
    Wang, Yuhua
    Shen, Guang
    Xu, Yuezhu
    Zhang, Jiayuan
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3771 - 3782
  • [43] SMFNM: Semi-supervised multimodal fusion network with main-modal for real-time emotion recognition in conversations
    Yang, Juan
    Dong, Xuanxiong
    Du, Xu
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (09)
  • [44] A Retrospective CNN-LSVM Hybrid Approach for Multimodal Emotion Recognition
    Gill, Rupali
    Singh, Jaiteg
    Modgill, Aditi
    2022 INTERNATIONAL CONFERENCE ON DECISION AID SCIENCES AND APPLICATIONS (DASA), 2022, : 1281 - 1285
  • [45] Decision-Level Fusion Method for Emotion Recognition using Multimodal Emotion Recognition Information
    Song, Kyu-Seob
    Nho, Young-Hoon
    Seo, Ju-Hwan
    Kwon, Dong-Soo
    2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR), 2018, : 472 - 476
  • [46] Multichannel Fusion Based on modified CNN for Image Emotion Recognition
    Zhao, Juntao
JOURNAL OF COMPUTERS (TAIWAN), 2022, 33 (01) : 13 - 19
  • [47] Multimodal Emotion Recognition Using Deep Neural Networks
    Tang, Hao
    Liu, Wei
    Zheng, Wei-Long
    Lu, Bao-Liang
    NEURAL INFORMATION PROCESSING (ICONIP 2017), PT IV, 2017, 10637 : 811 - 819
  • [48] DEEP MULTIMODAL LEARNING FOR EMOTION RECOGNITION IN SPOKEN LANGUAGE
    Gu, Yue
    Chen, Shuhong
    Marsic, Ivan
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5079 - 5083
  • [49] Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations
    Meng, Tao
    Shou, Yuntao
    Ai, Wei
    Yin, Nan
    Li, Keqin
IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2024, 5 (12) : 6472 - 6487
  • [50] Multimodal Arabic emotion recognition using deep learning
    Al Roken, Noora
    Barlas, Gerassimos
    SPEECH COMMUNICATION, 2023, 155