Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition

Cited by: 24
Authors
Farhoudi, Zeinab [1 ]
Setayeshi, Saeed [2 ]
Institutions
[1] Islamic Azad Univ, Dept Comp Engn, Sci & Res Branch, Tehran, Iran
[2] Amirkabir Univ Technol, Dept Energy Engn & Phys, Tehran, Iran
Keywords
Audio-visual emotion recognition; Brain emotional learning; Deep learning; Convolutional neural networks; Mixture of networks; Multimodal fusion
DOI
10.1016/j.specom.2020.12.001
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Multimodal emotion recognition is a challenging task because different modalities express emotions over a specific time span in video clips. Considering the spatial-temporal correlation present in video, we propose an audio-visual fusion model that combines deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model consists of two stages. First, deep learning methods, specifically a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), are applied to extract highly abstract features. Second, the fusion model, MoBEL, is designed to learn the previously joined audio-visual features simultaneously. For the visual modality, a 3D-CNN is used to learn the spatial-temporal features of visual expression. For the auditory modality, Mel-spectrograms of the speech signals are fed into a CNN-RNN for spatial-temporal feature extraction. The high-level feature fusion with the MoBEL network exploits the correlation between the visual and auditory modalities to improve emotion recognition performance. Experimental results on the eNTERFACE'05 database demonstrate that the proposed method outperforms hand-crafted features and other state-of-the-art information fusion models in video emotion recognition.
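The fusion stage described in the abstract can be read as a mixture-of-experts combiner: a gate weights several expert networks over the concatenated audio and visual deep features. The sketch below is illustrative only; the feature dimensions, random weights, and softmax gate are assumptions for demonstration, not the authors' trained BEL expert networks (eNTERFACE'05 has six basic emotion classes, which fixes the output size).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical deep features (dimensions are assumptions):
audio_feat = rng.standard_normal(128)   # e.g. CNN-RNN output for the Mel-spectrogram
visual_feat = rng.standard_normal(256)  # e.g. 3D-CNN output for the visual stream

# High-level feature fusion: concatenate modality features
fused = np.concatenate([audio_feat, visual_feat])

n_experts, n_classes = 4, 6  # six basic emotions in eNTERFACE'05
W_gate = rng.standard_normal((n_experts, fused.size)) * 0.01
W_experts = rng.standard_normal((n_experts, n_classes, fused.size)) * 0.01

gate = softmax(W_gate @ fused)                         # mixture weights, sum to 1
expert_out = np.stack([softmax(W @ fused) for W in W_experts])  # (n_experts, n_classes)
probs = gate @ expert_out                              # combined emotion posterior

assert np.isclose(probs.sum(), 1.0)
```

Each expert here stands in for one brain-emotional-learning network; in the paper the experts and gate are trained jointly on the fused features rather than drawn at random.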
Pages: 92-103 (12 pages)
Related papers (50 total)
  • [41] Audio-visual emotion fusion (AVEF): A deep efficient weighted approach
    Ma, Yaxiong
    Hao, Yixue
    Chen, Min
    Chen, Jincai
    Lu, Ping
    Kosir, Andrej
    INFORMATION FUSION, 2019, 46 : 184 - 192
  • [42] Integration of Deep Bottleneck Features for Audio-Visual Speech Recognition
    Ninomiya, Hiroshi
    Kitaoka, Norihide
    Tamura, Satoshi
    Iribe, Yurie
    Takeda, Kazuya
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 563 - 567
  • [43] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [44] Audio-Visual Emotion Recognition using Gaussian Mixture Models for Face and Voice
    Metallinou, Angeliki
    Lee, Sungbok
    Narayanan, Shrikanth
    ISM: 2008 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, 2008, : 250 - 257
  • [45] Deep Reinforcement Learning for Audio-Visual Gaze Control
    Lathuiliere, Stephane
    Masse, Benoit
    Mesejo, Pablo
    Horaud, Radu
    2018 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2018, : 1555 - 1562
  • [46] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [47] Audio-Visual Emotion Recognition in Video Clips
    Noroozi, Fatemeh
    Marjanovic, Marina
    Njegus, Angelina
    Escalera, Sergio
    Anbarjafari, Gholamreza
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2019, 10 (01) : 60 - 75
  • [48] Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities
    Middya, A. I.
    Nag, B.
    Roy, S.
    Knowledge-Based Systems, 2022, 244
  • [49] Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies
    Chu, Eric
    Roy, Deb
    2017 17TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2017, : 829 - 834
  • [50] To Join or Not to Join: A Study on the Impact of Joint or Unimodal Representation Learning on Audio-Visual Emotion Recognition
    Hajavi, Amirhossein
    Singh, Harmanpreet
    Fashandi, Homa
    2024 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN 2024, 2024,