Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Cited by: 18
Authors
Ma, Fei [1 ]
Zhang, Wei [1 ]
Li, Yang [1 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020, Vol. 10, Issue 20
Keywords
audio-visual emotion recognition; common information; HGR maximal correlation; semi-supervised learning; FEATURES; CLASSIFICATION; FRAMEWORK
DOI
10.3390/app10207239
Chinese Library Classification
O6 [Chemistry]
Discipline Classification Code
0703
Abstract
Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired during the expression of emotions. It is crucial for affect-related human-machine interaction systems, as it enables machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although progress has been made by previous works, most of them ignore the common information between audio and visual data during feature learning, which may limit performance since these two modalities are highly correlated in terms of their emotional content. To address this issue, we propose a deep learning approach that efficiently utilizes common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These networks are trained with a joint loss that combines (i) a correlation loss based on the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information among the audio data, the visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.
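The joint loss described in the abstract pairs a classification term with an HGR maximal-correlation term. The sketch below is an illustration only, not the authors' released code: it uses the Soft-HGR surrogate for the correlation between audio and visual feature batches, and the function names, the weight lam, and the omission of the label-correlation term are assumptions made for brevity.

import torch
import torch.nn.functional as F

def soft_hgr_correlation(f, g):
    # Soft-HGR surrogate of the HGR maximal correlation between two
    # mini-batches of features f, g of shape (batch, dim); larger = more correlated.
    f = f - f.mean(dim=0, keepdim=True)    # zero-mean the features
    g = g - g.mean(dim=0, keepdim=True)
    n = f.size(0)
    inner = (f * g).sum() / (n - 1)        # empirical E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (n - 1)            # feature covariance of f
    cov_g = g.t() @ g / (n - 1)            # feature covariance of g
    return inner - 0.5 * torch.trace(cov_f @ cov_g)

def joint_loss(audio_feat, visual_feat, logits, labels, lam=0.1):
    # Cross-entropy captures discriminative information; subtracting the
    # weighted correlation term means that minimizing the total loss also
    # maximizes the audio-visual common information.
    cls = F.cross_entropy(logits, labels)
    corr = soft_hgr_correlation(audio_feat, visual_feat)
    return cls - lam * corr

In a training loop, audio_feat and visual_feat would be the outputs of the audio and visual networks and logits the output of the fusion network; the paper additionally correlates the features with the emotion labels and extends the scheme to unlabeled samples, which this sketch omits.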
Pages: 1-23
Page count: 23
Related Papers
50 records in total
  • [31] ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition
    Kim, Yelin
    Provost, Emily Mower
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2019, 10 (02) : 196 - 208
  • [32] Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
    Praveen, R. Gnana
    Granger, Eric
    Cardinal, Patrick
    2021 16TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2021), 2021,
  • [33] Audio-Visual Biometric Recognition Via Joint Sparse Representations
    Primorac, Rudi
    Togneri, Roberto
    Bennamoun, Mohammed
    Sohel, Ferdous
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 3031 - 3035
  • [34] Construction of Japanese Audio-Visual Emotion Database and Its Application in Emotion Recognition
    Lubis, Nurul
    Gomez, Randy
    Sakti, Sakriani
    Nakamura, Keisuke
    Yoshino, Koichiro
    Nakamura, Satoshi
    Nakadai, Kazuhiro
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 2180 - 2184
  • [35] Multistage information fusion for audio-visual speech recognition
    Chu, SM
    Libal, V
    Marcheret, E
    Neti, C
    Potamianos, G
    2004 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1-3, 2004, : 1651 - 1654
  • [36] A new information fusion method for SVM-Based robotic audio-visual emotion recognition
    Han, Meng-Ju
    Hsu, Jing-Huai
    Song, Kai-Tai
    Chang, Fuh-Yu
    2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007, : 2464+
  • [37] Information Fusion Techniques in Audio-Visual Speech Recognition
    Karabalkan, H.
    Erdogan, H.
    2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 734 - 737
  • [38] Deep Learning for Audio Visual Emotion Recognition
    Hussain, T.
    Wang, W.
    Bouaynaya, N.
    Fathallah-Shaykh, H.
    Mihaylova, L.
    2022 25TH INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION 2022), 2022,
  • [39] To Join or Not to Join: A Study on the Impact of Joint or Unimodal Representation Learning on Audio-Visual Emotion Recognition
    Hajavi, Amirhossein
    Singh, Harmanpreet
    Fashandi, Homa
    2024 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN 2024, 2024,
  • [40] Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
    Takashima, Akihiko
    Masumura, Ryo
    Ando, Atsushi
    Yamazaki, Yoshihiro
    Uchida, Mihiro
    Orihashi, Shota
    INTERSPEECH 2022, 2022, : 4740 - 4744