Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Cited by: 18
Authors
Ma, Fei [1]
Zhang, Wei [1]
Li, Yang [1]
Huang, Shao-Lun [1]
Zhang, Lin [1]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 20
Keywords
audio-visual emotion recognition; common information; HGR maximal correlation; semi-supervised learning; FEATURES; CLASSIFICATION; FRAMEWORK;
DOI
10.3390/app10207239
Chinese Library Classification (CLC)
O6 [Chemistry];
Discipline Classification Code
0703
Abstract
Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired during the expression of emotions. It is crucial for affect-aware human-machine interaction, as it enables machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although previous works have made progress, most of them ignore the common information between audio and visual data during feature learning, which may limit performance since these two modalities are highly correlated in terms of their emotional content. To address this issue, we propose a deep learning approach that efficiently exploits common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained with a joint loss that combines: (i) a correlation loss based on the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation, which extracts common information between the audio data, the visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.
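The abstract states the training objective only in words. As a rough illustration, the following PyTorch sketch shows one plausible instantiation: a soft-HGR-style correlation term (a common differentiable surrogate for the HGR maximal correlation) combined with a cross-entropy classification loss. The soft_hgr helper, the lambda_corr trade-off weight, and the restriction to the audio-visual feature pair are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_hgr(f: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Soft-HGR-style correlation loss between two feature batches.

    f, g: (batch, dim) features from the audio and visual networks.
    Hypothetical stand-in for the paper's correlation loss: the inner-product
    term rewards correlated features, while the covariance-product term acts
    as a soft whitening constraint. Lower values mean higher correlation.
    """
    n = f.size(0)
    f = f - f.mean(dim=0, keepdim=True)          # center each view
    g = g - g.mean(dim=0, keepdim=True)
    inner = (f * g).sum() / (n - 1)              # approximates E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (n - 1)                  # per-view feature covariances
    cov_g = g.t() @ g / (n - 1)
    # (cov_f * cov_g).sum() equals trace(cov_f @ cov_g) for symmetric matrices.
    return -inner + 0.5 * (cov_f * cov_g).sum()

def joint_loss(audio_feat, visual_feat, fused_logits, labels,
               lambda_corr: float = 0.1):
    """Joint objective: classification loss plus weighted correlation loss.

    The abstract's correlation term also involves the emotion labels as a
    third view; only the audio-visual pair is shown here for brevity.
    lambda_corr is a hypothetical weighting, not a value from the paper.
    """
    ce = F.cross_entropy(fused_logits, labels)   # discriminative term (ii)
    corr = soft_hgr(audio_feat, visual_feat)     # common-information term (i)
    return ce + lambda_corr * corr
```

One consequence of this split, consistent with the paper's semi-supervised extension, is that the correlation term needs no labels: unlabeled audio-visual pairs can still contribute to soft_hgr, while only labeled samples feed the cross-entropy term.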
Pages: 1-23
Number of pages: 23
Related Papers
50 items in total
  • [41] Emotion recognition using deep learning approach from audio-visual emotional big data
    Hossain, M. Shamim
    Muhammad, Ghulam
    INFORMATION FUSION, 2019, 49 : 69 - 78
  • [42] Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition
    Goncalves, Lucas
    Chou, Huang-Cheng
    Salman, Ali N.
    Lee, Chi-Chun
    Busso, Carlos
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2025, 6 : 165 - 174
  • [43] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [44] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [45] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [46] Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning
    Chen, Chen
    Hu, Yuchen
    Zhang, Qiang
    Zou, Heqing
    Zhu, Beier
    Chng, Eng Siong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 12607 - +
  • [48] An audio-visual speech recognition with a new mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [49] A Neural Network Architecture for Children's Audio-Visual Emotion Recognition
    Matveev, Anton
    Matveev, Yuri
    Frolova, Olga
    Nikolaev, Aleksandr
    Lyakso, Elena
    MATHEMATICS, 2023, 11 (22)
  • [50] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284