Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Cited by: 18
Authors
Ma, Fei [1 ]
Zhang, Wei [1 ]
Li, Yang [1 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2020, Vol. 10, Issue 20
Keywords
audio-visual emotion recognition; common information; HGR maximal correlation; semi-supervised learning; FEATURES; CLASSIFICATION; FRAMEWORK
DOI
10.3390/app10207239
CLC Number
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data produced when emotions are expressed. It is crucial for affect-aware human-machine interaction, as it enables machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although previous works have made progress, most of them ignore the common information shared between audio and visual data during feature learning, which may limit performance, since the two modalities are highly correlated in their emotional content. To address this issue, we propose a deep learning approach that efficiently exploits this common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from the audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These networks are trained with a joint loss that combines: (i) a correlation loss based on the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts the common information between the audio data, the visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets show that common information significantly enhances the stability of the features learned from the two modalities and improves emotion recognition performance.
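To make the joint loss described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it pairs a cross-entropy classification loss with the Soft-HGR objective, one commonly used differentiable surrogate for the HGR maximal correlation between two feature batches. The function names, feature shapes, and the trade-off weight lam are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def soft_hgr(f: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Soft-HGR surrogate for the HGR maximal correlation between two
    feature batches f, g of shape (batch, dim).

    Objective: trace(E[f g^T]) - 0.5 * trace(cov(f) @ cov(g)).
    Maximizing it over the two feature extractors pushes them toward
    maximally correlated (common) representations.
    """
    f = f - f.mean(dim=0, keepdim=True)  # center each feature dimension
    g = g - g.mean(dim=0, keepdim=True)
    n = f.size(0)
    inner = (f * g).sum() / (n - 1)      # trace of the cross-covariance
    cov_f = f.t() @ f / (n - 1)          # sample covariance of f
    cov_g = g.t() @ g / (n - 1)          # sample covariance of g
    # trace(cov_f @ cov_g) equals the elementwise-product sum
    # because both covariance matrices are symmetric
    return inner - 0.5 * (cov_f * cov_g).sum()


def joint_loss(audio_feat, visual_feat, logits, labels, lam=0.1):
    """Classification loss minus a weighted correlation term.

    audio_feat, visual_feat: (batch, dim) branch-network outputs.
    logits: (batch, num_classes) fusion-network output.
    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    ce = F.cross_entropy(logits, labels)      # discriminative term
    corr = soft_hgr(audio_feat, visual_feat)  # common-information term
    return ce - lam * corr                    # minimize CE, maximize correlation
```

Note that the paper's correlation term also ties each modality to the emotion labels; extending the sketch would add analogous soft_hgr terms between each branch's features and a (hypothetical) label-embedding network, omitted here for brevity.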
Pages: 1-23 (23 pages)