Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Funding
U.S. National Science Foundation;
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
CLC number
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline code
0808; 0809;
Abstract
Audio-visual emotion recognition (AVER) is an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the models built on them derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced nature of human emotion perception, which varies depending on whether annotations are made under unimodal or multimodal stimulus conditions. This study investigates whether AVER system performance can be enhanced by integrating annotations collected at these different levels of stimulus exposure, reflecting varying perceptual evaluations. We propose a two-stage training method that trains models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach assigns each level of annotation to the layers of the model in which the corresponding modality is present, effectively modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best performance in macro- and weighted-F1 scores. Additionally, we measure the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
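The two-stage scheme described in the abstract lends itself to a simple illustration: unimodal branches supervised with labels elicited by audio-only or face-only stimuli, and fused layers supervised with labels elicited by audio-visual stimuli. The following PyTorch sketch is illustrative only and is not the authors' implementation; the feature dimensions, the six-class CREMA-D label space, the BCE multi-label loss, and the names TwoLevelAVER and training_loss are all assumptions made here for concreteness.

import torch
import torch.nn as nn

NUM_EMOTIONS = 6                  # assumption: six categorical emotions, as in CREMA-D
AUDIO_DIM, VISUAL_DIM = 128, 256  # hypothetical input feature dimensions

class TwoLevelAVER(nn.Module):
    """Toy fusion model with unimodal heads and a multimodal head."""
    def __init__(self):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(AUDIO_DIM, 64), nn.ReLU())
        self.visual_branch = nn.Sequential(nn.Linear(VISUAL_DIM, 64), nn.ReLU())
        # Per-modality heads: supervised with unimodal-rated labels.
        self.audio_head = nn.Linear(64, NUM_EMOTIONS)
        self.visual_head = nn.Linear(64, NUM_EMOTIONS)
        # Fusion head: supervised with multimodal-rated (audio-visual) labels.
        self.fusion_head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, NUM_EMOTIONS))

    def forward(self, audio, visual):
        a = self.audio_branch(audio)
        v = self.visual_branch(visual)
        av = self.fusion_head(torch.cat([a, v], dim=-1))
        return self.audio_head(a), self.visual_head(v), av

# Multi-label emotion targets make binary cross-entropy a natural choice.
criterion = nn.BCEWithLogitsLoss()

def training_loss(model, audio, visual, y_audio, y_visual, y_av, stage):
    pa, pv, pav = model(audio, visual)
    if stage == 1:  # stage 1: fit unimodal branches to unimodal-rated labels
        return criterion(pa, y_audio) + criterion(pv, y_visual)
    return criterion(pav, y_av)  # stage 2: fit fusion layers to audio-visual labels

In use, one would first optimize training_loss(..., stage=1) to fit the unimodal heads, then switch to stage=2 so the fusion layers learn from the multimodal-rated labels; the paper's actual backbones, losses, and staging may differ.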
Pages: 165-174
Page count: 10