Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0

Authors:

Goncalves, Lucas [1]

Chou, Huang-Cheng [2]

Salman, Ali N. [1]

Lee, Chi-Chun [2]

Busso, Carlos [1,3]

Affiliations:

[1] Univ Texas Dallas, Richardson, TX 75080 USA

[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan

[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA

Source:

IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025 / Vol. 6

Funding:

U.S. National Science Foundation;

Keywords:

Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;

DOI:

10.1109/OJSP.2025.3530274

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies when annotations are made under different emotional stimuli conditions, whether through unimodal or multimodal stimuli. This study investigates the potential for enhanced AVER system performance by integrating diverse levels of annotation stimuli, reflective of varying perceptual evaluations. We propose a two-stage training method to train models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach utilizes different levels of annotation stimuli according to which modality is present within different layers of the model, effectively modeling annotation at the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct the experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieved the best performances in macro-/weighted-F1 scores. Additionally, we measure the model calibration, performance bias, and fairness metrics of the AVER systems with respect to age, gender, and race.

Pages: 165-174

Number of pages: 10
