Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (NSF)
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Audio-visual emotion recognition (AVER) is an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the models trained on them derive their ground truth from annotations that raters provide after watching the audio-visual stimuli. This conventional method, however, neglects the nuance of human emotion perception, which differs depending on the stimulus condition under which annotations are made, whether unimodal or multimodal. This study investigates whether AVER performance can be improved by integrating annotations collected under these different stimulus conditions, reflecting the corresponding perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation according to which modality is present in a given layer of the model, supervising unimodal and multimodal layers with matching labels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best performance in macro- and weighted-F1 scores. Additionally, we measure model calibration and assess the performance bias and fairness of the AVER systems with respect to age, gender, and race.
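The abstract describes the method only at a high level. The following is a minimal sketch of the core idea, assuming a PyTorch implementation in which per-sample targets y_audio, y_face, and y_av hold labels elicited by audio-only, face-only, and audio-visual stimuli, respectively; unimodal encoders are supervised with unimodal-rated labels and the fusion layers with audio-visual-rated labels. Module names, feature dimensions, the auxiliary loss weight, and the exact staging are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the staging, loss weights, and module names are
# assumptions made for clarity, not the authors' released implementation.
import torch
import torch.nn as nn

NUM_EMOTIONS = 6  # CREMA-D uses six categorical emotions

class AVERModel(nn.Module):
    """Unimodal branches plus a fusion block, each with its own classifier head."""
    def __init__(self, audio_dim=40, visual_dim=512, hidden=128):
        super().__init__()
        # Layers where only one modality is present: supervised with
        # labels elicited by audio-only or face-only stimuli.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.audio_head = nn.Linear(hidden, NUM_EMOTIONS)
        self.visual_head = nn.Linear(hidden, NUM_EMOTIONS)
        # Layers where both modalities are present: supervised with
        # labels elicited by audio-visual stimuli.
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.av_head = nn.Linear(hidden, NUM_EMOTIONS)

    def forward(self, audio, visual):
        ha = self.audio_enc(audio)
        hv = self.visual_enc(visual)
        fused = self.fusion(torch.cat([ha, hv], dim=-1))
        return self.audio_head(ha), self.visual_head(hv), self.av_head(fused)

model = AVERModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# CrossEntropyLoss accepts class indices or, in recent PyTorch versions,
# soft targets such as per-rater label distributions.
criterion = nn.CrossEntropyLoss()

def train_step(batch, stage):
    """One optimization step; `stage` selects which labels drive the loss."""
    pa, pv, pav = model(batch["audio"], batch["visual"])
    if stage == 1:
        # Stage 1 (assumed): fit the unimodal branches to unimodal-rated labels.
        loss = criterion(pa, batch["y_audio"]) + criterion(pv, batch["y_face"])
    else:
        # Stage 2 (assumed): fit the fusion layers to audio-visual-rated labels,
        # keeping the unimodal losses as weighted auxiliary terms.
        loss = (criterion(pav, batch["y_av"])
                + 0.5 * (criterion(pa, batch["y_audio"])
                         + criterion(pv, batch["y_face"])))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

For the reported scores, macro- and weighted-F1 can be computed with scikit-learn's f1_score(y_true, y_pred, average="macro") and average="weighted"; the calibration, bias, and fairness analyses mentioned in the abstract are beyond this sketch.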
Pages: 165-174 (10 pages)