Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Cited by: 8
Authors:
Praveen, R. Gnana [1 ]
Cardinal, Patrick [1 ]
Granger, Eric [1 ]
Affiliation:
[1] Ecole Technol Super, Dept Syst Engn, Lab Imagerie Vis & Intelligence Artificielle, Montreal, PQ H3C 1K3, Canada
Funding:
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords:
Dimensional emotion recognition; deep learning; multimodal fusion; joint representation; cross-attention; SPEECH; ROBUST;
DOI:
10.1109/TBIOM.2022.3233083
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional state in the valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and effectively leverages the inter-modal relationships while retaining the intra-modal ones. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. Feeding the joint A-V feature representation into the cross-attention module helps to simultaneously exploit both the intra- and inter-modal relationships, thereby significantly improving performance over the vanilla cross-attention module. The effectiveness of the proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-CrossAttention-for-Audio-Visual-Fusion.
Pages: 360-373 (14 pages)
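
As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch sketch of correlation-based joint cross-attention fusion. It is written from the abstract alone, not from the authors' released code: all module and parameter names (JointCrossAttentionFusion, proj_joint, proj_a, proj_v, head) are hypothetical, and details such as the residual connections, softmax normalization, and pooling are assumptions.

import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    """Hypothetical sketch: attend each modality with weights derived from
    its correlation with the joint (concatenated) A-V representation."""

    def __init__(self, d: int):
        super().__init__()
        self.proj_joint = nn.Linear(2 * d, d, bias=False)  # joint A-V projection
        self.proj_a = nn.Linear(d, d, bias=False)          # audio projection
        self.proj_v = nn.Linear(d, d, bias=False)          # visual projection
        self.head = nn.Linear(2 * d, 2)                    # -> (valence, arousal)
        self.scale = d ** -0.5                             # dot-product scaling

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a, x_v: (batch, seq, d) audio / visual feature sequences.
        j = self.proj_joint(torch.cat([x_a, x_v], dim=-1))  # joint representation

        # Cross-attention weights from the correlation between each
        # individual modality and the joint representation.
        attn_a = torch.softmax(self.proj_a(x_a) @ j.transpose(1, 2) * self.scale, dim=-1)
        attn_v = torch.softmax(self.proj_v(x_v) @ j.transpose(1, 2) * self.scale, dim=-1)

        # Attend over the joint features; residual connections retain
        # each modality's intra-modal information.
        h_a = x_a + attn_a @ j
        h_v = x_v + attn_v @ j

        fused = torch.cat([h_a, h_v], dim=-1)   # (batch, seq, 2d)
        return self.head(fused.mean(dim=1))     # per-clip (valence, arousal)

# Example usage:
#   fuse = JointCrossAttentionFusion(d=128)
#   preds = fuse(torch.randn(4, 16, 128), torch.randn(4, 16, 128))  # (4, 2)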