Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Cited by: 8
Authors
Praveen, R. Gnana [1 ]
Cardinal, Patrick [1 ]
Granger, Eric [1 ]
Affiliations
[1] Ecole Technol Super, Dept Syst Engn, Lab Imagerie Vis & Intelligence Artificielle, Montreal, PQ H3C 1K3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Dimensional emotion recognition; deep learning; multimodal fusion; joint representation; cross-attention; SPEECH; ROBUST;
DOI
10.1109/TBIOM.2022.3233083
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and effectively leverages the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both intra- and inter-modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-CrossAttention-for-Audio-Visual-Fusion.
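The fusion mechanism described in the abstract can be sketched roughly as follows: the audio and visual feature sequences are concatenated into a joint representation, each modality's attention weights are computed from its correlation with that joint representation, and the attended features are fused for prediction. This is a simplified NumPy illustration, not the authors' implementation; the weight matrices `W_a`, `W_v`, the `tanh` correlation, and all dimensions are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(X_a, X_v, W_a, W_v):
    """Sketch of joint cross-attentional A-V fusion (simplified).

    X_a: (L, d_a) audio features over L clips; X_v: (L, d_v) visual features.
    The joint representation J concatenates both modalities, so correlating
    each modality with J exposes both intra- and inter-modal relationships.
    """
    J = np.concatenate([X_a, X_v], axis=1)          # (L, d_a + d_v) joint representation
    d = J.shape[1]
    C_a = np.tanh(X_a @ W_a @ J.T / np.sqrt(d))     # (L, L) audio-joint correlation
    C_v = np.tanh(X_v @ W_v @ J.T / np.sqrt(d))     # (L, L) visual-joint correlation
    A_a = softmax(C_a, axis=-1) @ X_a               # attended audio features
    A_v = softmax(C_v, axis=-1) @ X_v               # attended visual features
    return np.concatenate([A_a, A_v], axis=1)       # fused A-V representation

# Toy dimensions, random stand-ins for learned features and weights.
L, d_a, d_v = 8, 16, 32
X_a = rng.standard_normal((L, d_a))
X_v = rng.standard_normal((L, d_v))
W_a = rng.standard_normal((d_a, d_a + d_v)) * 0.1   # learnable in practice
W_v = rng.standard_normal((d_v, d_a + d_v)) * 0.1
fused = joint_cross_attention(X_a, X_v, W_a, W_v)
print(fused.shape)                                  # (8, 48)
```

In the paper's actual model the fused representation would feed a regressor that predicts valence and arousal per clip; here the point is only that the attention weights for each modality are derived from its correlation with the joint representation rather than with the other modality alone.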
Pages: 360 - 373
Number of pages: 14
Related Papers
50 records
  • [1] A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    de Melo, Wheidima Carneiro
    Ullah, Nasib
    Aslam, Haseeb
    Zeeshan, Osama
    Denorme, Theo
    Pedersoli, Marco
    Koerich, Alessandro L.
    Bacon, Simon
    Cardinal, Patrick
    Granger, Eric
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 2485 - 2494
  • [2] Audio-Visual Speaker Verification via Joint Cross-Attention
    Rajasekhar, Gnana Praveen
    Alam, Jahangir
    [J]. SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 18 - 31
  • [3] Joint modelling of audio-visual cues using attention mechanisms for emotion recognition
    Ghaleb, Esam
    Niehues, Jan
    Asteriadis, Stylianos
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (08) : 11239 - 11264
  • [4] Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
    Praveen, R. Gnana
    Granger, Eric
    Cardinal, Patrick
    [J]. 2021 16TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2021), 2021,
  • [5] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    [J]. AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [6] Emotion Recognition in Valence-Arousal Scale by Using Physiological Signals
    Akalin, Neziha
    Kose, Hatice
    [J]. 2018 26TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2018,
  • [7] Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    Alam, Jahangir
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2024, 18 (03) : 444 - 458
  • [8] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
    Qian, Xinyuan
    Wang, Zhengdong
    Wang, Jiadong
    Guan, Guohui
    Li, Haizhou
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 550 - 562
  • [9] Mutual Cross-Attention in Dyadic Fusion Networks for Audio-Video Emotion Recognition
    Luo, Jiachen
    Phan, Huy
    Wang, Lin
    Reiss, Joshua
    [J]. 2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2023,