A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Cited by: 22
Authors
Praveen, R. Gnana [1 ]
de Melo, Wheidima Carneiro [1 ]
Ullah, Nasib [1 ]
Aslam, Haseeb [1 ]
Zeeshan, Osama [1 ]
Denorme, Theo [1 ]
Pedersoli, Marco [1 ]
Koerich, Alessandro L. [1 ]
Bacon, Simon [2 ]
Cardinal, Patrick [1 ]
Granger, Eric [1 ]
Affiliations
[1] Ecole Technol Super, LIVIA, Montreal, PQ, Canada
[2] Concordia Univ, Dept Hlth Kinesiol & Appl Physiol, Montreal, PQ, Canada
Keywords
DOI
10.1109/CVPRW56347.2022.00278
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202 ;
Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that effectively exploits the complementary inter-modal relationships, allowing for accurate prediction of valence and arousal. In particular, the model computes cross-attention weights based on the correlation between a joint feature representation and the features of the individual modalities. By deploying the joint A-V feature representation in the cross-attention module, our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of the proposed A-V fusion model: it achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) for valence and 0.363 (0.584) for arousal on the test set (validation set). This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-Wild (ABAW3) 2022 challenge, which yields a CCC of 0.180 (0.310) and 0.170 (0.170).
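The abstract's core idea, attending to each modality with weights derived from its correlation with a joint A-V representation, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection matrices are random stand-ins for learned weights, and the shapes, scaling, and softmax placement are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

L, d = 8, 16                        # sequence length, per-modality feature size
Xa = rng.standard_normal((L, d))    # audio features (e.g., from a spectrogram backbone)
Xv = rng.standard_normal((L, d))    # visual features (e.g., from a facial backbone)

# Joint representation: concatenate both modalities along the feature axis.
J = np.concatenate([Xa, Xv], axis=1)          # (L, 2d)

# Learnable projections (random here, purely for illustration).
Wa = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
Wv = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention weights from the correlation between the joint
# representation and each individual modality.
Aa = softmax((J @ Wa) @ Xa.T / np.sqrt(d))    # (L, L)
Av = softmax((J @ Wv) @ Xv.T / np.sqrt(d))    # (L, L)

# Attended features: each modality re-weighted by joint-guided attention.
Xa_att = Aa @ Xa                              # (L, d)
Xv_att = Av @ Xv                              # (L, d)

# Fused representation fed to a valence/arousal regressor (not shown).
fused = np.concatenate([Xa_att, Xv_att], axis=1)   # (L, 2d)
print(fused.shape)
```

The key difference from vanilla cross-attention is that the attention scores are computed against the joint representation `J` rather than against the other modality alone, so each modality's weighting also reflects intra-modal context.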
Pages: 2485-2494 (10 pages)
Related Papers
50 records
  • [1] Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
    Praveen, R. Gnana
    Cardinal, Patrick
    Granger, Eric
    [J]. IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, 2023, 5(3): 360-373
  • [2] Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
    Praveen, R. Gnana
    Granger, Eric
    Cardinal, Patrick
    [J]. 2021 16TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2021), 2021.
  • [3] Audio-Visual Speaker Verification via Joint Cross-Attention
    Rajasekhar, Gnana Praveen
    Alam, Jahangir
    [J]. SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339: 18-31
  • [4] Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    Alam, Jahangir
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2024, 18(3): 444-458
  • [5] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    [J]. AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018: 27-32
  • [7] Joint modelling of audio-visual cues using attention mechanisms for emotion recognition
    Ghaleb, Esam
    Niehues, Jan
    Asteriadis, Stylianos
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82(8): 11239-11264
  • [8] Audio-Visual Cross-Attention Network for Robotic Speaker Tracking
    Qian, Xinyuan
    Wang, Zhengdong
    Wang, Jiadong
    Guan, Guohui
    Li, Haizhou
    [J]. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2023, 31: 550-562
  • [9] Mutual Cross-Attention in Dyadic Fusion Networks for Audio-Video Emotion Recognition
    Luo, Jiachen
    Phan, Huy
    Wang, Lin
    Reiss, Joshua
    [J]. 2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2023.
  • [10] Fusion of Classifier Predictions for Audio-Visual Emotion Recognition
    Noroozi, Fatemeh
    Marjanovic, Marina
    Njegus, Angelina
    Escalera, Sergio
    Anbarjafari, Gholamreza
    [J]. 2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016: 61-66