Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Cited by: 8

Authors:

Praveen, R. Gnana [1]

Cardinal, Patrick [1]

Granger, Eric [1]

Affiliation:

[1] Ecole Technol Super, Dept Syst Engn, Lab Imagerie Vis & Intelligence Artificielle, Montreal, PQ H3C 1K3, Canada

Source:

IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE | 2023, Vol. 5, No. 3

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

Dimensional emotion recognition; deep learning; multimodal fusion; joint representation; cross-attention; SPEECH; ROBUST;

DOI:

10.1109/TBIOM.2022.3233083

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and allows the inter-modal relationships to be leveraged effectively while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation in the cross-attention module helps to leverage both the intra- and inter-modal relationships simultaneously, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-CrossAttention-for-Audio-Visual-Fusion.
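
The abstract describes cross-attention weights computed from the correlation between a joint A-V representation and each individual modality. The NumPy sketch below illustrates that idea only; it is not the authors' implementation (see the repository linked above for that). The function names, the tanh-and-softmax scoring, the residual connection, the weight shapes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_a, w_v):
    """Illustrative joint cross-attention over L time steps (assumed form).

    x_a: (L, d_a) audio features, x_v: (L, d_v) visual features.
    w_a: (d_a + d_v, d_a) and w_v: (d_a + d_v, d_v) are hypothetical
    learnable projections; here they are plain arrays.
    """
    # Joint representation: concatenate both modalities per time step.
    joint = np.concatenate([x_a, x_v], axis=1)          # (L, d_a + d_v)
    scale = np.sqrt(joint.shape[1])

    # Correlation of the joint representation with each modality.
    corr_a = np.tanh(joint @ w_a @ x_a.T / scale)       # (L, L)
    corr_v = np.tanh(joint @ w_v @ x_v.T / scale)       # (L, L)

    # Each row becomes a distribution over that modality's time steps.
    attn_a = softmax(corr_a, axis=-1)
    attn_v = softmax(corr_v, axis=-1)

    # Attended features, with a residual connection to the inputs.
    att_a = x_a + attn_a @ x_a                          # (L, d_a)
    att_v = x_v + attn_v @ x_v                          # (L, d_v)

    fused = np.concatenate([att_a, att_v], axis=1)      # (L, d_a + d_v)
    return fused, attn_a, attn_v


# Toy usage with random features (no real audio or video involved).
rng = np.random.default_rng(0)
L, d_a, d_v = 8, 4, 6
x_a = rng.standard_normal((L, d_a))
x_v = rng.standard_normal((L, d_v))
w_a = rng.standard_normal((d_a + d_v, d_a)) * 0.1
w_v = rng.standard_normal((d_a + d_v, d_v)) * 0.1

fused, attn_a, attn_v = joint_cross_attention(x_a, x_v, w_a, w_v)
print(fused.shape)  # (8, 10)
```

Because the joint representation contains both modalities, each correlation matrix mixes a same-modality term (intra-modal) with an other-modality term (inter-modal), which is the property the abstract attributes to the joint formulation. In a full system the fused features would feed a regressor for valence and arousal, which this sketch omits.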


Pages: 360-373

Page count: 14

