Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Cited by: 8
Authors
Praveen, R. Gnana [1 ]
Cardinal, Patrick [1 ]
Granger, Eric [1 ]
Affiliations
[1] Ecole Technol Super, Dept Syst Engn, Lab Imagerie Vis & Intelligence Artificielle, Montreal, PQ H3C 1K3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Dimensional emotion recognition; deep learning; multimodal fusion; joint representation; cross-attention; SPEECH; ROBUST;
DOI
10.1109/TBIOM.2022.3233083
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, allowing the model to effectively leverage the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both the intra- and inter-modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-CrossAttention-for-Audio-Visual-Fusion.
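The abstract's core idea (attention weights derived from the correlation between a concatenated joint A-V representation and each modality's own features) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch only: the function name, tensor shapes, scaling, and random weight matrices are assumptions for demonstration, not the authors' actual implementation (see their linked repository for that).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv, Wa, Wv):
    """Hypothetical sketch of joint cross-attention fusion.

    Xa: audio features, shape (T, d); Xv: visual features, shape (T, d).
    Wa, Wv: learnable projection matrices, shape (2d, d) (random here).
    """
    # Joint representation: concatenation of both modalities.
    J = np.concatenate([Xa, Xv], axis=1)                  # (T, 2d)
    # Cross-attention weights from correlation of the joint
    # representation with each individual modality.
    scale = np.sqrt(Xa.shape[1])
    Ca = softmax(J @ Wa @ Xa.T / scale)                   # (T, T)
    Cv = softmax(J @ Wv @ Xv.T / scale)                   # (T, T)
    # Attended per-modality features, then fused by concatenation.
    Xa_att = Ca @ Xa                                      # (T, d)
    Xv_att = Cv @ Xv                                      # (T, d)
    return np.concatenate([Xa_att, Xv_att], axis=1)       # (T, 2d)

rng = np.random.default_rng(0)
T, d = 8, 16
Xa = rng.normal(size=(T, d))
Xv = rng.normal(size=(T, d))
Wa = rng.normal(size=(2 * d, d)) * 0.1
Wv = rng.normal(size=(2 * d, d)) * 0.1
fused = joint_cross_attention(Xa, Xv, Wa, Wv)
print(fused.shape)  # (8, 32)
```

Because the attention weights for each modality are conditioned on the joint representation rather than on the other modality alone, the attended features retain intra-modal structure while still being modulated by inter-modal correlation, which is the distinction the abstract draws against vanilla cross-attention.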
Pages: 360-373
Number of pages: 14
Related Papers
50 records in total
  • [21] Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition
    Guo, Peini
    Chen, Zhengyan
    Li, Yidi
    Liu, Hong
    [J]. ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 : 315 - 326
  • [22] Multi-scale network with shared cross-attention for audio-visual correlation learning
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Li, Wei
    Wu, Jianming
    [J]. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (27): : 20173 - 20187
  • [23] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumoto, Keiichi
    Minker, Wolfgang
    [J]. PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [24] DISENTANGLEMENT FOR AUDIO-VISUAL EMOTION RECOGNITION USING MULTITASK SETUP
    Peri, Raghuveer
    Parthasarathy, Srinivas
    Bradshaw, Charles
    Sundaram, Shiva
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6344 - 6348
  • [25] Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition
    Zhou, Hengshun
    Du, Jun
    Zhang, Yuanyuan
    Wang, Qing
    Liu, Qing-Feng
    Lee, Chin-Hui
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2617 - 2629
  • [26] Dense Graph Convolutional With Joint Cross-Attention Network for Multimodal Emotion Recognition
    Cheng, Cheng
    Liu, Wenzhe
    Feng, Lin
    Jia, Ziyu
    [J]. IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, : 6672 - 6683
  • [27] Emotion Recognition From Audio-Visual Data Using Rule Based Decision Level Fusion
    Sahoo, Subhasmita
    Routray, Aurobinda
    [J]. PROCEEDINGS OF THE 2016 IEEE STUDENTS' TECHNOLOGY SYMPOSIUM (TECHSYM), 2016, : 7 - 12
  • [28] Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion
    Zhang, Su
    Ding, Yi
    Wei, Ziquan
    Guan, Cuntai
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3560 - 3567
  • [29] Feature and Decision Level Audio-visual Data Fusion in Emotion Recognition Problem
    Sidorov, Maxim
    Sopov, Evgenii
    Ivanov, Ilia
    Minker, Wolfgang
    [J]. ICIMCO 2015 PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON INFORMATICS IN CONTROL, AUTOMATION AND ROBOTICS, VOL. 2, 2015, : 246 - 251
  • [30] CATNet: Cross-modal fusion for audio-visual speech recognition
    Wang, Xingmei
    Mi, Jiachen
    Li, Boquan
    Zhao, Yixu
    Meng, Jiaxiang
    [J]. PATTERN RECOGNITION LETTERS, 2024, 178 : 216 - 222