Mutual Cross-Attention in Dyadic Fusion Networks for Audio-Video Emotion Recognition

Cited by: 0
Authors
Luo, Jiachen [1 ]
Phan, Huy [2 ]
Wang, Lin [1 ]
Reiss, Joshua [1 ]
Affiliations
[1] Queen Mary Univ London, Ctr Digital Mus, London, England
[2] Amazon Alexa, Cambridge, MA USA
Source
2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2023
Keywords
affective computing; modality fusion; attention mechanism; deep learning; FEATURES; DATABASES; MODELS;
DOI
10.1109/ACIIW59127.2023.10388147
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal emotion recognition is a challenging problem in the research fields of human-computer interaction and pattern recognition. Efficiently finding a common subspace among heterogeneous multimodal data remains an open problem for audio-video emotion recognition. In this work, we propose an attentive audio-video fusion network for an emotional dialogue system that learns attentive contextual dependency, speaker information, and the interaction between the audio and video modalities. We employ pre-trained models, wav2vec and the Distract your Attention Network, to extract high-level audio and video representations, respectively. Using weighted fusion based on a cross-attention module, the cross-modality encoder focuses on inter-modality relations and selectively captures effective information across the audio and video modalities. Specifically, bidirectional gated recurrent unit models capture long-term contextual information, explore speaker influence, and learn intra- and inter-modal interactions of the audio and video modalities in a dynamic manner. We evaluate the approach on the MELD dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on this dataset.
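The fusion scheme described in the abstract, in which each modality attends to the other via cross-attention, the attended streams are combined by a learned weighted sum, and a bidirectional GRU models context, can be sketched as below. This is an illustrative reconstruction under assumed dimensions and layer choices (feature size 256, 4 attention heads, mean pooling over time), not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: audio and video streams attend to each other via
    cross-attention, then a learned weight fuses the two attended
    representations before a bidirectional GRU models context.
    Dimensions and pooling are illustrative assumptions."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # audio-as-query attending to video, and vice versa
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # fusion weight
        # bidirectional GRU; hidden dim // 2 per direction keeps output at dim
        self.context = nn.GRU(dim, dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, audio, video):
        # audio: (B, Ta, D), video: (B, Tv, D)
        a_att, _ = self.a2v(audio, video, video)  # audio queries video
        v_att, _ = self.v2a(video, audio, audio)  # video queries audio
        # mean-pool over time, then weighted fusion of the two modalities
        fused = self.alpha * a_att.mean(1) + (1 - self.alpha) * v_att.mean(1)
        out, _ = self.context(fused.unsqueeze(1))  # (B, 1, D)
        return out.squeeze(1)                      # (B, D)

model = CrossModalFusion()
a = torch.randn(2, 10, 256)  # e.g. wav2vec utterance features (assumed shape)
v = torch.randn(2, 8, 256)   # e.g. DAN facial features (assumed shape)
z = model(a, v)
print(z.shape)  # torch.Size([2, 256])
```

Because the cross-attention operates over full sequences, the two modalities need not be time-aligned or of equal length; only the feature dimension must match.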
Pages: 7
Related Papers
50 entries in total
  • [1] Audio-Video Fusion with Double Attention for Multimodal Emotion Recognition
    Mocanu, Bogdan
    Tapu, Ruxandra
    2022 IEEE 14TH IMAGE, VIDEO, AND MULTIDIMENSIONAL SIGNAL PROCESSING WORKSHOP (IVMSP), 2022,
  • [2] A CROSS-ATTENTION EMOTION RECOGNITION ALGORITHM BASED ON AUDIO AND VIDEO MODALITIES
    Wu, Xiao
    Mu, Xuan
    Qi, Wen
    Liu, Xiaorui
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 309 - 313
  • [3] Active Speaker Recognition using Cross Attention Audio-Video Fusion
    Mocanu, Bogdan
    Tapu, Ruxandra
    2022 10TH EUROPEAN WORKSHOP ON VISUAL INFORMATION PROCESSING (EUVIP), 2022,
  • [4] Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning
    Mocanu, Bogdan
    Tapu, Ruxandra
    Zaharia, Titus
    IMAGE AND VISION COMPUTING, 2023, 133
  • [5] Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition
    Zhang, Yuanyuan
    Wang, Zi-Rui
    Du, Jun
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [6] Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition
    Zhou, Hengshun
    Meng, Debin
    Zhang, Yuanyuan
    Peng, Xiaojiang
    Du, Jun
    Wang, Kai
    Qiao, Yu
    ICMI'19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2019, : 562 - 566
  • [7] A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    de Melo, Wheidima Carneiro
    Ullah, Nasib
    Aslam, Haseeb
    Zeeshan, Osama
    Denorme, Theo
    Pedersoli, Marco
    Koerich, Alessandro L.
    Bacon, Simon
    Cardinal, Patrick
    Granger, Eric
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 2485 - 2494
  • [8] Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
    Praveen, R. Gnana
    Cardinal, Patrick
    Granger, Eric
    IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, 2023, 5 (03): : 360 - 373
  • [9] MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
    Khan, Mustaqeem
    Gueaieb, Wail
    El Saddik, Abdulmotaleb
    Kwon, Soonil
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 245
  • [10] Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition
    Gu, Yue
    Lyu, Xinyu
    Sun, Weijia
    Li, Weitian
    Chen, Shuhong
    Li, Xinyu
    Marsic, Ivan
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 157 - 165