Mutual Cross-Attention in Dyadic Fusion Networks for Audio-Video Emotion Recognition

Cited by: 0
Authors
Luo, Jiachen [1 ]
Phan, Huy [2 ]
Wang, Lin [1 ]
Reiss, Joshua [1 ]
Affiliations
[1] Queen Mary Univ London, Ctr Digital Mus, London, England
[2] Amazon Alexa, Cambridge, MA USA
Source
2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2023
Keywords
affective computing; modality fusion; attention mechanism; deep learning; features; databases; models
DOI
10.1109/ACIIW59127.2023.10388147
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Multimodal emotion recognition is a challenging problem in human-computer interaction and pattern recognition. Efficiently finding a common subspace for heterogeneous multimodal data remains an open problem in audio-video emotion recognition. In this work, we propose an attentive audio-video fusion network for an emotional dialogue system that learns contextual dependencies, speaker information, and the interaction between the audio and video modalities. We employ the pre-trained models wav2vec and the Distract your Attention Network to extract high-level audio and video representations, respectively. Using weighted fusion based on a cross-attention module, the cross-modality encoder focuses on inter-modality relations and selectively captures effective information across the audio and video modalities. Bidirectional gated recurrent unit models then capture long-term contextual information, model speaker influence, and learn intra- and inter-modal interactions of the audio and video modalities in a dynamic manner. We evaluate the approach on the MELD dataset, and the experimental results show that it achieves state-of-the-art performance.
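For concreteness, the sketch below illustrates in PyTorch the kind of pipeline the abstract describes: pre-extracted wav2vec (audio) and Distract your Attention Network (video) utterance embeddings fused by mutual cross-attention with learnable weights, followed by a bidirectional GRU for dialogue-level context. This is a minimal sketch under assumed settings; the CrossModalFusion class, all layer sizes, the scalar fusion weights, and the classifier head are illustrative assumptions, not the paper's exact architecture.

# Hypothetical sketch of cross-attention audio-video fusion with BiGRU context
# modelling; dimensions and head are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_audio=512, d_video=512, d_model=256, n_heads=4, n_classes=7):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        # Mutual cross-attention: audio attends to video and vice versa.
        self.attn_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable scalar weights for the weighted fusion of the two streams.
        self.fusion_weights = nn.Parameter(torch.ones(2))
        # BiGRU over the utterance sequence captures conversational context.
        self.gru = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio, video):
        # audio: (batch, seq_len, d_audio); video: (batch, seq_len, d_video)
        a = self.proj_a(audio)
        v = self.proj_v(video)
        a2v, _ = self.attn_av(a, v, v)   # audio queries attend to video keys/values
        v2a, _ = self.attn_va(v, a, a)   # video queries attend to audio keys/values
        w = torch.softmax(self.fusion_weights, dim=0)
        fused = w[0] * a2v + w[1] * v2a  # weighted fusion of the attended streams
        ctx, _ = self.gru(fused)         # contextual modelling across utterances
        return self.classifier(ctx)      # per-utterance emotion logits

# Example: a batch of 2 dialogues, each with 10 utterance-level embeddings.
model = CrossModalFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 7])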
Pages: 7
Related papers (50 in total)
  • [21] Multimodal Cross-Attention Bayesian Network for Social News Emotion Recognition
    Wang, Xinzhi
    Li, Mengyue
    Chang, Yudong
    Luo, Xiangfeng
    Yao, Yige
    Li, Zhichao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [22] Improved TOPSIS method for peak frame selection in audio-video human emotion recognition
    Singh, Lovejit
    Singh, Sarbjeet
    Aggarwal, Naveen
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (05) : 6277 - 6308
  • [24] Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
    Rajan, Vandana
    Brutti, Alessio
    Cavallaro, Andrea
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4693 - 4697
  • [25] Audio-Video Correspondence and Its Role in Attention and Memory
    Grimes, T.
    ETR&D-EDUCATIONAL TECHNOLOGY RESEARCH AND DEVELOPMENT, 1990, 38 (03): 15 - 25
  • [26] Audio-video people recognition system for an intelligent environment
    Anzalone, Salvatore M.
    Menegatti, Emanuele
    Pagello, Enrico
    Yoshikawa, Yuichiro
    Ishiguro, Hiroshi
    Chella, Antonio
    4TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI 2011), 2011, : 237 - 244
  • [27] Behavior recognition of lactating sows using improved AVSlowFast audio-video fusion model
    Li, B.
    Chen, T.
    Zhu, J.
    Nongye Gongcheng Xuebao/Transactions of the Chinese Society of Agricultural Engineering, 2024, 40 (07): 182 - 190
  • [28] CCMA: CapsNet for audio-video sentiment analysis using cross-modal attention
    Li, Haibin
    Guo, Aodi
    Li, Yaqian
    VISUAL COMPUTER, 2025, 41 (03): 1609 - 1620
  • [29] Cross-Attention Transformer for Video Interpolation
    Kim, Hannah Halin
    Yu, Shuzhi
    Yuan, Shuai
    Tomasi, Carlo
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848 : 325 - 342
  • [30] Dense Graph Convolutional With Joint Cross-Attention Network for Multimodal Emotion Recognition
    Cheng, Cheng
    Liu, Wenzhe
    Feng, Lin
    Jia, Ziyu
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, 11 (05): 6672 - 6683