Mutual Cross-Attention in Dyadic Fusion Networks for Audio-Video Emotion Recognition

Cited by: 0
Authors
Luo, Jiachen [1 ]
Phan, Huy [2 ]
Wang, Lin [1 ]
Reiss, Joshua [1 ]
Affiliations
[1] Queen Mary Univ London, Ctr Digital Mus, London, England
[2] Amazon Alexa, Cambridge, MA USA
Source
2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2023
Keywords
affective computing; modality fusion; attention mechanism; deep learning; features; databases; models
DOI
10.1109/ACIIW59127.2023.10388147
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Multimodal emotion recognition is a challenging problem in human-computer interaction and pattern recognition. Efficiently finding a common subspace for heterogeneous multimodal data remains an open problem in audio-video emotion recognition. In this work, we propose an attentive audio-video fusion network for an emotional dialogue system that learns contextual dependencies, speaker information, and the interaction between the audio and video modalities. We employ the pre-trained models wav2vec and the Distract your Attention Network to extract high-level audio and video representations, respectively. Using weighted fusion based on a cross-attention module, the cross-modality encoder focuses on inter-modality relations and selectively captures effective information across the audio and video modalities. Bidirectional gated recurrent unit models then capture long-term contextual information, model speaker influence, and learn intra- and inter-modal interactions of the audio and video modalities in a dynamic manner. We evaluate the approach on the MELD dataset, and the experimental results show that it achieves state-of-the-art performance.
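For concreteness, the sketch below illustrates in PyTorch the kind of pipeline the abstract describes: pre-extracted wav2vec (audio) and Distract your Attention Network (video) utterance embeddings fused by mutual cross-attention with learnable weights, followed by a bidirectional GRU for dialogue-level context. This is a minimal sketch under assumed settings; the CrossModalFusion class, all layer sizes, the scalar fusion weights, and the classifier head are illustrative assumptions, not the paper's exact architecture.

# Hypothetical sketch of cross-attention audio-video fusion with BiGRU context
# modelling; dimensions and head are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_audio=512, d_video=512, d_model=256, n_heads=4, n_classes=7):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        # Mutual cross-attention: audio attends to video and vice versa.
        self.attn_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable scalar weights for the weighted fusion of the two streams.
        self.fusion_weights = nn.Parameter(torch.ones(2))
        # BiGRU over the utterance sequence captures conversational context.
        self.gru = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio, video):
        # audio: (batch, seq_len, d_audio); video: (batch, seq_len, d_video)
        a = self.proj_a(audio)
        v = self.proj_v(video)
        a2v, _ = self.attn_av(a, v, v)   # audio queries attend to video keys/values
        v2a, _ = self.attn_va(v, a, a)   # video queries attend to audio keys/values
        w = torch.softmax(self.fusion_weights, dim=0)
        fused = w[0] * a2v + w[1] * v2a  # weighted fusion of the attended streams
        ctx, _ = self.gru(fused)         # contextual modelling across utterances
        return self.classifier(ctx)      # per-utterance emotion logits

# Example: a batch of 2 dialogues, each with 10 utterance-level embeddings.
model = CrossModalFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 7])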
Pages: 7
Related papers (50 in total)
  • [21] Multimodal Cross-Attention Bayesian Network for Social News Emotion Recognition
    Wang, Xinzhi
    Li, Mengyue
    Chang, Yudong
    Luo, Xiangfeng
    Yao, Yige
    Li, Zhichao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [22] Improved TOPSIS method for peak frame selection in audio-video human emotion recognition
    Singh, Lovejit
    Singh, Sarbjeet
    Aggarwal, Naveen
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (05) : 6277 - 6308
  • [24] Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
    Rajan, Vandana
    Brutti, Alessio
    Cavallaro, Andrea
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4693 - 4697
  • [25] Audio-Video Correspondence and Its Role in Attention and Memory
    Grimes, T.
    ETR&D-EDUCATIONAL TECHNOLOGY RESEARCH AND DEVELOPMENT, 1990, 38 (03): 15 - 25
  • [26] Audio-video people recognition system for an intelligent environment
    Anzalone, Salvatore M.
    Menegatti, Emanuele
    Pagello, Enrico
    Yoshikawa, Yuichiro
    Ishiguro, Hiroshi
    Chella, Antonio
    4TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI 2011), 2011, : 237 - 244
  • [27] Behavior recognition of lactating sows using improved AVSlowFast audio-video fusion model
    Li, B.
    Chen, T.
    Zhu, J.
    Nongye Gongcheng Xuebao/Transactions of the Chinese Society of Agricultural Engineering, 2024, 40 (07): 182 - 190
  • [28] CCMA: CapsNet for audio-video sentiment analysis using cross-modal attention
    Li, Haibin
    Guo, Aodi
    Li, Yaqian
    VISUAL COMPUTER, 2025, 41 (03): 1609 - 1620
  • [29] Cross-Attention Transformer for Video Interpolation
    Kim, Hannah Halin
    Yu, Shuzhi
    Yuan, Shuai
    Tomasi, Carlo
    COMPUTER VISION - ACCV 2022 WORKSHOPS, 2023, 13848 : 325 - 342
  • [30] Dense Graph Convolutional With Joint Cross-Attention Network for Multimodal Emotion Recognition
    Cheng, Cheng
    Liu, Wenzhe
    Feng, Lin
    Jia, Ziyu
    IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024, 11 (05): 6672 - 6683