AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition

Cited by: 1
Authors
Das, Avishek [1 ]
Sarma, Moumita Sen [1 ]
Hoque, Mohammed Moshiul [1 ]
Siddique, Nazmul [2 ]
Dewan, M. Ali Akber [3 ]
Affiliations
[1] Chittagong Univ Engn & Technol, Dept Comp Sci & Engn, Chittagong 4349, Bangladesh
[2] Ulster Univ, Sch Comp Engn & Intelligent Syst, Belfast BT15 1AP, Northern Ireland
[3] Athabasca Univ, Fac Sci & Technol, Sch Comp & Informat Syst, Athabasca, AB T9S 3A3, Canada
Keywords
multimodal emotion recognition; natural language processing; multimodal dataset; cross-modal attention; transformers
DOI
10.3390/s24185862
CLC (Chinese Library Classification) number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). This dataset, comprising 1002 samples across audio, video, and text modalities, is a unique resource for emotion recognition studies in the Bangla language. It features emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we developed a framework for audio, visual, and textual emotion recognition (AVaTER) that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach was demonstrated by achieving an F1-score of 0.64, a significant improvement over unimodal methods.
Pages: 23
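
Illustrative note: the abstract describes a cross-modal attention mechanism that lets unimodal features attend to one another before fusion. This record does not include the implementation, so the following is a minimal, hypothetical PyTorch sketch of pairwise cross-modal attention fusion. The feature dimension, head count, pairwise attention layout, mean pooling, and four-class output are illustrative assumptions, not AVaTER's actual architecture.

import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One cross-modal block: the target modality queries a source modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        # target, source: (batch, seq_len, dim) unimodal feature sequences;
        # sequence lengths may differ between modalities.
        attended, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + attended)  # residual connection + layer norm


class CrossModalFusionClassifier(nn.Module):
    """Hypothetical fusion head: enriches each modality with attention over
    the other two, then mean-pools and classifies. Four emotion classes
    (anger, fear, joy, sadness) are assumed from the abstract."""

    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        names = ["t_a", "t_v", "a_t", "a_v", "v_t", "v_a"]
        self.pairs = nn.ModuleDict({n: CrossModalAttention(dim) for n in names})
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, text, audio, video):
        # Inputs are pre-extracted unimodal features, each (batch, seq, dim).
        t = self.pairs["t_a"](text, audio) + self.pairs["t_v"](text, video)
        a = self.pairs["a_t"](audio, text) + self.pairs["a_v"](audio, video)
        v = self.pairs["v_t"](video, text) + self.pairs["v_a"](video, audio)
        pooled = torch.cat([t.mean(1), a.mean(1), v.mean(1)], dim=-1)
        return self.classifier(pooled)  # (batch, num_classes) emotion logits


if __name__ == "__main__":
    model = CrossModalFusionClassifier(dim=256)
    text = torch.randn(8, 32, 256)   # e.g., transformer token embeddings
    audio = torch.randn(8, 50, 256)  # e.g., projected audio frame features
    video = torch.randn(8, 16, 256)  # e.g., projected video frame features
    print(model(text, audio, video).shape)  # torch.Size([8, 4])

Equally plausible variants would share one attention block per direction, add feed-forward sublayers, or use a different pooling scheme; the sketch only shows the core idea of letting each modality's queries attend to the other modalities' keys and values.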