AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition

Cited by: 1
Authors
Das, Avishek [1 ]
Sarma, Moumita Sen [1 ]
Hoque, Mohammed Moshiul [1 ]
Siddique, Nazmul [2 ]
Dewan, M. Ali Akber [3 ]
Affiliations
[1] Chittagong Univ Engn & Technol, Dept Comp Sci & Engn, Chittagong 4349, Bangladesh
[2] Ulster Univ, Sch Comp Engn & Intelligent Syst, Belfast BT15 1AP, Northern Ireland
[3] Athabasca Univ, Fac Sci & Technol, Sch Comp & Informat Syst, Athabasca, AB T9S 3A3, Canada
Keywords
multimodal emotion recognition; natural language processing; multimodal dataset; cross-modal attention; transformers
DOI
10.3390/s24185862
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with unique characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we unveiled a pioneering multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). This dataset, comprising 1002 samples across audio, video, and text modalities, is a unique resource for emotion recognition studies in the Bangla language. It features emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we developed a framework for audio, video and textual emotion recognition (i.e., AVaTER) that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach was demonstrated by achieving an F1-score of 0.64, a significant improvement over unimodal methods.
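The record above describes a cross-modal attention mechanism that lets unimodal features attend to one another before fusion, but gives no implementation details. The following is a minimal, hypothetical PyTorch sketch of such a fusion block, not the authors' AVaTER code: the embedding dimension, number of heads, the particular query/key pairings, and the mean-pooling fusion are illustrative assumptions; only the four emotion classes (anger, fear, joy, sadness) come from the abstract.

    # Hypothetical sketch of cross-modal attention fusion (not the authors' released code).
    # Assumptions: unimodal encoders already produce fixed-size feature sequences;
    # dim=256, heads=4, and mean pooling are illustrative choices.
    import torch
    import torch.nn as nn

    class CrossModalAttentionFusion(nn.Module):
        def __init__(self, dim=256, heads=4, num_classes=4):
            super().__init__()
            # One attention block per direction: text attends to audio/video and vice versa.
            self.text_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.text_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.video_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.classifier = nn.Sequential(
                nn.Linear(dim * 4, dim), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(dim, num_classes),
            )

        def forward(self, text, audio, video):
            # Each input: (batch, seq_len, dim) sequence of unimodal features.
            t_a, _ = self.text_from_audio(query=text, key=audio, value=audio)
            t_v, _ = self.text_from_video(query=text, key=video, value=video)
            a_t, _ = self.audio_from_text(query=audio, key=text, value=text)
            v_t, _ = self.video_from_text(query=video, key=text, value=text)
            # Pool each attended sequence and concatenate before classification.
            fused = torch.cat(
                [t_a.mean(dim=1), t_v.mean(dim=1), a_t.mean(dim=1), v_t.mean(dim=1)],
                dim=-1,
            )
            return self.classifier(fused)

    if __name__ == "__main__":
        model = CrossModalAttentionFusion()
        text = torch.randn(2, 32, 256)   # e.g., transformer token embeddings
        audio = torch.randn(2, 50, 256)  # e.g., frame-level acoustic features
        video = torch.randn(2, 16, 256)  # e.g., per-frame visual features
        print(model(text, audio, video).shape)  # torch.Size([2, 4])

In a real pipeline the text, audio, and video tensors would come from pretrained unimodal encoders projected to a common dimension; the random tensors here only illustrate the expected shapes.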
Pages: 23
Related Papers
50 records in total
  • [1] Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    Alam, Jahangir
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2024, 18 (03) : 444 - 458
  • [2] Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond
    Li, Jiahong
    Li, Chenda
    Wu, Yifei
    Qian, Yanmin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1941 - 1953
  • [3] Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
    Takashima, Akihiko
    Masumura, Ryo
    Ando, Atsushi
    Yamazaki, Yoshihiro
    Uchida, Mihiro
    Orihashi, Shota
    INTERSPEECH 2022, 2022, : 4740 - 4744
  • [4] Emotion recognition using cross-modal attention from EEG and facial expression
    Cui, Rongxuan
    Chen, Wanzhong
    Li, Mingyang
    KNOWLEDGE-BASED SYSTEMS, 2024, 304
  • [5] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki, Y.
    Hayashi, M.
    Kaneko, N.
    Aoki, Y.
    Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03) : 263 - 268
  • [6] Mi-CGA: Cross-modal Graph Attention Network for robust emotion recognition in the presence of incomplete modalities
    Nguyen, Cam-Van Thi
    Kieu, Hai-Dang
    Ha, Quang-Thuy
    Phan, Xuan-Hieu
    Le, Duc-Trong
    NEUROCOMPUTING, 2025, 623
  • [7] CATNet: Cross-modal fusion for audio-visual speech recognition
    Wang, Xingmei
    Mi, Jiachen
    Li, Boquan
    Zhao, Yixu
    Meng, Jiaxiang
    PATTERN RECOGNITION LETTERS, 2024, 178 : 216 - 222
  • [8] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 2242 - 2246
  • [9] A CROSS-ATTENTION EMOTION RECOGNITION ALGORITHM BASED ON AUDIO AND VIDEO MODALITIES
    Wu, Xiao
    Mu, Xuan
    Qi, Wen
    Liu, Xiaorui
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 309 - 313
  • [10] Conversational Speech Recognition by Learning Audio-Textual Cross-Modal Contextual Representation
    Wei, Kun
    Li, Bei
    Lv, Hang
    Lu, Quan
    Jiang, Ning
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2432 - 2444