Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning

Cited by: 28
Authors
Le, Hoai-Duy [1 ]
Lee, Guee-Sang [1 ]
Kim, Soo-Hyung [1 ]
Kim, Seungwon [1 ]
Yang, Hyung-Jeong [1 ]
Affiliations
[1] Chonnam Natl Univ, Dept Artificial Intelligence Convergence, Gwangju 61186, South Korea
Keywords
Transformers; Emotion recognition; Feature extraction; Task analysis; Visualization; Deep learning; Sentiment analysis; Multimodal fusion; multi-label video emotion recognition; transformers;
DOI
10.1109/ACCESS.2023.3244390
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline Code: 0812
Abstract
Emotion recognition has been an active research area for a long time. Recently, multimodal emotion recognition from video data has grown in importance with the explosion of video content due to the emergence of short video social media platforms. Effectively incorporating information from multiple modalities in video data to learn robust multimodal representation for improving recognition model performance is still the primary challenge for researchers. In this context, transformer architectures have been widely used and have significantly improved multimodal deep learning and representation learning. Inspired by this, we propose a transformer-based fusion and representation learning method to fuse and enrich multimodal features from raw videos for the task of multi-label video emotion recognition. Specifically, our method takes raw video frames, audio signals, and text subtitles as inputs and passes information from these multiple modalities through a unified transformer architecture for learning a joint multimodal representation. Moreover, we use the label-level representation approach to deal with the multi-label classification task and enhance the model performance. We conduct experiments on two benchmark datasets: Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) to evaluate our proposed method. The experimental results demonstrate that the proposed method outperforms other strong baselines and existing approaches for multi-label video emotion recognition.
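The fusion strategy described in the abstract (modality tokens mixed by a unified transformer, then one representation per emotion label for multi-label prediction) can be illustrated with a minimal NumPy sketch. All shapes, names, and the single-layer attention are hypothetical simplifications for illustration; the paper's actual model is a trained deep transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding size (hypothetical)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Per-modality token sequences, assumed already projected to d dims
video = rng.normal(size=(8, d))   # 8 frame tokens
audio = rng.normal(size=(6, d))   # 6 audio-signal tokens
text  = rng.normal(size=(5, d))   # 5 subtitle tokens

# Unified transformer-style fusion: concatenate modality tokens and
# let self-attention exchange information across modalities
tokens = np.concatenate([video, audio, text], axis=0)
fused = tokens + attention(tokens, tokens, tokens)  # residual self-attention

# Label-level representation: one query per emotion attends over the
# fused multimodal tokens, yielding an emotion-specific embedding
labels = ["happy", "sad", "angry", "neutral"]
label_queries = rng.normal(size=(len(labels), d))   # learnable in practice
label_repr = attention(label_queries, fused, fused)  # shape (4, d)

# Multi-label head: an independent sigmoid score per emotion,
# so several emotions can be active at once
w = rng.normal(size=(d,))
logits = label_repr @ w
probs = 1 / (1 + np.exp(-logits))
preds = {lab: bool(p > 0.5) for lab, p in zip(labels, probs)}
print(preds)
```

The key design point is that multi-label classification is handled by per-label representations and independent binary scores, rather than a single softmax over mutually exclusive classes.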
Pages: 14742 - 14751
Page count: 10
Related Papers
(50 in total)
  • [1] Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition
    Wu, Yujin
    Daoudi, Mohamed
    Amad, Ali
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (01) : 157 - 172
  • [2] Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion
    Siriwardhana, Shamane
    Kaluarachchi, Tharindu
    Billinghurst, Mark
    Nanayakkara, Suranga
    IEEE ACCESS, 2020, 8 (08) : 176274 - 176285
  • [3] Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection
    Ju, Xincheng
    Zhang, Dong
    Li, Junhui
    Zhou, Guodong
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 512 - 520
  • [4] Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
    Xie, Baijun
    Sidulova, Mariia
    Park, Chung Hyuk
    SENSORS, 2021, 21 (14)
  • [5] TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition
    Zhao, Zhengdao
    Wang, Yuhua
    Shen, Guang
    Xu, Yuezhu
    Zhang, Jiayuan
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3771 - 3782
  • [6] Speech Emotion Recognition based on Multi-Label Emotion Existence Model
    Ando, Atsushi
    Masumura, Ryo
    Kamiyama, Hosana
    Kobashikawa, Satoshi
    Aono, Yushi
    INTERSPEECH 2019, 2019, : 2818 - 2822
  • [7] MULTIMODAL TRANSFORMER FUSION FOR CONTINUOUS EMOTION RECOGNITION
    Huang, Jian
    Tao, Jianhua
    Liu, Bin
    Lian, Zheng
    Niu, Mingyue
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3507 - 3511
  • [8] Multimodal Transformer Fusion for Emotion Recognition: A Survey
    Belaref, Amdjed
    Seguier, Renaud
    2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024, 2024, : 107 - 113
  • [9] MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion
    Khan, Mustaqeem
    Tran, Phuong-Nam
    Pham, Nhat Truong
    El Saddik, Abdulmotaleb
    Othmani, Alice
    SCIENTIFIC REPORTS, 2025, 15 (01)
  • [10] Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition
    Deng, James J.
    Leung, Clement H. C.
    BRAIN INFORMATICS, BI 2021, 2021, 12960 : 179 - 188