TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Cited: 0
Authors
Zhao, Zhengdao [1 ]
Wang, Yuhua [1 ]
Shen, Guang [1 ]
Xu, Yuezhu [1 ]
Zhang, Jiayuan [2 ]
Affiliations
[1] Harbin Engineering University, High Performance Computing Research Center, Harbin 150001, People's Republic of China
[2] Harbin Engineering University, High Performance Computing Laboratory, Harbin 150001, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Emotion recognition; Feature extraction; Transformers; Correlation; Data models; Speech recognition; Computer architecture; Deep-scale fusion transformer; multimodal embedding; multimodal emotion recognition; mutual correlation; mutual transformer
DOI
10.1109/TASLP.2023.3316458
CLC classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
As deep learning research continues to progress, artificial intelligence is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state of speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition that addresses these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate data scarcity, providing the model with prior knowledge of multimodal information drawn from large amounts of unlabeled data. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonalities and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To comparatively evaluate the performance of TDFNet, experiments are conducted on the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA under the RA data split, improvements of 1.78% WA and 1.17% UA, respectively, over the previous state-of-the-art method. Benefiting from attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
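The abstract names three components: pretrained multimodal embeddings (ME), bidirectional cross-modal attention (MT), and shared-weight GRUs over multiple temporal scales (DST). The PyTorch sketch below is a minimal illustration of the latter two ideas only, not the authors' implementation; all module names, dimensions, scale factors, and the pooling scheme are assumptions made for this example.

```python
# Illustrative sketch only (not the TDFNet code): (1) a "mutual transformer"
# block in which audio and text streams cross-attend to each other, and
# (2) multiscale feature extraction with a single GRU shared across scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualTransformerBlock(nn.Module):
    """Cross-attends two modality streams in both directions."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, audio, text):
        # Audio queries attend over text and vice versa; residual + norm.
        a, _ = self.a2t(audio, text, text)
        t, _ = self.t2a(text, audio, audio)
        return self.norm_a(audio + a), self.norm_t(text + t)


class SharedGRUMultiScale(nn.Module):
    """One GRU applied to several temporally pooled views of a sequence."""

    def __init__(self, dim: int = 256, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.gru = nn.GRU(dim, dim, batch_first=True)  # shared across scales

    def forward(self, x):  # x: (batch, time, dim)
        feats = []
        for s in self.scales:
            # Downsample the time axis by average pooling with stride s.
            xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
            _, h = self.gru(xs.transpose(1, 2))  # final hidden state per scale
            feats.append(h.squeeze(0))
        return torch.cat(feats, dim=-1)  # (batch, dim * len(scales))


if __name__ == "__main__":
    audio = torch.randn(2, 100, 256)  # stand-in for pretrained audio embeddings
    text = torch.randn(2, 40, 256)    # stand-in for pretrained text embeddings
    audio, text = MutualTransformerBlock()(audio, text)
    fused = SharedGRUMultiScale()(torch.cat([audio, text], dim=1))
    print(fused.shape)  # torch.Size([2, 768])
```

On the reported metrics: in the speech emotion recognition literature, WA (weighted accuracy) conventionally denotes overall utterance-level accuracy, while UA (unweighted accuracy) denotes the mean of per-class recalls, which makes UA insensitive to the class imbalance present in IEMOCAP.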
Pages: 3771-3782
Page count: 12
Related Papers
50 records in total (10 shown below)
  • [1] Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion
    Siriwardhana, Shamane
    Kaluarachchi, Tharindu
    Billinghurst, Mark
    Nanayakkara, Suranga
    [J]. IEEE ACCESS, 2020, 8: 176274-176285
  • [2] Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
    Xie, Baijun
    Sidulova, Mariia
    Park, Chung Hyuk
    [J]. SENSORS, 2021, 21 (14)
  • [3] Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning
    Le, Hoai-Duy
    Lee, Guee-Sang
    Kim, Soo-Hyung
    Kim, Seungwon
    Yang, Hyung-Jeong
    [J]. IEEE ACCESS, 2023, 11: 14742-14751
  • [4] Multimodal Transformer Fusion for Continuous Emotion Recognition
    Huang, Jian
    Tao, Jianhua
    Liu, Bin
    Lian, Zheng
    Niu, Mingyue
    [C]. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 3507-3511
  • [5] A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations
    Ma, Hui
    Wang, Jian
    Lin, Hongfei
    Zhang, Bo
    Zhang, Yijia
    Xu, Bo
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 776-788
  • [6] An emotion-driven, transformer-based network for multimodal fake news detection
    Yadav, Ashima
    Gupta, Anika
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (01)
  • [7] Multimodal transformer augmented fusion for speech emotion recognition
    Wang, Yuanyuan
    Gu, Yu
    Yin, Yifei
    Han, Yingping
    Zhang, He
    Wang, Shuang
    Li, Chenyu
    Quan, Dou
    [J]. FRONTIERS IN NEUROROBOTICS, 2023, 17
  • [8] Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition
    Wu, Yujin
    Daoudi, Mohamed
    Amad, Ali
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (01): 157-172
  • [9] Transformer-based ensemble deep learning model for EEG-based emotion recognition
    Si, Xiaopeng
    Huang, Dong
    Sun, Yulin
    Huang, Shudi
    Huang, He
    Ming, Dong
    [J]. BRAIN SCIENCE ADVANCES, 2023, 9 (03): 210-223