Multimodal Emotion Recognition Based on Hierarchical Fusion Strategy and Contextual Information Embedding

Cited by: 0
Authors
Sun M. [1]
Ouyang C. [1]
Liu Y. [1]
Ren L. [1]
Affiliations
[1] School of Computing, University of South China, Hengyang
Keywords
context information embedding; hierarchical fusion; noise interference
DOI
10.13209/j.0479-8023.2024.034
Abstract
Existing fusion strategies often simply concatenate modal features, disregarding the personalized fusion requirements that arise from the characteristics of each modality. In addition, judging the emotion of each utterance in isolation, without accounting for its emotional state within the surrounding context, can lead to errors in emotion recognition. To address these issues, this paper proposes a multimodal emotion recognition method based on a hierarchical fusion strategy and the embedding of contextual information. The method progressively integrates the different modal features in a hierarchical manner, reducing noise interference from individual modalities and resolving inconsistencies in expression across modalities. It further leverages contextual information to analyze the emotional representation of each utterance within its context, improving overall emotion recognition performance. In the binary emotion classification task, the proposed method improves accuracy by 1.54% over the state-of-the-art (SOTA) model; in the multi-class emotion recognition task, it improves the F1 score by 2.79% over the SOTA model. © 2024 Peking University. All rights reserved.
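As an illustration of the two ideas summarized in the abstract, staged (hierarchical) fusion of modality features and context-aware utterance encoding, the following is a minimal sketch. The module choices, dimensions, fusion order, and the gated-fusion and GRU components are assumptions made for illustration only and are not the authors' exact architecture.

```python
# Hypothetical sketch: hierarchical fusion of text/audio/visual features plus
# dialogue-level context modeling. All design details here are illustrative.
import torch
import torch.nn as nn


class HierarchicalFusion(nn.Module):
    """Fuse modalities in stages rather than by a single concatenation."""

    def __init__(self, d_text=768, d_audio=128, d_visual=512, d_model=256, n_classes=6):
        super().__init__()
        # Project each modality into a shared space first.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Stage 1: gated fusion of the (typically noisier) audio and visual streams.
        self.gate_av = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        # Stage 2: combine the audio-visual result with the text features.
        self.fuse_tav = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        # Context embedding: a bidirectional GRU across the utterances of a dialogue.
        self.context = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, text, audio, visual):
        # Each input: (batch, n_utterances, feature_dim).
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(visual)
        # Stage 1: the gate suppresses single-modality noise in the audio-visual mix.
        g = self.gate_av(torch.cat([a, v], dim=-1))
        av = g * a + (1 - g) * v
        # Stage 2: utterance-level representation from text plus audio-visual.
        utterance = self.fuse_tav(torch.cat([t, av], dim=-1))
        # Embed each utterance in its conversational context before classifying.
        ctx, _ = self.context(utterance)
        return self.classifier(ctx)  # (batch, n_utterances, n_classes)


if __name__ == "__main__":
    model = HierarchicalFusion()
    logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 128), torch.randn(2, 10, 512))
    print(logits.shape)  # torch.Size([2, 10, 6])
```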
Pages: 393-402
Page count: 9