A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition

Cited by: 1
Authors:
Hu, Dongni [1 ,2 ]
Chen, Chengxin [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Li, Junfeng [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Zhao, Qingwei [1 ]
Institutions:
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100864, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100864, Peoples R China
Keywords:
speech emotion recognition; multi-modal fusion; attention mechanism; end-to-end;
DOI: 10.1587/transinf.2021EDL8002
CLC number: TP [Automation Technology, Computer Technology]
Subject classification code: 0812
Abstract:
Recently, automated recognition and analysis of human emotion have attracted increasing attention from multidisciplinary communities. However, it is challenging to simultaneously utilize emotional information from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
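The letter does not reproduce the MAF equations here, but the abstract's idea of stackable blocks that model intra-modality interaction (stage one) and inter-modality interaction (stage two) can be illustrated with a minimal sketch. The block structure, residual connections, and pooling below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query attends over the keys.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def maf_block(audio, text):
    # Stage 1: intra-modality interaction (self-attention within each modality).
    a = attention(audio, audio, audio)
    t = attention(text, text, text)
    # Stage 2: inter-modality interaction (cross-attention between modalities).
    a2 = attention(a, t, t)   # audio frames attend to text tokens
    t2 = attention(t, a, a)   # text tokens attend to audio frames
    # Residual connections keep shapes unchanged, so blocks can be stacked.
    return audio + a2, text + t2

rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))  # 10 audio frames, feature dim 16
text = rng.standard_normal((8, 16))    # 8 text tokens, feature dim 16

for _ in range(3):                     # stacking blocks, as in the letter
    audio, text = maf_block(audio, text)

# Pool each modality and concatenate into an utterance-level vector.
fused = np.concatenate([audio.mean(axis=0), text.mean(axis=0)])
print(fused.shape)  # (32,)
```

Because each block preserves the per-modality shapes, the depth of the stack is a free hyperparameter, which matches the abstract's observation that performance improves as more MAF blocks are stacked.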
Pages: 1391-1394
Page count: 4
Related Papers (50 records in total)
  • [21] Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition
    Yang, Dingkang
    Huang, Shuai
    Liu, Yang
    Zhang, Lihua
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2093 - 2097
  • [22] Multi-modal fusion network with complementarity and importance for emotion recognition
    Liu, Shuai
    Gao, Peng
    Li, Yating
    Fu, Weina
    Ding, Weiping
    INFORMATION SCIENCES, 2023, 619 : 679 - 694
  • [23] Research on Emotion Classification Based on Multi-modal Fusion
    Xiang, Zhihua
    Radzi, Nor Haizan Mohamed
    Hashim, Haslina
    BAGHDAD SCIENCE JOURNAL, 2024, 21 (02) : 548 - 560
  • [24] Multi-Modal Emotion Recognition From Speech and Facial Expression Based on Deep Learning
    Cai, Linqin
    Dong, Jiangong
    Wei, Min
    2020 CHINESE AUTOMATION CONGRESS (CAC 2020), 2020, : 5726 - 5729
  • [25] Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition
    Nie, Weizhi
    Yan, Yan
    Song, Dan
    Wang, Kun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (11) : 16205 - 16214
  • [26] Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition
    Weizhi Nie
    Yan Yan
    Dan Song
    Kun Wang
    Multimedia Tools and Applications, 2021, 80 : 16205 - 16214
  • [27] Multi-modal Emotion Recognition using Speech Features and Text Embedding
    Kim J.-H.
    Lee S.-P.
    Transactions of the Korean Institute of Electrical Engineers, 2021, 70 (01): 108 - 113
  • [28] A novel signal channel attention network for multi-modal emotion recognition
    Du, Ziang
    Ye, Xia
    Zhao, Pujie
    FRONTIERS IN NEUROROBOTICS, 2024, 18
  • [29] IS CROSS-ATTENTION PREFERABLE TO SELF-ATTENTION FOR MULTI-MODAL EMOTION RECOGNITION?
    Rajan, Vandana
    Brutti, Alessio
    Cavallaro, Andrea
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4693 - 4697
  • [30] Lightweight multi-modal emotion recognition model based on modal generation
    Liu, Peisong
    Che, Manqiang
    Luo, Jiangchuan
    2022 9TH INTERNATIONAL FORUM ON ELECTRICAL ENGINEERING AND AUTOMATION, IFEEA, 2022, : 430 - 435