MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION

Cited by: 30
Authors
Sun, Licai [1 ,2 ]
Liu, Bin [2 ]
Tao, Jianhua [1 ,2 ,3 ]
Lian, Zheng [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; multimodal fusion; self-attention; cross-attention;
DOI
10.1109/ICASSP39728.2021.9414654
CLC Number
O42 [Acoustics];
Subject Classification Code
070206; 082403;
Abstract
Speech Emotion Recognition (SER) requires a thorough understanding of both the linguistic content of an utterance (i.e., textual information) and how the speaker utters it (i.e., acoustic information). One vital challenge in SER is how to effectively fuse these two kinds of information. In this paper, we propose a novel Multimodal Cross- and Self-Attention Network (MCSAN) to tackle this problem. The core of MCSAN is to employ parallel cross- and self-attention modules to explicitly model both inter- and intra-modal interactions between audio and text. Specifically, the cross-attention module utilizes the cross-attention mechanism to guide one modality to attend to the other and update the features accordingly. Similarly, the self-attention module employs the self-attention mechanism to propagate information within each modality. We evaluate MCSAN on two benchmark datasets, IEMOCAP and MELD. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on both datasets.
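To make the fusion idea concrete, below is a minimal PyTorch sketch of parallel cross- and self-attention fusion as described in the abstract. The module names (CrossSelfAttentionBlock, MCSANSketch), feature dimensions, number of heads and blocks, residual combination, and the mean-pooling plus concatenation before the classifier are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn


class CrossSelfAttentionBlock(nn.Module):
    """One fusion block: each modality attends to the other (cross-attention)
    and to itself (self-attention); the two views are added residually and normalized."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)

    def forward(self, audio, text):
        # Cross-attention: audio queries attend to text keys/values, and vice versa
        # (inter-modal interaction).
        a_cross, _ = self.cross_audio(audio, text, text)
        t_cross, _ = self.cross_text(text, audio, audio)
        # Self-attention: propagate information within each modality
        # (intra-modal interaction).
        a_self, _ = self.self_audio(audio, audio, audio)
        t_self, _ = self.self_text(text, text, text)
        # Combine inter- and intra-modal views with residual connections.
        audio = self.norm_audio(audio + a_cross + a_self)
        text = self.norm_text(text + t_cross + t_self)
        return audio, text


class MCSANSketch(nn.Module):
    """Stacks fusion blocks, pools each modality over time, and classifies the emotion."""

    def __init__(self, dim=256, num_heads=4, num_blocks=2, num_classes=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [CrossSelfAttentionBlock(dim, num_heads) for _ in range(num_blocks)]
        )
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, text):
        for block in self.blocks:
            audio, text = block(audio, text)
        # Mean-pool over frames/tokens and concatenate the two modalities.
        fused = torch.cat([audio.mean(dim=1), text.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    audio_feats = torch.randn(8, 120, 256)  # (batch, audio frames, feature dim)
    text_feats = torch.randn(8, 30, 256)    # (batch, word tokens, feature dim)
    logits = MCSANSketch()(audio_feats, text_feats)
    print(logits.shape)  # torch.Size([8, 4])

Using nn.MultiheadAttention for both paths keeps the two modules symmetric: they differ only in whether the keys and values come from the other modality or from the modality itself, which mirrors the inter- versus intra-modal distinction in the abstract.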
Pages: 4275-4279
Page count: 5
Related Papers
50 records in total
  • [41] Zhang, Yuzhe; Liu, Huan; Zhang, Dalin; Chen, Xuxu; Qin, Tao; Zheng, Qinghua. EEG-Based Emotion Recognition With Emotion Localization via Hierarchical Self-Attention. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14(03): 2458-2469
  • [42] Li, Dongdong; Yang, Zhuo; Liu, Jinlin; Yang, Hai; Wang, Zhe. Emotion embedding framework with emotional self-attention mechanism for speaker recognition. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [43] Zhao, Ziping; Li, Qifei; Zhang, Zixing; Cummins, Nicholas; Wang, Haishuai; Tao, Jianhua; Schuller, Bjoern W. Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition. NEURAL NETWORKS, 2021, 141: 52-60
  • [44] Yang, Jingda; Wang, Ying. Toward Auto-Modeling of Formal Verification for NextG Protocols: A Multimodal Cross- and Self-Attention Large Language Model Approach. IEEE ACCESS, 2024, 12: 27858-27869
  • [45] Zhang, Yuanyuan; Du, Jun; Wang, Zirui; Zhang, Jianshu; Tu, Yanhui. Attention Based Fully Convolutional Network for Speech Emotion Recognition. 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018: 1771-1775
  • [46] Hu, Ying; Hou, Shijing; Yang, Huamin; Huang, Hao; He, Liang. A Joint Network Based on Interactive Attention for Speech Emotion Recognition. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 1715-1720
  • [47] Li, Nan; Ge, Meng; Wang, Longbiao; Dang, Jianwu. A Fast Convolutional Self-attention Based Speech Dereverberation Method for Robust Speech Recognition. NEURAL INFORMATION PROCESSING (ICONIP 2019), PT III, 2019, 11955: 295-305
  • [48] Guan, Yulu; Cui, Hui; Xu, Yiyue; Jin, Qiangguo; Feng, Tian; Tu, Huawei; Xuan, Ping; Li, Wanlong; Wang, Linlin; Duh, Been-Lirn. Predicting Esophageal Fistula Risks Using a Multimodal Self-attention Network. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT V, 2021, 12905: 721-730
  • [49] Lee, Chan Woo; Song, Kyu Ye; Jeong, Jihoon; Choi, Woo Yong. Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data. FIRST GRAND CHALLENGE AND WORKSHOP ON HUMAN MULTIMODAL LANGUAGE (CHALLENGE-HML), 2018: 28-34
  • [50] Ge, Yiming; Liu, Hui; Du, Junzhao; Li, Zehua; Wei, Yuheng. Masked face recognition with convolutional visual self-attention network. NEUROCOMPUTING, 2023, 518: 496-506