MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION

Cited by: 30
Authors
Sun, Licai [1 ,2 ]
Liu, Bin [2 ]
Tao, Jianhua [1 ,2 ,3 ]
Lian, Zheng [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; multimodal fusion; self-attention; cross-attention;
DOI
10.1109/ICASSP39728.2021.9414654
CLC number
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Speech Emotion Recognition (SER) requires a thorough understanding of both the linguistic content of an utterance (i.e., textual information) and how the speaker utters it (i.e., acoustic information). One vital challenge in SER is how to effectively fuse these two kinds of information. In this paper, we propose a novel Multimodal Cross- and Self-Attention Network (MCSAN) to tackle this problem. The core of MCSAN is to employ parallel cross- and self-attention modules to explicitly model both inter- and intra-modal interactions between audio and text. Specifically, the cross-attention module uses the cross-attention mechanism to guide one modality to attend to the other and update its features accordingly. Similarly, the self-attention module employs the self-attention mechanism to propagate information within each modality. We evaluate MCSAN on two benchmark datasets, IEMOCAP and MELD. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on both datasets.
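To make the inter- vs. intra-modal distinction concrete, the following is a minimal pure-Python sketch of scaled dot-product attention applied in both roles: cross-attention (audio frames query text tokens) and self-attention (text tokens query themselves). This is an illustration of the mechanism only, not the paper's architecture; the toy feature vectors and the absence of learned query/key/value projections are simplifying assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query yields a weighted
    average of the value vectors, weighted by query-key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy per-frame audio features and per-token text features (d = 2).
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 1.0], [0.5, -0.5], [0.0, 1.0]]

# Cross-attention (inter-modal): audio frames attend to text tokens,
# so each audio frame is updated with text information.
audio_from_text = attention(audio, text, text)

# Self-attention (intra-modal): information propagates within text.
text_refined = attention(text, text, text)
```

In MCSAN these two operations run in parallel per modality; here the point is simply that the same attention primitive models inter-modal interactions when queries and keys come from different modalities, and intra-modal interactions when they come from the same one.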
Pages: 4275-4279 (5 pages)
Related papers (50 total)
  • [1] Self-attention for Speech Emotion Recognition
    Tarantino, Lorenzo
    Garner, Philip N.
    Lazaridis, Alexandros
    [J]. INTERSPEECH 2019, 2019, : 2578 - 2582
  • [2] Self-attention transfer networks for speech emotion recognition
    Zhao, Ziping
    Wang, Keru
    Bao, Zhongtian
    Zhang, Zixing
    Cummins, Nicholas
    Sun, Shihuang
    Wang, Haishuai
    Tao, Jianhua
    Schuller, Björn W.
    [J]. VIRTUAL REALITY & INTELLIGENT HARDWARE, 2021, 3 (01) : 43 - 54
  • [3] BAT: Block and token self-attention for speech emotion recognition
    Lei, Jianjun
    Zhu, Xiangwei
    Wang, Ying
    [J]. Neural Networks, 2022, 156 : 67 - 80
  • [4] Multimodal cooperative self-attention network for action recognition
    Zhong, Zhuokun
    Hou, Zhenjie
    Liang, Jiuzhen
    Lin, En
    Shi, Haiyong
    [J]. IET IMAGE PROCESSING, 2023, 17 (06) : 1775 - 1783
  • [6] DILATED RESIDUAL NETWORK WITH MULTI-HEAD SELF-ATTENTION FOR SPEECH EMOTION RECOGNITION
    Li, Runnan
    Wu, Zhiyong
    Jia, Jia
    Zhao, Sheng
    Meng, Helen
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6675 - 6679
  • [7] Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition
    Li, Chao
    Jiao, Jinlong
    Zhao, Yiqin
    Zhao, Ziping
    [J]. 2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2019, : 105 - 109
  • [8] Speech emotion recognition using recurrent neural networks with directional self-attention
    Li, Dongdong
    Liu, Jinlin
    Yang, Zhuo
    Sun, Linyu
    Wang, Zhe
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2021, 173
  • [9] IS CROSS-ATTENTION PREFERABLE TO SELF-ATTENTION FOR MULTI-MODAL EMOTION RECOGNITION?
    Rajan, Vandana
    Brutti, Alessio
    Cavallaro, Andrea
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4693 - 4697
  • [10] CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition
    Jiang, Dazhi
    Liu, Hao
    Wei, Runguo
    Tu, Geng
    [J]. COGNITIVE COMPUTATION, 2023, 15 (03) : 1082 - 1091