Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition

Cited by: 2
Authors
Ahn, Chung-Soo [1 ]
Kasun, L. L. Chamara [2 ]
Sivadas, Sunil [3 ]
Rajapakse, Jagath C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, 50 Nanyang Ave, Singapore, Singapore
[2] Nanyang Technol Univ, Sch Elect & Elect Engn, 50 Nanyang Ave, Singapore, Singapore
[3] Natl Comp Syst, Singapore, Singapore
Keywords
computational paralinguistics; fusion of audio and text; human-computer interaction; multimodal fusion; speech emotion recognition;
DOI
10.21437/Interspeech.2022-888
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
To infer emotions accurately from speech, fusion of audio and text is essential, as words carry most of the information about semantics and emotions. The attention mechanism is an essential component of multimodal fusion architectures, as it dynamically pairs different regions within multimodal sequences. However, existing architectures lack an explicit structure to model the dynamics between fused representations. We therefore propose recurrent multi-head attention in a fusion architecture, which selects salient fused representations and learns the dynamics between them. Multiple 2D attention layers select salient pairs among all possible pairs of audio and text representations, which are combined with a fusion operation. Finally, the fused representations are fed into a recurrent unit to learn the dynamics between them. Our method outperforms existing approaches for fusing audio and text for speech emotion recognition and achieves state-of-the-art accuracies on the benchmark IEMOCAP dataset.
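The architecture described in the abstract can be sketched numerically: each 2D attention head scores every (audio frame, text token) pair, fuses the pairs, takes an attention-weighted sum, and the head outputs are then passed through a recurrent unit. The sketch below is a minimal NumPy illustration of that idea only; the dimensions, the bilinear pair scoring, the additive fusion operation, and the plain tanh recurrence are all assumptions for illustration, not the authors' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # shared feature dimension (assumed)
Ta, Tt = 5, 7  # audio / text sequence lengths (assumed)
n_heads = 4    # number of 2D attention heads (assumed)

A = rng.standard_normal((Ta, d))   # audio frame embeddings
T = rng.standard_normal((Tt, d))   # text token embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Per-head bilinear pair-scoring matrices (random stand-ins for learned weights).
W = rng.standard_normal((n_heads, d, d)) * 0.1

fused_seq = []
for h in range(n_heads):
    # 2D attention: score every (audio frame, text token) pair.
    scores = A @ W[h] @ T.T                        # (Ta, Tt)
    alpha = softmax(scores.ravel()).reshape(Ta, Tt)
    # Fuse each pair (additive fusion as a stand-in), then weight by attention.
    pair_fusion = A[:, None, :] + T[None, :, :]    # (Ta, Tt, d)
    fused = (alpha[..., None] * pair_fusion).sum(axis=(0, 1))  # (d,)
    fused_seq.append(fused)

# Recurrent unit over the per-head fused representations (plain tanh RNN
# as a stand-in for whatever recurrence the paper uses).
Wx = rng.standard_normal((d, d)) * 0.1
Wh = rng.standard_normal((d, d)) * 0.1
hstate = np.zeros(d)
for x in fused_seq:
    hstate = np.tanh(x @ Wx + hstate @ Wh)

print(hstate.shape)  # final fused representation for emotion classification
```

A classifier head over `hstate` would then predict the emotion label; with learned weights, the attention maps `alpha` indicate which audio-text pairs each head found salient.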
Pages: 744-748
Page count: 5
Related papers
50 records in total
  • [1] Multi-head attention fusion networks for multi-modal speech emotion recognition
    Zhang, Junfeng
    Xing, Lining
    Tan, Zhen
    Wang, Hongsen
    Wang, Kesheng
    [J]. COMPUTERS & INDUSTRIAL ENGINEERING, 2022, 168
  • [2] Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network
    Ngoc-Huynh Ho
    Yang, Hyung-Jeong
    Kim, Soo-Hyung
    Lee, Gueesang
    [J]. IEEE ACCESS, 2020, 8 : 61672 - 61686
  • [3] DILATED RESIDUAL NETWORK WITH MULTI-HEAD SELF-ATTENTION FOR SPEECH EMOTION RECOGNITION
    Li, Runnan
    Wu, Zhiyong
    Jia, Jia
    Zhao, Sheng
    Meng, Helen
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6675 - 6679
  • [4] MULTI-HEAD ATTENTION FOR SPEECH EMOTION RECOGNITION WITH AUXILIARY LEARNING OF GENDER RECOGNITION
    Nediyanchath, Anish
    Paramasivam, Periyasamy
    Yenigalla, Promod
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7179 - 7183
  • [5] Using Recurrent Neural Network Structure and Multi-Head Attention with Convolution for Fraudulent Phone Text Recognition
    Zhou, Junjie
    Xu, Hongkui
    Zhang, Zifeng
    Lu, Jiangkun
    Guo, Wentao
    Li, Zhenye
    [J]. Computer Systems Science and Engineering, 2023, 46 (02): : 2277 - 2297
  • [6] EEG-Based Emotion Recognition Using Convolutional Recurrent Neural Network with Multi-Head Self-Attention
    Hu, Zhangfang
    Chen, Libujie
    Luo, Yuan
    Zhou, Jingfan
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (21):
  • [7] Hybrid neural network model based on multi-head attention for English text emotion analysis
    Li, Ping
    [J]. EAI ENDORSED TRANSACTIONS ON SCALABLE INFORMATION SYSTEMS, 2022, 9 (35)
  • [8] Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion
    Xu, Mingke
    Zhang, Fan
    Khan, Samee U.
    [J]. 2020 10TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC), 2020, : 1058 - 1064
  • [9] Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
    Xu, Xinmeng
    Wang, Yang
    Jia, Jie
    Chen, Binbin
    Li, Dejun
    [J]. INTERSPEECH 2022, 2022, : 971 - 975
  • [10] A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech emotion recognition
    Tang, Yuwu
    Hu, Ying
    He, Liang
    Huang, Hao
    [J]. SPEECH COMMUNICATION, 2022, 143 : 21 - 32