Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion

Times cited: 36
Authors
Xu, Mingke [1 ]
Zhang, Fan [2 ]
Khan, Samee U. [3 ]
Affiliations
[1] Nanjing Tech Univ, Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[2] IBM Massachusetts Lab, IBM Watson Grp, Littleton, MA USA
[3] North Dakota State Univ, Elect & Comp Eng, Fargo, ND USA
Funding
US National Science Foundation
Keywords
speech emotion recognition; convolutional neural network; attention mechanism; pattern recognition; machine learning; classification; model
DOI
10.1109/ccwc47524.2020.9031207
Chinese Library Classification
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it difficult to recognize, and current SER models still fall short of recognizing human emotions accurately. In this paper, we propose an attention method based on multi-head self-attention, which we call head fusion, to improve the recognition accuracy of SER. With this method, an attention layer can generate attention maps with multiple attention points, rather than conventional attention maps with a single attention point. We implemented an attention-based convolutional neural network (ACNN) model using this method and evaluated it on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on the improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
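
The abstract describes head fusion only at a high level. As a rough illustration, the PyTorch sketch below shows one way such a mechanism could work: each self-attention head computes its own attention map, and the per-head maps are fused (here, by simple averaging, which is an assumption) into a single map that carries multiple attention points at once. The class name HeadFusionAttention and all hyperparameters are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeadFusionAttention(nn.Module):
        """Self-attention whose per-head attention maps are fused into one map.
        A minimal sketch of one possible reading of "head fusion"; averaging
        over heads is an assumption, not the paper's confirmed fusion rule."""

        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, dim), e.g. frames of a log-Mel spectrogram
            b, t, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split queries/keys into heads: (batch, heads, seq_len, head_dim)
            q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            # One attention map per head: (batch, heads, seq_len, seq_len)
            scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
            maps = F.softmax(scores, dim=-1)
            # Head fusion: collapse the heads into a single attention map;
            # each head's peak survives, so the fused map has multiple
            # attention points instead of one.
            fused = maps.mean(dim=1)          # (batch, seq_len, seq_len)
            return fused @ v                  # (batch, seq_len, dim)

    # Usage: 8 utterances, 100 frames each, 64-dimensional features
    x = torch.randn(8, 100, 64)
    out = HeadFusionAttention(dim=64, num_heads=4)(x)

In a full ACNN along the lines the abstract describes, a module like this would sit on top of convolutional feature maps extracted from the spectrogram, before the emotion classifier.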
Pages: 1058-1064 (7 pages)
Related Papers (10 of 50 shown)
  • [1] Xu, Mingke; Zhang, Fan; Zhang, Wei. Head Fusion: Improving the Accuracy and Robustness of Speech Emotion Recognition on the IEMOCAP and RAVDESS Dataset. IEEE Access, 2021, 9: 74539-74549.
  • [2] Zhang, Junfeng; Xing, Lining; Tan, Zhen; Wang, Hongsen; Wang, Kesheng. Multi-head attention fusion networks for multi-modal speech emotion recognition. Computers & Industrial Engineering, 2022, 168.
  • [3] Ahn, Chung-Soo; Kasun, L. L. Chamara; Sivadas, Sunil; Rajapakse, Jagath C. Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. Interspeech 2022: 744-748.
  • [4] Nediyanchath, Anish; Paramasivam, Periyasamy; Yenigalla, Promod. Multi-Head Attention for Speech Emotion Recognition with Auxiliary Learning of Gender Recognition. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing: 7179-7183.
  • [5] Jian, Qijian; Xiang, Min; Huang, Wei. A speech emotion recognition method for the elderly based on feature fusion and attention mechanism. Third International Conference on Electronics and Communication; Network and Computer Technology (ECNCT 2021), 2022, 12167.
  • [6] Khan, Mustaqeem; Gueaieb, Wail; El Saddik, Abdulmotaleb; Kwon, Soonil. MSER: Multimodal speech emotion recognition using cross-attention with deep fusion. Expert Systems with Applications, 2024, 245.
  • [7] Tarantino, Lorenzo; Garner, Philip N.; Lazaridis, Alexandros. Self-attention for Speech Emotion Recognition. Interspeech 2019: 2578-2582.
  • [8] Chen, Shouyan; Zhang, Mingyan; Yang, Xiaofen; Zhao, Zhijia; Zou, Tao; Sun, Xinqi. The Impact of Attention Mechanisms on Speech Emotion Recognition. Sensors, 2021, 21(22).
  • [9] Scherer, Stefan; Schwenker, Friedhelm; Palm, Guenther. Classifier Fusion for Emotion Recognition from Speech. Advanced Intelligent Environments, 2009: 95-117.
  • [10] Ortego-Resa, Carlos; Lopez-Moreno, Ignacio; Ramos, Daniel; Gonzalez-Rodriguez, Joaquin. Anchor Model Fusion for Emotion Recognition in Speech. Biometric ID Management and Multimodal Communication, Proceedings, 2009, 5707: 49-56.