Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion

Cited by: 36
Authors
Xu, Mingke [1 ]
Zhang, Fan [2 ]
Khan, Samee U. [3 ]
Affiliations
[1] Nanjing Tech Univ, Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[2] IBM Massachusetts Lab, IBM Watson Grp, Littleton, MA USA
[3] North Dakota State Univ, Elect & Comp Eng, Fargo, ND USA
Funding
U.S. National Science Foundation
Keywords
speech emotion recognition; convolutional neural network; attention mechanism; pattern recognition; machine learning; CLASSIFICATION; MODEL
DOI
10.1109/ccwc47524.2020.9031207
Chinese Library Classification
TP301 [Theory and methods]
Discipline code
081202
Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize the emotions of a speaker from his or her speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it hard to recognize, and current SER models still do not accurately recognize human emotions. In this paper, we propose a multi-head self-attention based method, which we call head fusion, to improve the recognition accuracy of SER. With this method, an attention layer can generate attention maps with multiple attention points instead of the common attention maps, each of which has only a single attention point. We implemented an attention-based convolutional neural network (ACNN) model with this method and conducted experiments and evaluations on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on the improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
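The head-fusion idea described above can be illustrated with a toy NumPy sketch: several self-attention heads each produce their own attention map, and the per-head maps are fused (here, averaged) into a single map that can carry multiple attention points. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the random projection matrices stand in for learned weights, and the fusion-by-mean choice is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_fusion_attention(x, num_heads=4, seed=0):
    """Toy multi-head self-attention whose per-head attention maps are
    fused into one map that can hold multiple attention points.
    x: (seq_len, d_model) feature matrix; returns (fused_map, output)."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    maps = []
    for _ in range(num_heads):
        # Random query/key projections stand in for learned weights.
        Wq = rng.standard_normal((d_model, d_head))
        Wk = rng.standard_normal((d_model, d_head))
        q, k = x @ Wq, x @ Wk
        # Scaled dot-product attention map for this head: (seq_len, seq_len).
        maps.append(softmax(q @ k.T / np.sqrt(d_head)))
    # Head fusion (assumed here as a mean over heads): one map, many peaks.
    fused = np.mean(maps, axis=0)
    return fused, fused @ x  # attend over the input with the fused map
```

Because each row of every per-head map is a probability distribution, the fused map's rows still sum to one, but a row can now peak at several positions at once, which is the property the abstract attributes to head fusion.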
Pages: 1058-1064 (7 pages)
Related papers
(50 records total)
  • [41] EFFECTIVE ATTENTION MECHANISM IN DYNAMIC MODELS FOR SPEECH EMOTION RECOGNITION
    Hsiao, Po-Wei
    Chen, Chia-Ping
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 2526 - 2530
  • [42] Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition
    Gao, Miao
    Yang, Chun
    Zhou, Fang
    Yin, Xu-cheng
    [J]. INTERSPEECH 2019, 2019, : 3930 - 3934
  • [44] Spatiotemporal and frequential cascaded attention networks for speech emotion recognition
    Li, Shuzhen
    Xing, Xiaofen
    Fan, Weiquan
    Cai, Bolun
    Fordson, Perry
    Xu, Xiangmin
    [J]. NEUROCOMPUTING, 2021, 448 : 238 - 248
  • [45] Speech Emotion Recognition using XGBoost and CNN BLSTM with Attention
    He, Jingru
    Ren, Liyong
    [J]. 2021 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, INTERNET OF PEOPLE, AND SMART CITY INNOVATIONS (SMARTWORLD/SCALCOM/UIC/ATC/IOP/SCI 2021), 2021, : 154 - 159
  • [46] Fusion-ConvBERT: Parallel Convolution and BERT Fusion for Speech Emotion Recognition
    Lee, Sanghyun
    Han, David K.
    Ko, Hanseok
    [J]. SENSORS, 2020, 20 (22) : 1 - 19
  • [47] Multi-level attention fusion network assisted by relative entropy alignment for multimodal speech emotion recognition
    Lei, Jianjun
    Wang, Jing
    Wang, Ying
    [J]. APPLIED INTELLIGENCE, 2024, 54 (17-18) : 8478 - 8490
  • [48] Emotion Recognition from Speech by Combining Databases and Fusion of Classifiers
    Lefter, Iulia
    Rothkrantz, Leon J. M.
    Wiggers, Pascal
    van Leeuwen, David A.
    [J]. TEXT, SPEECH AND DIALOGUE, 2010, 6231 : 353 - +
  • [49] A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition
    Hu, Dongni
    Chen, Chengxin
    Zhang, Pengyuan
    Li, Junfeng
    Yan, Yonghong
    Zhao, Qingwei
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (08) : 1391 - 1394
  • [50] Deep Feature Extraction and Attention Fusion for Multimodal Emotion Recognition
    Yang, Zhiyi
    Li, Dahua
    Hou, Fazheng
    Song, Yu
    Gao, Qiang
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2024, 71 (03) : 1526 - 1530