Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition

Cited by: 75
Authors
Zhao, Ziping [1, 3]
Bao, Zhongtian [1 ]
Zhao, Yiqin [1 ]
Zhang, Zixing [2 ]
Cummins, Nicholas [3 ]
Ren, Zhao [3 ]
Schuller, Bjorn [2, 3, 4]
Affiliations
[1] Tianjin Normal Univ, Coll Comp & Informat Engn, Tianjin 300387, Peoples R China
[2] Imperial Coll London, GLAM, London SW7 2AZ, England
[3] Univ Augsburg, ZDB Chair Embedded Intelligence Hlth Care & Wellb, D-86159 Augsburg, Germany
[4] Tianjin Normal Univ, Int Res Ctr Affect Intelligence, Tianjin 300387, Peoples R China
Funding
EU Horizon 2020; National Natural Science Foundation of China
Keywords
Speech emotion recognition; bidirectional long short-term memory; fully convolutional networks; attention mechanism; spectrogram representation;
DOI
10.1109/ACCESS.2019.2928625
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The automatic detection of an emotional state from human speech, which plays a crucial role in human-machine interaction, has consistently been shown to be a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and highly engineered features. Results from these works have demonstrated the importance of discriminative spatio-temporal features for modelling the continual evolution of different emotions. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition (SER). How machine learning algorithms learn effective compositional spatio-temporal dynamics for SER has been a fundamental problem of deep representations, herein denoted as deep spectrum representations. In this paper, we develop a model to address this problem by leveraging a parallel combination of attention-based bidirectional long short-term memory (BLSTM) recurrent neural networks with attention-based fully convolutional networks (FCN). Extensive experiments were undertaken on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and the FAU Aibo Emotion Corpus (FAU-AEC) to highlight the effectiveness of our approach. The experimental results indicate that deep spectrum representations extracted from the proposed model are well suited to the task of SER, achieving a weighted accuracy (WA) of 68.1% and an unweighted accuracy (UA) of 67.0% on IEMOCAP, and a UA of 45.4% on FAU-AEC. Key results indicate that the extracted deep representations combined with a linear support vector classifier are comparable in performance with eGeMAPS and ComParE, two standard acoustic feature representations.
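The abstract reports results as WA and UA without defining them. The sketch below assumes the usual speech emotion recognition convention: WA is overall accuracy across all utterances, and UA is the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    totals = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (hypothetical labels): 6 neutral, 2 angry, 2 sad.
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two metrics can diverge sharply, which is presumably why only UA is quoted for FAU-AEC: a classifier that favours the majority class scores well on WA but poorly on UA.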
Pages: 97515-97525
Number of pages: 11
Related papers
50 in total
  • [1] Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition
    Pham, Nhat Truong
    Dang, Duc Ngoc Minh
    Nguyen, Ngoc Duy
    Nguyen, Thanh Thi
    Nguyen, Hai
    Manavalan, Balachandran
    Lim, Chee Peng
    Nguyen, Sy Dzung
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 230
  • [2] Multiple attention convolutional-recurrent neural networks for speech emotion recognition
    Zhang, Zhihao
    Wang, Kunxia
    [J]. 2022 10TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2022,
  • [3] Speech Emotion Recognition via Generation using an Attention-based Variational Recurrent Neural Network
    Baruah, Murchana
    Banerjee, Bonny
    [J]. INTERSPEECH 2022, 2022, : 4710 - 4714
  • [4] An Attention-Based Convolutional Recurrent Neural Networks for Scene Text Recognition
    Alshawi, Adil Abdullah Abdulhussein
    Tanha, Jafar
    Balafar, Mohammad Ali
    [J]. IEEE ACCESS, 2024, 12 : 8123 - 8134
  • [5] DEEP CONVOLUTIONAL RECURRENT NEURAL NETWORK WITH ATTENTION MECHANISM FOR ROBUST SPEECH EMOTION RECOGNITION
    Huang, Che-Wei
    Narayanan, Shrikanth
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 583 - 588
  • [6] COMPACT CONVOLUTIONAL RECURRENT NEURAL NETWORKS VIA BINARIZATION FOR SPEECH EMOTION RECOGNITION
    Zhao, Huan
    Xiao, Yufeng
    Han, Jing
    Zhang, Zixing
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6690 - 6694
  • [7] Speech emotion recognition using wavelet packet reconstruction with attention-based deep recurrent neutral networks
    Meng, Hao
    Yan, Tianhao
    Wei, Hongwei
    Ji, Xun
    [J]. BULLETIN OF THE POLISH ACADEMY OF SCIENCES-TECHNICAL SCIENCES, 2021, 69 (01)
  • [8] Speech emotion recognition with deep convolutional neural networks
    Issa, Dias
    Demirci, M. Fatih
    Yazici, Adnan
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2020, 59
  • [9] Convolutional-Recurrent Neural Networks With Multiple Attention Mechanisms for Speech Emotion Recognition
    Jiang, Pengxu
    Xu, Xinzhou
    Tao, Huawei
    Zhao, Li
    Zou, Cairong
    [J]. IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (04) : 1564 - 1573
  • [10] Speech Emotion Recognition Using Convolutional-Recurrent Neural Networks with Attention Model
    Mu, Yawei
    Gomez, Hernandez
    Cano Montes, Antonio
    Alcaraz Martinez, Carlos
    Wang, Xuetian
    Gao, Hongmin
    [J]. 2ND INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING, INFORMATION SCIENCE AND INTERNET TECHNOLOGY, CII 2017, 2017, : 341 - 350