Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Cited: 19
|
Authors
Zhang, Hua [1 ,2 ]
Gou, Ruoyun [1 ]
Shang, Jili [1 ]
Shen, Fangyao [1 ]
Wu, Yifan [1 ,3 ]
Dai, Guojun [1 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Zhejiang Univ, Key Lab Network Multimedia Technol Zhejiang Prov, Hangzhou, Peoples R China
[3] Hangzhou Dianzi Univ, Key Lab Brain Machine Collaborat Intelligence Zhe, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; deep convolutional neural network; attention mechanism; long short-term memory; deep neural network; FEATURES;
DOI
10.3389/fphys.2021.643202
Chinese Library Classification
Q4 [Physiology];
Discipline code
071003;
Abstract
Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance between different speakers. SER performance relies heavily on the features extracted from speech signals, and establishing an effective feature-extraction and classification model remains challenging. In this paper, we propose a new SER method based on a Deep Convolutional Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. A DCNN model pre-trained on the ImageNet dataset is then applied to generate segment-level features, which are stacked across a sentence into utterance-level features. Next, we adopt a BLSTM to learn high-level emotional features with temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, outperforming most popular SER methods and demonstrating the effectiveness of our proposed method.
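The attention layer described in the abstract pools frame-level BLSTM outputs into a single utterance-level vector by weighting emotionally relevant frames. The following is a minimal numpy sketch of such attention pooling; the dot-product scoring function, dimensions, and variable names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attention_pool(H, w):
    """Pool frame-level features into one utterance-level vector.

    H: (T, d) array of frame-level features (e.g. BLSTM outputs over T frames).
    w: (d,) learnable attention vector (assumed dot-product scoring).
    Returns a (d,) attention-weighted average of the frames.
    """
    scores = H @ w                       # (T,) unnormalized relevance score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over time steps
    return alpha @ H                     # (d,) weighted sum of frames

# Toy usage: 50 frames of 128-dim features.
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 128))
w = rng.standard_normal(128)
u = attention_pool(H, w)
print(u.shape)  # (128,)
```

In the paper's pipeline, a vector like `u` would then be passed to the final DNN classifier; in training, `w` would be learned jointly with the rest of the network rather than fixed.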
Pages: 13
Related Papers
50 items
  • [1] EEG emotion recognition based on the attention mechanism and pre-trained convolution capsule network
    Liu, Shuaiqi
    Wang, Zeyao
    An, Yanling
    Zhao, Jie
    Zhao, Yingying
    Zhang, Yu-Dong
    KNOWLEDGE-BASED SYSTEMS, 2023, 265
  • [2] A Novel Policy for Pre-trained Deep Reinforcement Learning for Speech Emotion Recognition
    Rajapakshe, Thejan
    Rana, Rajib
    Khalifa, Sara
    Liu, Jiajun
    Schuller, Bjorn
    2022 AUSTRALIAN COMPUTER SCIENCE WEEK (ACSW 2022), 2022, : 96 - 105
  • [3] Automatic Topic Labeling model with Paired-Attention based on Pre-trained Deep Neural Network
    He, Dongbin
    Ren, Yanzhao
    Khattak, Abdul Mateen
    Liu, Xinliang
    Tao, Sha
    Gao, Wanlin
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [4] Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
    Minh Tran
    Yin, Yufeng
    Soleymani, Mohammad
    INTERSPEECH 2023, 2023, : 636 - 640
  • [5] Image Hashing by Pre-Trained Deep Neural Network
    Li Pingyuan
    Zhang Dan
    Yuan Xiaoguang
    Jiang Suiping
    2022 ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING (CACML 2022), 2022, : 468 - 471
  • [6] Fast Learning for Accurate Object Recognition Using a Pre-trained Deep Neural Network
    Lobato-Rios, Victor
    Tenorio-Gonzalez, Ana C.
    Morales, Eduardo F.
    ADVANCES IN SOFT COMPUTING, MICAI 2017, PT I, 2018, 10632 : 41 - 53
  • [7] On the Usage of Pre-Trained Speech Recognition Deep Layers to Detect Emotions
    Oliveira, Jorge
    Praca, Isabel
    IEEE ACCESS, 2021, 9 : 9699 - 9705
  • [8] MTLSER: Multi-task learning enhanced speech emotion recognition with pre-trained acoustic model
    Chen, Zengzhao
    Liu, Chuan
    Wang, Zhifeng
    Zhao, Chuanxu
    Lin, Mengting
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 273
  • [9] Unsupervised pre-trained filter learning approach for efficient convolution neural network
    Rehman, Sadaqat Ur
    Tu, Shanshan
    Waqas, Muhammad
    Huang, Yongfeng
    Rehman, Obaid Ur
    Ahmad, Basharat
    Ahmad, Salman
    NEUROCOMPUTING, 2019, 365 : 171 - 190
  • [10] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
    Dahl, George E.
    Yu, Dong
    Deng, Li
    Acero, Alex
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01): : 30 - 42