Multi-type features separating fusion learning for Speech Emotion Recognition

Cited: 12
Authors
Xu, Xinlei [1 ,2 ]
Li, Dongdong [2 ]
Zhou, Yijun [2 ]
Wang, Zhe [1 ,2 ]
Affiliations
[1] Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
[2] Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Funding
National Natural Science Foundation of China
Keywords
Speech emotion recognition; Hybrid feature selection; Feature-level fusion; Speaker-independent; Convolutional neural networks; GMM; Representations; Classification; Adaptation; Recurrent; CNN
DOI
10.1016/j.asoc.2022.109648
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Speech Emotion Recognition (SER) is a challenging task for improving human-computer interaction. Speech data admit multiple representations, and choosing features that appropriately express the emotion behind speech is difficult. The human brain judges the same object across different dimensional representations to reach a final decision. Inspired by this, we believe that different representations of speech data can offer complementary advantages. We therefore propose a Hybrid Deep Learning with Multi-type features Model (HD-MFM) that integrates the acoustic, temporal, and image information of speech. Specifically, a Convolutional Neural Network (CNN) extracts image information from the speech spectrogram, a Deep Neural Network (DNN) extracts acoustic information from utterance-level statistical features, and a Long Short-Term Memory (LSTM) network extracts temporal information from the Mel-Frequency Cepstral Coefficients (MFCC) of speech. The three feature types are then concatenated into a richer emotion representation with better discriminative properties. Because different fusion strategies affect the relationship between features, we consider two strategies, named separating and merging. To evaluate the feasibility and effectiveness of the proposed HD-MFM, we perform extensive experiments on the EMO-DB and IEMOCAP corpora. The results show that the separating strategy exploits feature complementarity more effectively: HD-MFM achieves 91.25% on EMO-DB and 72.02% on IEMOCAP. These results indicate that HD-MFM, using the separating strategy, makes full use of complementary feature representations to further enhance SER performance. (c) 2022 Elsevier B.V. All rights reserved.
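To make the three-branch fusion idea concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a CNN branch for spectrogram images, a DNN branch for utterance-level statistical features, an LSTM branch for MFCC sequences, and feature-level concatenation before the classifier. The class name HDMFMSketch, all layer sizes, input shapes, and the classifier head are illustrative assumptions; the abstract specifies only the branch types and the concatenation fusion, not the paper's exact configuration or the details of the separating strategy.

```python
# Hypothetical sketch of the HD-MFM fusion idea, NOT the authors' exact model.
# Layer widths, input shapes, and the 7-class head (EMO-DB has seven emotion
# categories) are assumptions for illustration.
import torch
import torch.nn as nn

class HDMFMSketch(nn.Module):
    def __init__(self, n_stats=384, n_mfcc=40, n_classes=7):
        super().__init__()
        # CNN branch: image information from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),        # -> 32*4*4 = 512
        )
        # DNN branch: acoustic information from utterance-level statistics.
        self.dnn = nn.Sequential(
            nn.Linear(n_stats, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # LSTM branch: temporal information from the MFCC sequence.
        self.lstm = nn.LSTM(n_mfcc, 128, batch_first=True)
        # Feature-level fusion: concatenate the three representations.
        self.head = nn.Linear(512 + 128 + 128, n_classes)

    def forward(self, spec, stats, mfcc):
        # spec: (B, 1, H, W); stats: (B, n_stats); mfcc: (B, T, n_mfcc)
        f_img = self.cnn(spec)
        f_acoustic = self.dnn(stats)
        _, (h, _) = self.lstm(mfcc)
        f_temporal = h[-1]                                 # last hidden state
        fused = torch.cat([f_img, f_acoustic, f_temporal], dim=1)
        return self.head(fused)

# Smoke test with random tensors standing in for one batch of speech features.
model = HDMFMSketch()
logits = model(torch.randn(2, 1, 128, 128),   # spectrogram images
               torch.randn(2, 384),           # statistical features
               torch.randn(2, 50, 40))        # MFCC frames
print(logits.shape)  # torch.Size([2, 7])
```

In this reading, the "merging" strategy would train the three branches jointly end-to-end as above, while "separating" would train each branch on its own feature type before fusing the learned representations; the abstract reports the separating variant as the stronger of the two.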
Pages: 13