A dynamic-static feature fusion learning network for speech emotion recognition

Cited: 0
Authors
Xue, Peiyun [1 ,2 ]
Gao, Xiang [1 ]
Bai, Jing [1 ]
Dong, Zhenan [1 ]
Wang, Zhiyu [1 ]
Xu, Jiangshuai [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China
Keywords
Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus; RECURRENT;
DOI
10.1016/j.neucom.2025.129836
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech is a paramount mode of human communication, and enhancing the quality and fluency of Human-Computer Interaction (HCI) benefits greatly from Speech Emotion Recognition (SER). Feature representation poses a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from Mel Frequency Cepstrum Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a training separation strategy. Evaluating the proposed model on two public datasets yielded a Weighted Accuracy (WA) of 93.33% and an Unweighted Accuracy (UA) of 93.12% on the RAVDESS dataset, and 94.95% WA and 94.56% UA on the EmoDB dataset, outperforming State-Of-The-Art (SOTA) results. The effectiveness of each module is validated by ablation experiments, and a generalization analysis is carried out on cross-corpus SER tasks.
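The abstract's core idea — fusing a static feature group (from Log-Mel spectrograms) with a dynamic feature group (from MFCC), then reducing redundancy with PCA before classification — can be sketched as follows. This is only an illustrative sketch under stated assumptions: the actual TFSC, L-TCNet, and connection attention modules are not specified here, so random arrays stand in for the two extracted feature groups, and a plain SVD-based PCA stands in for the paper's PCA-optimized attention module.

```python
import numpy as np

# Illustrative sketch only: random arrays stand in for the two feature
# groups the paper extracts; the actual ML-DSF modules (TFSC, L-TCNet,
# connection attention) are not reproduced here.
rng = np.random.default_rng(0)

# Hypothetical feature groups for a batch of 32 utterances:
static_feats = rng.normal(size=(32, 128))   # e.g. CNN features from Log-Mel spectrograms
dynamic_feats = rng.normal(size=(32, 64))   # e.g. TCN features from MFCC sequences

# Direct concatenation of the two groups (the naive fusion the paper
# argues is insufficient on its own due to redundant information):
fused = np.concatenate([static_feats, dynamic_feats], axis=1)  # shape (32, 192)

# PCA via SVD on the centered fusion, keeping the top-k components
# to suppress redundant dimensions before classification:
k = 16
centered = fused - fused.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T  # compact hybrid representation, shape (32, 16)

print(fused.shape, reduced.shape)  # → (32, 192) (32, 16)
```

In the paper itself, the dimensionality reduction is paired with an attention mechanism so that complementary information between the static and dynamic groups is enhanced, not merely compressed; the sketch above shows only the redundancy-reduction half of that design.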
Pages: 15
Related Papers
50 records in total
  • [41] Multimodal emotion recognition from facial expression and speech based on feature fusion
    Tang, Guichen
    Xie, Yue
    Li, Ke
    Liang, Ruiyu
    Zhao, Li
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (11) : 16359 - 16373
  • [43] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022, : 1988 - 1992
  • [44] Teager_Mel and PLP Fusion Feature Based Speech Emotion Recognition
    Chen, Xiao
    Li, Haifeng
    Ma, Lin
    Liu, Xinlei
    Chen, Jing
    2015 FIFTH INTERNATIONAL CONFERENCE ON INSTRUMENTATION AND MEASUREMENT, COMPUTER, COMMUNICATION AND CONTROL (IMCCC), 2015, : 1109 - 1114
  • [45] Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech
    Planet, Santiago
    Iriondo, Ignasi
    INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE, 2012, 1 (06): 44 - 51
  • [46] Feature Selection Based Transfer Subspace Learning for Speech Emotion Recognition
    Song, Peng
    Zheng, Wenming
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2020, 11 (03) : 373 - 382
  • [47] Speech Emotion Recognition using Feature Selection with Adaptive Structure Learning
    Rayaluru, Akshay
    Bandela, Surekha Reddy
    Kumar, T. Kishore
    2019 IEEE INTERNATIONAL SYMPOSIUM ON SMART ELECTRONIC SYSTEMS (ISES 2019), 2019, : 233 - 236
  • [48] Joint subspace learning and feature selection method for speech emotion recognition
    Song P.
    Zheng W.
    Zhao L.
    JOURNAL OF TSINGHUA UNIVERSITY, 2018, 58: 347 - 351
  • [49] UNSUPERVISED LEARNING APPROACH TO FEATURE ANALYSIS FOR AUTOMATIC SPEECH EMOTION RECOGNITION
    Eskimez, Sefik Emre
    Duan, Zhiyao
    Heinzelman, Wendi
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5099 - 5103
  • [50] Surface roughness prediction based on fusion of dynamic-static data
    Wang, Jiayi
    Wu, Xingfu
    Huang, Qiangfei
    Mu, Quanchen
    Yang, Wenjie
    Yang, Hua
    Li, Zirui
    MEASUREMENT, 2025, 243