A dynamic-static feature fusion learning network for speech emotion recognition

Times Cited: 0
|
Authors
Xue, Peiyun [1 ,2 ]
Gao, Xiang [1 ]
Bai, Jing [1 ]
Dong, Zhenan [1 ]
Wang, Zhiyu [1 ]
Xu, Jiangshuai [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China
Keywords
Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus; RECURRENT;
DOI
10.1016/j.neucom.2025.129836
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech is a paramount mode of human communication, and enhancing the quality and fluency of Human-Computer Interaction (HCI) benefits greatly from Speech Emotion Recognition (SER). Feature representation poses a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference through redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. Firstly, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from the Mel-Frequency Cepstral Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a training separation strategy. Evaluating the proposed model on two public datasets yielded a Weighted Accuracy (WA) of 93.33% and an Unweighted Accuracy (UA) of 93.12% on the RAVDESS dataset, and 94.95% WA and 94.56% UA on the EmoDB dataset. These results outperform State-Of-The-Art (SOTA) findings. Meanwhile, the effectiveness of each module is validated by ablation experiments, and a generalization analysis is carried out on cross-corpus SER tasks.
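The fusion step described in the abstract — concatenating the static (spectrogram-based) and dynamic (MFCC-based) feature groups, then applying PCA to suppress redundancy — can be sketched as follows. This is a minimal illustration with made-up dimensions and random data, not the paper's implementation; the attention module and network backbones are omitted.

```python
import numpy as np

# Illustrative sketch of dynamic-static feature fusion with PCA-based
# redundancy reduction (dimensions and data are hypothetical).
rng = np.random.default_rng(0)
static_feats = rng.standard_normal((64, 128))   # 64 utterances x 128 static dims
dynamic_feats = rng.standard_normal((64, 96))   # 64 utterances x 96 dynamic dims

# Concatenate the two feature groups into one hybrid vector per utterance.
fused = np.concatenate([static_feats, dynamic_feats], axis=1)  # shape (64, 224)

# PCA via SVD on the mean-centered fused features.
centered = fused - fused.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 32                                  # assumed target dimensionality
reduced = centered @ Vt[:k].T           # (64, 32) compact hybrid representation
```

In the paper this reduction feeds a connection attention module before classification; here it simply shows how PCA projects the concatenated features onto the top-k principal directions, discarding redundant variance shared between the two groups.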
Pages: 15
Related Papers
50 records
  • [21] Novel feature fusion method for speech emotion recognition based on multiple kernel learning
    Zhao, L. (zhaoli@seu.edu.cn), 1600, Southeast University (29):
  • [22] A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
    Tu, Zhongwen
    Liu, Bin
    Zhao, Wei
    Yan, Raoxin
    Zou, Yang
    APPLIED SCIENCES-BASEL, 2023, 13 (07):
  • [23] Speech Emotion Recognition Based on Multi Acoustic Feature Fusion
    Xiang, Shanshan
    Anwer, Sadiyagul
    Yilahun, Hankiz
    Hamdulla, Askar
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312 : 338 - 346
  • [24] Speech emotion recognition based on multimodal and multiscale feature fusion
    Hu, Huangshui
    Wei, Jie
    Sun, Hongyu
    Wang, Chuhang
    Tao, Shuo
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [25] SDTF-Net: Static and dynamic time-frequency network for Speech Emotion Recognition
    Liu, Lu-Yao
    Liu, Wen-Zhe
    Feng, Lin
    SPEECH COMMUNICATION, 2023, 148 : 1 - 8
  • [26] An autoencoder-based feature level fusion for speech emotion recognition
    Peng Shixin
    Chen Kai
    Tian Tian
    Chen Jingying
    Digital Communications and Networks, 2024, 10 (05) : 1341 - 1351
  • [27] Research on Feature Fusion Speech Emotion Recognition Technology for Smart Teaching
    Zhang, Shaoyun
    Li, Chao
    MOBILE INFORMATION SYSTEMS, 2022, 2022
  • [28] An Investigation of a Feature-Level Fusion for Noisy Speech Emotion Recognition
    Sekkate, Sara
    Khalil, Mohammed
    Adib, Abdellah
    Ben Jebara, Sofia
    COMPUTERS, 2019, 8 (04)
  • [29] Speech emotion recognition based on multi-feature and multi-lingual fusion
    Chunyi Wang
    Ying Ren
    Na Zhang
    Fuwei Cui
    Shiying Luo
    Multimedia Tools and Applications, 2022, 81 : 4897 - 4907
  • [30] Multi-feature Fusion Speech Emotion Recognition Based on SVM
    Zeng, Xiaoping
    Dong, Li
    Chen, Guanghui
    Dong, Qi
    PROCEEDINGS OF 2020 IEEE 10TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2020), 2020, : 77 - 80