A dynamic-static feature fusion learning network for speech emotion recognition

Cited: 0
Authors
Xue, Peiyun [1 ,2 ]
Gao, Xiang [1 ]
Bai, Jing [1 ]
Dong, Zhenan [1 ]
Wang, Zhiyu [1 ]
Xu, Jiangshuai [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Elect Informat Engn, Taiyuan 030024, Peoples R China
[2] Shanxi Acad Adv Res & Innovat, Taiyuan 030032, Peoples R China
Keywords
Speech emotion recognition; Multi-feature Learning Network; Dynamic-Static feature fusion; Hybrid feature representation; Attention mechanism; Cross-corpus; RECURRENT
DOI
10.1016/j.neucom.2025.129836
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Speech is a paramount mode of human communication, and Speech Emotion Recognition (SER) contributes significantly to improving the quality and fluency of Human-Computer Interaction (HCI). Feature representation poses a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from Mel-Frequency Cepstral Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a training separation strategy. Evaluated on two public datasets, the proposed model achieves a Weighted Accuracy (WA) of 93.33% and an Unweighted Accuracy (UA) of 93.12% on the RAVDESS dataset, and 94.95% WA and 94.56% UA on the EmoDB dataset, outperforming State-Of-The-Art (SOTA) results. The effectiveness of each module is validated by ablation experiments, and generalization is analyzed on cross-corpus SER tasks.
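The exact layer configurations of ML-DSF are not given in this record, so the following is a minimal PyTorch sketch of the dynamic-static fusion idea described in the abstract. A small CNN stands in for the TFSC-assisted static branch on Log-Mel spectrograms, a dilated 1-D convolution stack stands in for L-TCNet on MFCC, and a simple learned gate replaces the PCA-optimized connection attention; all module names, dimensions, and the fusion form are illustrative assumptions rather than the authors' published architecture.

# Hypothetical sketch only: internals are assumptions, not the published ML-DSF.
import torch
import torch.nn as nn

class StaticBranch(nn.Module):
    """Stand-in for the TFSC-assisted CNN on Log-Mel spectrograms (static features)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                # global pooling over time-frequency
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, logmel):                      # logmel: (B, 1, n_mels, T)
        return self.fc(self.conv(logmel).flatten(1))

class DynamicBranch(nn.Module):
    """Stand-in for L-TCNet: dilated 1-D convolutions over MFCC frames (dynamic features).
    Symmetric padding is used for brevity; a causal variant would pad on the left only."""
    def __init__(self, n_mfcc=40, out_dim=128):
        super().__init__()
        self.tcn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )

    def forward(self, mfcc):                        # mfcc: (B, n_mfcc, T)
        return self.tcn(mfcc).mean(dim=-1)          # temporal average pooling

class MLDSF(nn.Module):
    """Fuses both branches with a learned gate (simplified stand-in for the
    PCA-optimized connection attention) before emotion classification."""
    def __init__(self, n_classes=8, dim=128):
        super().__init__()
        self.static_branch = StaticBranch(out_dim=dim)
        self.dynamic_branch = DynamicBranch(out_dim=dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, logmel, mfcc):
        s = self.static_branch(logmel)
        d = self.dynamic_branch(mfcc)
        w = self.gate(torch.cat([s, d], dim=-1))    # per-sample branch weights
        fused = w[:, :1] * s + w[:, 1:] * d         # attention-weighted fusion
        return self.classifier(fused)

model = MLDSF()                                     # 8 classes, as in RAVDESS
logits = model(torch.randn(4, 1, 64, 200), torch.randn(4, 40, 200))
print(logits.shape)                                 # torch.Size([4, 8])

Replacing the gate with the paper's PCA-optimized connection attention, and training the two branches separately per the training separation strategy, would follow the described pipeline more faithfully.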
Pages: 15
Related Papers
50 records in total
  • [31] An autoencoder-based feature level fusion for speech emotion recognition
    Peng, Shixin
    Kai, Chen
    Tian, Tian
    Chen, Jingying
    DIGITAL COMMUNICATIONS AND NETWORKS, 2024, 10 (05) : 1341 - 1351
  • [32] Unsupervised Feature Learning for Speech Emotion Recognition Based on Autoencoder
    Ying, Yangwei
    Tu, Yuanwu
    Zhou, Hong
    ELECTRONICS, 2021, 10 (17)
  • [33] Learning Local to Global Feature Aggregation for Speech Emotion Recognition
    Lu, Cheng
    Lian, Hailun
    Zheng, Wenming
    Zong, Yuan
    Zhao, Yan
    Li, Sunan
    INTERSPEECH 2023, 2023, : 1908 - 1912
  • [34] Speech Emotion Recognition Using Global-Aware Cross-Modal Feature Fusion Network
    Li, Feng
    Luo, Jiusong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT II, 2023, 14087 : 211 - 221
  • [35] IEDSFAN: information enhancement and dynamic-static fusion attention network for traffic flow forecasting
    Yu, Lianfei
    Wang, Ziling
    Yang, Wenxi
    Qu, Zhijian
    Ren, Chongguang
    COMPLEX & INTELLIGENT SYSTEMS, 2025, 11 (01)
  • [36] A cross-modal fusion network based on graph feature learning for multimodal emotion recognition
    Cao, Xiaopeng
    Zhang, Linying
    Chen, Qiuxian
    Ning, Hailong
    Dong, Yizhuo
    THE JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS, 2024, 31 (06) : 16 - 25
  • [37] Feature representation for speech emotion recognition
    Abdollahpour, Mehdi
    Zamani, Lafar
    Rad, Hamidreza Saligheh
    2017 25TH IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE), 2017, : 1465 - 1468
  • [38] Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network
    Jiang, Wei
    Wang, Zheng
    Jin, Jesse S.
    Han, Xianfeng
    Li, Chunguang
    SENSORS, 2019, 19 (12)
  • [39] A FEATURE SELECTION AND FEATURE FUSION COMBINATION METHOD FOR SPEAKER-INDEPENDENT SPEECH EMOTION RECOGNITION
    Jin, Yun
    Song, Peng
    Zheng, Wenming
    Zhao, Li
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [40] A speech emotion recognition method for the elderly based on feature fusion and attention mechanism
    Jian, Qijian
    Xiang, Min
    Huang, Wei
    THIRD INTERNATIONAL CONFERENCE ON ELECTRONICS AND COMMUNICATION; NETWORK AND COMPUTER TECHNOLOGY (ECNCT 2021), 2022, 12167