Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Cited by: 1
Authors
Sun, Chenjing [1]
Zhou, Yi [2]
Huang, Xin [1]
Yang, Jichen [3]
Hou, Xianhua [1]
Affiliations
[1] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528234, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117583, Singapore
[3] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou 510640, Peoples R China
Keywords
speech emotion recognition (SER); wav2vec 2.0; contrastive learning; MODEL
DOI
10.3390/electronics13061103
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Speech emotion recognition is challenging because emotions are expressed in varied ways through intonation and speech rate. To reduce the loss of emotional information during recognition and to improve the extraction and classification of speech emotions, we propose a two-part approach. First, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. Second, ConLearnNet is employed for emotion classification. ConLearnNet comprises three steps: feature learning, contrastive learning, and classification. Feature learning transforms the input, while contrastive learning encourages similar representations for samples from the same category and discriminative representations for samples from different categories. Experimental results on the IEMOCAP and EMO-DB datasets show that the proposed method outperforms state-of-the-art systems. It achieves a weighted accuracy (WA) and unweighted average recall (UAR) of 72.86% and 72.85% on IEMOCAP, and 97.20% and 96.41% on EMO-DB, respectively.
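The two reported metrics can be illustrated with a minimal sketch. It assumes the definitions commonly used in speech emotion recognition (WA as overall accuracy, UAR as the mean of per-class recalls); the record does not spell these out. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA (assumed definition): correct predictions divided by all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """UAR (assumed definition): mean of per-class recalls, each class counted equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly recognised samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / n for c, n in total.items()]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: an imbalanced set where the majority class dominates WA.
y_true = ["ang", "ang", "ang", "ang", "hap", "sad"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "neu"]

wa = weighted_accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(f"WA  = {wa:.4f}")
print(f"UAR = {uar:.4f}")
```

On imbalanced data the two values diverge, which is why both are usually reported together; the near-identical WA and UAR on IEMOCAP above suggest similar performance across classes.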
Pages: 19
Related papers
50 in total
  • [1] WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition
    Li, Feng
    Luo, Jiusong
    Xia, Wanjun
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 325 - 336
  • [2] Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
    Pepino, Leonardo
    Riera, Pablo
    Ferrer, Luciana
    INTERSPEECH 2021, 2021, : 3400 - 3404
  • [3] Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction
    Becerra, Helard
    Ragano, Alessandro
    Hines, Andrew
    INTERSPEECH 2022, 2022, : 4088 - 4092
  • [4] Speech Emotion Recognition Based on Shallow Structure of Wav2vec 2.0 and Attention Mechanism
    Zhang, Yumei
    Jia, Maoshen
    Cao, Xuan
    Zhao, Zichen
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 398 - 402
  • [5] Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
    Stefanel Gris, Lucas Rafael
    Casanova, Edresson
    de Oliveira, Frederico Santos
    Soares, Anderson da Silva
    Candido Junior, Arnaldo
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 333 - 343
  • [6] MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0
    Sharma, Mayank
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6907 - 6911
  • [7] FINE-TUNING WAV2VEC2 FOR SPEAKER RECOGNITION
    Vaessen, Nik
    Van Leeuwen, David A.
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7967 - 7971
  • [8] Speech recognition model design for Sundanese language using WAV2VEC 2.0
    Cryssiover A.
    Zahra A.
    International Journal of Speech Technology, 2024, 27 (01) : 171 - 177
  • [9] Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
    Park, Somin
    Mark, Mpabulungi
    Park, Bogyung
    Hong, Hyunki
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 77 (01) : 1009 - 1030
  • [10] Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
    Zhao, Zihan
    Wang, Yanfeng
    Wang, Yu
    INTERSPEECH 2022, 2022, : 4725 - 4729