Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Cited by: 1

Authors:

Sun, Chenjing [1]

Zhou, Yi [2]

Huang, Xin [1]

Yang, Jichen [3]

Hou, Xianhua [1]

Affiliations:

[1] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528234, Peoples R China

[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117583, Singapore

[3] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou 510640, Peoples R China

Source:

ELECTRONICS | 2024 / Vol. 13 / Issue 06

Keywords:

speech emotion recognition (SER); wav2vec 2.0; contrastive learning; MODEL

DOI:

10.3390/electronics13061103

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Speech emotion recognition (SER) is challenging because emotions are expressed in varied ways through intonation and speech rate. To reduce the loss of emotional information during recognition and to improve the extraction and classification of speech emotions, we propose a two-part approach. First, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. Second, ConLearnNet is employed for emotion classification. ConLearnNet comprises three steps: feature learning, contrastive learning, and classification. Feature learning transforms the input, while contrastive learning encourages similar representations for samples from the same category and discriminative representations for samples from different categories. Experimental results on the IEMOCAP and EMO-DB datasets show that the proposed method outperforms state-of-the-art systems: it achieves a weighted accuracy (WA) and unweighted average recall (UAR) of 72.86% and 72.85% on IEMOCAP, and 97.20% and 96.41% on EMO-DB, respectively.
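
The contrastive-learning step described in the abstract can be illustrated with a small sketch. The abstract does not give the loss used in ConLearnNet, so the code below assumes a generic supervised contrastive loss (pull same-label embeddings together, push different-label embeddings apart). The function names, the temperature value, and the toy 2-D embeddings are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed form, not taken from the paper).

    For each anchor, the loss is the mean negative log-softmax probability of
    its same-label samples among all other samples. Anchors without any
    same-label partner are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity between the anchor and every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        losses.append(-sum(sims[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D "emotion embeddings": two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_labels = ["angry", "angry", "sad", "sad"]  # labels agree with the clusters
mixed_labels = ["angry", "sad", "angry", "sad"]    # labels cut across the clusters

loss_aligned = supervised_contrastive_loss(toy_embeddings, aligned_labels)
loss_mixed = supervised_contrastive_loss(toy_embeddings, mixed_labels)
print(loss_aligned < loss_mixed)
```

In this toy setting the loss is lower when samples of the same category already sit close together, which is the behaviour the abstract attributes to contrastive learning.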

Pages: 19
