Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Cited by: 1
Authors
Sun, Chenjing [1 ]
Zhou, Yi [2 ]
Huang, Xin [1 ]
Yang, Jichen [3 ]
Hou, Xianhua [1 ]
Affiliations
[1] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528234, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117583, Singapore
[3] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou 510640, Peoples R China
Keywords
speech emotion recognition (SER); wav2vec 2.0; contrastive learning; model
DOI
10.3390/electronics13061103
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Speech emotion recognition is challenging because emotions are expressed through varied intonation and speech rates. To reduce the loss of emotional information during recognition and to improve the extraction and classification of speech emotions, we propose a twofold approach. First, a feed-forward network with skip connections (SCFFN) is introduced to fine-tune wav2vec 2.0 and extract emotion embeddings. ConLearnNet is then employed for emotion classification. ConLearnNet comprises three steps: feature learning, contrastive learning, and classification. The feature-learning step transforms the input, while contrastive learning encourages similar representations for samples from the same category and discriminative representations across different categories. Experimental results on the IEMOCAP and EMO-DB datasets demonstrate the superiority of the proposed method over state-of-the-art systems: we achieve a WA of 72.86% and a UAR of 72.85% on IEMOCAP, and 97.20% and 96.41%, respectively, on EMO-DB.
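The two ingredients named in the abstract can be sketched in a few lines. The following is a hypothetical NumPy illustration, not the paper's implementation: `scffn_block` shows a feed-forward layer with a residual (skip) connection in the spirit of SCFFN, and `supervised_contrastive_loss` shows a common SupCon-style formulation of the contrastive objective (pull same-emotion embeddings together, push different emotions apart); the paper's exact layer sizes, activations, and loss may differ.

```python
import numpy as np

def scffn_block(x, W1, b1, W2, b2):
    """Feed-forward block with a skip connection (hypothetical SCFFN sketch)."""
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU feed-forward transform
    return x + h @ W2 + b2             # residual (skip) connection

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style objective: for each anchor, maximize the softmax
    probability of its same-label samples among all other samples."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                 # scaled cosine similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)     # exclude the anchor itself
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.sum(axis=1) > 0               # anchors with >= 1 positive
    per_anchor = (-np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos]
                  / pos.sum(axis=1)[has_pos])   # mean -log p over positives
    return per_anchor.mean()
```

With correct emotion labels the loss is lower than with shuffled labels, since same-label pairs then have higher similarity, which is exactly the behavior the abstract attributes to the contrastive-learning step.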
Pages: 19
Related papers
50 records in total
  • [21] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
    Baevski, Alexei
    Zhou, Henry
    Mohamed, Abdelrahman
    Auli, Michael
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [22] W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition
    Kim, Dong-Hyun
    Lee, Jae-Hong
    Mo, Ji-Hwan
    Chang, Joon-Hyuk
    INTERSPEECH 2022, 2022, : 3038 - 3042
  • [23] A study on fine-tuning wav2vec2.0 Model for the task of Mispronunciation Detection and Diagnosis
    Peng, Linkai
    Fu, Kaiqi
    Lin, Binghuai
    Ke, Dengfeng
    Zhan, Jinsong
    INTERSPEECH 2021, 2021, : 4448 - 4452
  • [24] Wav2f0: Exploring the Potential of Wav2vec 2.0 for Speech Fundamental Frequency Extraction
    Feng, Rui
    Liu, Yin-Long
    Ling, Zhen-Hua
    Yuan, Jia-Hong
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 169 - 173
  • [25] Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier
    Wang, Ni
    Yang, Danyu
    PLOS ONE, 2025, 20 (02):
  • [26] Improving Speech Translation Accuracy and Time Efficiency With Fine-Tuned wav2vec 2.0-Based Speech Segmentation
    Fukuda, Ryo
    Sudoh, Katsuhito
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 906 - 916
  • [27] Exploring wav2vec 2.0 on speaker verification and language identification
    Fan, Zhiyun
    Li, Meng
    Zhou, Shiyu
    Xu, Bo
    INTERSPEECH 2021, 2021, : 1509 - 1513
  • [28] On-demand compute reduction with stochastic wav2vec 2.0
    Vyas, Apoorv
    Hsu, Wei-Ning
    Auli, Michael
    Baevski, Alexei
    INTERSPEECH 2022, 2022, : 3048 - 3052
  • [29] Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0
    Bayerl, Sebastian P.
    Wagner, Dominik
    Noeth, Elmar
    Riedhammer, Korbinian
    INTERSPEECH 2022, 2022, : 2868 - 2872
  • [30] Unsupervised Spoken Term Discovery Using wav2vec 2.0
    Iwamoto, Yu
    Shinozaki, Takahiro
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 1082 - 1086