MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0

被引:31
|
作者
Sharma, Mayank [1 ]
机构
[1] Amazon, Chennai, Tamil Nadu, India
关键词
Multi-task Multi-lingual speech emotion recognition; Pre-trained wav2vec 2.0; PANN;
D O I
10.1109/ICASSP43922.2022.9747417
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Speech Emotion Recognition (SER) has several use cases for Digital Entertainment Content (DEC) in Over-the-top (OTT) services, emotive Text-to-Speech (TTS) engines and voice assistants. In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio only SER system based on the multi-lingual pre-trained wav2vec 2.0 model. The model is fine-tuned on 25 open source datasets in 13 locales across 7 emotion categories. We show that, a) Our wav2vec 2.0 single task based model outperforms Pre-trained Audio Neural Network (PANN) based single task pre-trained model by 7.2% (relative), b) The best MTL model outperforms the PANN based and wav2vec 2.0 based single task models by 8.6% and 1.7% (relative) respectively, c) The MTL based system outperforms pre-trained single task wav2vec 2.0 model in 9 out of 13 locales in terms of weighted F1 scores, and d) The MTL-MLi wav2vec 2.0 outperforms the state-of-the-art for the languages contained in the pre-training corpora.
引用
收藏
页码:6907 / 6911
页数:5
相关论文
共 50 条
  • [1] Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
    Pepino, Leonardo
    Riera, Pablo
    Ferrer, Luciana
    INTERSPEECH 2021, 2021, : 3400 - 3404
  • [2] WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition
    Li, Feng
    Luo, Jiusong
    Xia, Wanjun
    MULTIMEDIA MODELING, MMM 2025, PT IV, 2025, 15523 : 325 - 336
  • [3] Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
    Stefanel Gris, Lucas Rafael
    Casanova, Edresson
    de Oliveira, Frederico Santos
    Soares, Anderson da Silva
    Candido Junior, Arnaldo
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 333 - 343
  • [4] Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition
    Zhao, Zihan
    Wang, Yanfeng
    Wang, Yu
    INTERSPEECH 2022, 2022, : 4725 - 4729
  • [5] Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition
    Sun, Chenjing
    Zhou, Yi
    Huang, Xin
    Yang, Jichen
    Hou, Xianhua
    ELECTRONICS, 2024, 13 (06)
  • [6] Speech Emotion Recognition Based on Shallow Structure of Wav2vec 2.0 and Attention Mechanism
    Zhang, Yumei
    Jia, Maoshen
    Cao, Xuan
    Zhao, Zichen
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 398 - 402
  • [7] Speech recognition model design for Sundanese language using WAV2VEC 2.0
    Cryssiover A.
    Zahra A.
    International Journal of Speech Technology, 2024, 27 (01) : 171 - 177
  • [8] Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
    Park, Somin
    Mark, Mpabulungi
    Park, Bogyung
    Hong, Hyunki
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 77 (01): : 1009 - 1030
  • [9] Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0
    Kunesova, Marie
    Rezackova, Marketa
    TEXT, SPEECH, AND DIALOGUE (TSD 2022), 2022, 13502 : 377 - 388
  • [10] SERAB: A MULTI-LINGUAL BENCHMARK FOR SPEECH EMOTION RECOGNITION
    Scheidwasser-Clow, Neil
    Kegler, Mikolaj
    Beckmann, Pierre
    Cernak, Milos
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7697 - 7701