MULTI-LINGUAL MULTI-TASK SPEECH EMOTION RECOGNITION USING WAV2VEC 2.0

Cited by: 31

Author:

Sharma, Mayank [1]

Affiliation:

[1] Amazon, Chennai, Tamil Nadu, India

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Multi-task Multi-lingual speech emotion recognition; Pre-trained wav2vec 2.0; PANN;

DOI:

10.1109/ICASSP43922.2022.9747417

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Speech Emotion Recognition (SER) has several use cases for Digital Entertainment Content (DEC) in Over-the-top (OTT) services, emotive Text-to-Speech (TTS) engines and voice assistants. In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio-only SER system based on the multi-lingual pre-trained wav2vec 2.0 model. The model is fine-tuned on 25 open-source datasets in 13 locales across 7 emotion categories. We show that: a) our wav2vec 2.0 based single-task model outperforms the Pre-trained Audio Neural Network (PANN) based single-task pre-trained model by 7.2% (relative); b) the best MTL model outperforms the PANN based and wav2vec 2.0 based single-task models by 8.6% and 1.7% (relative), respectively; c) the MTL based system outperforms the pre-trained single-task wav2vec 2.0 model in 9 out of 13 locales in terms of weighted F1 scores; and d) the MTL-MLi wav2vec 2.0 model outperforms the state of the art for the languages contained in the pre-training corpora.

Pages: 6907-6911

Number of pages: 5
