Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual Tasks

Cited by: 7
Authors:
Goncalves, Lucas [1]
Busso, Carlos [1]
Affiliations:
[1] Univ Texas Dallas, Dept Elect & Comp Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75080 USA
Source: INTERSPEECH 2022
Keywords: self-supervised learning; speech emotion recognition; audiovisual tasks
DOI: 10.21437/Interspeech.2022-11012
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pre-text tasks for SSL to improve SER. While our target application is SER, the proposed pre-text tasks include audiovisual formulations, leveraging the relationship between acoustic and facial features. Our proposed approach introduces three new unimodal and multimodal pre-text tasks that are carefully designed to learn better representations for predicting emotional cues from speech. Task 1 predicts energy variations (high or low) from a speech sequence. Task 2 uses speech features to predict facial activation (high or low) based on facial landmark movements. Task 3 performs a multi-class emotion recognition task on emotional labels obtained from combinations of action units (AUs) detected across a video sequence. We pre-train a network with 60.92 hours of unlabeled data, fine-tuning the model for the downstream SER task. The results on the CREMA-D dataset show that the model pre-trained on the proposed domain-specific pre-text tasks significantly improves the precision (up to 5.1%), recall (up to 4.5%), and F1-scores (up to 4.9%) of our SER system.
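As a rough illustration of the kind of labels pre-text Task 1 consumes, the sketch below frames a waveform and marks each frame as high or low energy relative to the utterance's median frame energy. The frame length, hop size, RMS measure, and median threshold are illustrative assumptions, not the authors' exact recipe.

```python
import numpy as np

def energy_pretext_labels(waveform: np.ndarray,
                          frame_len: int = 400,
                          hop: int = 160) -> np.ndarray:
    """Label each frame of a speech sequence as high (1) or low (0)
    energy, in the spirit of pre-text Task 1 (predicting energy
    variations). Illustrative sketch, not the paper's implementation.
    """
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop)
    # Short-time RMS energy per frame.
    energies = np.array([
        np.sqrt(np.mean(waveform[i * hop:i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    # Binarize: frames above the utterance median count as "high energy".
    return (energies > np.median(energies)).astype(np.int64)

# Toy usage: a quiet half followed by a loud half at a 16 kHz rate.
rng = np.random.default_rng(0)
wav = np.concatenate([0.01 * rng.standard_normal(8000),
                      0.5 * rng.standard_normal(8000)])
labels = energy_pretext_labels(wav)
```

In an SSL setup, these binary frame labels would serve as free supervision: the encoder is trained to predict them from the raw audio, with no human annotation required.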
Pages: 1168-1172 (5 pages)