Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual Tasks

Cited by: 7
Authors
Goncalves, Lucas [1 ]
Busso, Carlos [1 ]
Affiliation
[1] University of Texas at Dallas, Department of Electrical and Computer Engineering, Multimodal Signal Processing (MSP) Lab, Richardson, TX 75080, USA
Source
INTERSPEECH 2022
Keywords
self-supervised learning; speech emotion recognition; audiovisual tasks;
DOI
10.21437/Interspeech.2022-11012
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline Classification Codes
070206 ; 082403 ;
Abstract
Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pre-text tasks for SSL to improve SER. While our target application is SER, the proposed pre-text tasks include audiovisual formulations, leveraging the relationship between acoustic and facial features. Our proposed approach introduces three new unimodal and multimodal pre-text tasks that are carefully designed to learn better representations for predicting emotional cues from speech. Task 1 predicts energy variations (high or low) from a speech sequence. Task 2 uses speech features to predict facial activation (high or low) based on facial landmark movements. Task 3 performs a multi-class emotion recognition task on emotional labels obtained from combinations of action units (AUs) detected across a video sequence. We pre-train a network with 60.92 hours of unlabeled data, fine-tuning the model for the downstream SER task. The results on the CREMA-D dataset show that the model pre-trained on the proposed domain-specific pre-text tasks significantly improves the precision (up to 5.1%), recall (up to 4.5%), and F1-scores (up to 4.9%) of our SER system.
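The abstract describes a multi-task pre-training setup in which three pre-text heads share a single speech encoder. The PyTorch sketch below is a minimal illustration of that idea only; the class name PretextSSLModel, the BiGRU encoder, the pooling strategy, and all dimensions are assumptions for illustration and do not reflect the authors' actual architecture or released code.

```python
# Hypothetical sketch (not the authors' implementation): a shared speech
# encoder with three pre-text heads matching the tasks in the abstract.
import torch
import torch.nn as nn

class PretextSSLModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_emotions=6):
        super().__init__()
        # Shared acoustic encoder (placeholder: BiGRU over frame-level features).
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        enc_out = 2 * hidden
        # Task 1: binary prediction of speech energy variation (high vs. low).
        self.energy_head = nn.Linear(enc_out, 2)
        # Task 2: binary prediction of facial activation from speech features.
        self.face_head = nn.Linear(enc_out, 2)
        # Task 3: multi-class emotion labels derived from AU combinations.
        self.au_emotion_head = nn.Linear(enc_out, num_emotions)

    def forward(self, speech_feats):
        # speech_feats: (batch, time, feat_dim) frame-level acoustic features.
        out, _ = self.encoder(speech_feats)
        pooled = out.mean(dim=1)  # simple temporal average pooling
        return (self.energy_head(pooled),
                self.face_head(pooled),
                self.au_emotion_head(pooled))

def pretext_loss(model, speech_feats, energy_lbl, face_lbl, emo_lbl):
    # Joint pre-training objective: sum of the three pre-text task losses.
    ce = nn.CrossEntropyLoss()
    e_logits, f_logits, a_logits = model(speech_feats)
    return ce(e_logits, energy_lbl) + ce(f_logits, face_lbl) + ce(a_logits, emo_lbl)

# Toy usage with random tensors, just to show the shapes involved.
model = PretextSSLModel()
x = torch.randn(8, 200, 40)                      # 8 utterances, 200 frames, 40-dim features
loss = pretext_loss(model, x,
                    torch.randint(0, 2, (8,)),   # Task 1 labels: energy high/low
                    torch.randint(0, 2, (8,)),   # Task 2 labels: facial activation high/low
                    torch.randint(0, 6, (8,)))   # Task 3 labels: AU-derived emotion class
loss.backward()
```

In this sketch, fine-tuning for the downstream SER task would keep the shared encoder and replace the pre-text heads with an emotion classifier trained on the labeled CREMA-D data, following the pre-train-then-fine-tune recipe the abstract describes.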
Pages: 1168-1172
Number of pages: 5