Audio-Based Emotion Recognition Using Self-Supervised Learning on an Engineered Feature Space

Cited by: 0
Authors
Nimitsurachat, Peranut [1 ]
Washington, Peter [2 ]
Affiliations
[1] Stanford Univ, Inst Computat & Math Engn ICME, Stanford, CA 94305 USA
[2] Univ Hawaii Manoa, Informat & Comp Sci, Honolulu, HI 96822 USA
Funding
U.S. National Institutes of Health
Keywords
emotion classification; emotion recognition; self-supervised learning; transfer learning
DOI
10.3390/ai5010011
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods that can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we applied self-supervised pre-training to the classification of emotions from the acoustic data of the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. Unlike prior papers that experimented with raw acoustic data, our technique is applied to encoded acoustic data consisting of 74 distinctive audio features at discrete timesteps. Our model is first pre-trained to reconstruct randomly masked timesteps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is evaluated via overall mean absolute error (MAE), MAE per emotion, overall four-class accuracy, and four-class accuracy per emotion. These metrics are compared against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics, especially when the number of annotated data points in the fine-tuning step is small. Furthermore, we quantify the behavior of the self-supervised model and its convergence as the amount of annotated data increases. This work characterizes the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small and that the effect is most pronounced for emotions that are easier to classify, such as happy, sad, and angry. This work further demonstrates that self-supervised learning still improves performance when applied to embedded feature representations rather than following the traditional approach of pre-training on the raw input space.
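
The abstract describes a two-stage recipe: pre-train an encoder to reconstruct randomly masked timesteps of the 74-dimensional engineered feature sequences, then fine-tune that encoder on a small annotated sample. The following is a minimal PyTorch sketch of that recipe; the Transformer backbone, the 0.15 masking rate, the layer sizes, and all other hyperparameters are illustrative assumptions, not the authors' published configuration.

import torch
import torch.nn as nn

FEAT_DIM = 74      # engineered acoustic features per timestep (from the abstract)
MASK_PROB = 0.15   # hypothetical masking rate
N_EMOTIONS = 6     # CMU-MOSEI annotates six emotions (happy, sad, angry, fear, disgust, surprise)

class Encoder(nn.Module):
    # Shared backbone; the paper compares against a baseline with an identical backbone.
    def __init__(self, d_model=128, nhead=4, nlayers=2):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)

    def forward(self, x):                   # x: (batch, time, 74)
        return self.encoder(self.proj(x))   # (batch, time, d_model)

def pretrain_step(encoder, recon_head, feats, optimizer):
    # Self-supervised objective: zero out random timesteps and reconstruct them.
    mask = torch.rand(feats.shape[:2]) < MASK_PROB   # (batch, time) boolean mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # hide the selected timesteps
    pred = recon_head(encoder(corrupted))            # (batch, time, 74)
    loss = nn.functional.mse_loss(pred[mask], feats[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

encoder = Encoder()
recon_head = nn.Linear(128, FEAT_DIM)      # predicts the masked feature vectors
emotion_head = nn.Linear(128, N_EMOTIONS)  # regresses per-emotion intensities

# Stage 1: self-supervised pre-training on (stand-in) unlabeled sequences.
pre_opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()), lr=1e-4)
unlabeled = torch.randn(32, 50, FEAT_DIM)
print("pretrain loss:", pretrain_step(encoder, recon_head, unlabeled, pre_opt))

# Stage 2: fine-tune the pre-trained encoder on a small annotated sample.
# Training with L1 loss makes MAE, the paper's headline metric, the objective.
labeled_x = torch.randn(8, 50, FEAT_DIM)
labeled_y = 3 * torch.rand(8, N_EMOTIONS)  # CMU-MOSEI intensities lie in [0, 3]
ft_opt = torch.optim.Adam(list(encoder.parameters()) + list(emotion_head.parameters()), lr=1e-4)
pred = emotion_head(encoder(labeled_x).mean(dim=1))  # mean-pool over time
mae = nn.functional.l1_loss(pred, labeled_y)
ft_opt.zero_grad()
mae.backward()
ft_opt.step()
print("fine-tune MAE:", mae.item())

Because fine-tuning reuses the pre-trained encoder weights, the annotated data only needs to adapt a small regression head, which is consistent with the abstract's finding that the gains from pre-training are largest when labeled examples are scarce.
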
Pages: 195-207 (13 pages)
Related Papers (50 total)
  • [1] An Emotion Recognition Method Based On Feature Fusion and Self-Supervised Learning
    Cao, Xuanmeng
    Sun, Ming
    [J]. 2023 2ND ASIA CONFERENCE ON ALGORITHMS, COMPUTING AND MACHINE LEARNING, CACML 2023, 2023, : 216 - 221
  • [2] SELF-SUPERVISED LEARNING FOR ECG-BASED EMOTION RECOGNITION
    Sarkar, Pritam
    Etemad, Ali
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3217 - 3221
  • [3] Transformer-Based Self-Supervised Learning for Emotion Recognition
    Vazquez-Rodriguez, Juan
    Lefebvre, Gregoire
    Cumin, Julien
    Crowley, James L.
    [J]. 2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 2605 - 2612
  • [4] Self-supervised representation learning using multimodal Transformer for emotion recognition
    Goetz, Theresa
    Arora, Pulkit
    Erick, F. X.
    Holzer, Nina
    Sawant, Shrutika
    [J]. PROCEEDINGS OF THE 8TH INTERNATIONAL WORKSHOP ON SENSOR-BASED ACTIVITY RECOGNITION AND ARTIFICIAL INTELLIGENCE, IWOAR 2023, 2023,
  • [5] Transformer-based Self-supervised Representation Learning for Emotion Recognition Using Bio-signal Feature Fusion
    Sawant, Shrutika S.
    Erick, F. X.
    Arora, Pulkit
    Pahl, Jaspar
    Foltyn, Andreas
    Holzer, Nina
    Gotz, Theresa
    [J]. 2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2023,
  • [6] Self-Supervised ECG Representation Learning for Emotion Recognition
    Sarkar, Pritam
    Etemad, Ali
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (03) : 1541 - 1554
  • [7] SPEECH EMOTION RECOGNITION USING SELF-SUPERVISED FEATURES
    Morais, Edmilson
    Hoory, Ron
    Zhu, Weizhong
    Gat, Itai
    Damasceno, Matheus
    Aronowitz, Hagai
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6922 - 6926
  • [8] Applying Self-Supervised Representation Learning for Emotion Recognition Using Physiological Signals
    Quispe, Kevin G. Montero G.
    Utyiama, Daniel M. S.
    dos Santos, Eulanda M. M.
    Oliveira, Horacio A. B. F.
    Souto, Eduardo J. P.
    [J]. SENSORS, 2022, 22 (23)
  • [9] Audio-based Deep Music Emotion Recognition
    Liu, Tong
    Han, Li
    Ma, Liangkai
    Guo, Dongwei
    [J]. 6TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN, MANUFACTURING, MODELING AND SIMULATION (CDMMS 2018), 2018, 1967
  • [10] Using the Fisher Vector Representation for Audio-based Emotion Recognition
    Gosztolya, Gabor
    [J]. ACTA POLYTECHNICA HUNGARICA, 2020, 17 (06) : 7 - 23