Audio-Based Emotion Recognition Using Self-Supervised Learning on an Engineered Feature Space

Citations: 0
Authors
Nimitsurachat, Peranut [1 ]
Washington, Peter [2 ]
Affiliations
[1] Stanford Univ, Inst Computat & Math Engn ICME, Stanford, CA 94305 USA
[2] Univ Hawaii Manoa, Informat & Comp Sci, Honolulu, HI 96822 USA
Funding
National Institutes of Health (NIH), USA
Keywords
emotion classification; emotion recognition; self-supervised learning; transfer learning;
DOI
10.3390/ai5010011
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performing models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods which can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised pre-training to the classification of emotions from the acoustic data of the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. Unlike prior papers that have experimented with raw acoustic data, our technique is applied to encoded acoustic data, represented as a 74-dimensional vector of distinctive audio features at each discrete timestep. Our model is first pre-trained to reconstruct randomly masked timesteps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is evaluated via overall mean absolute error (MAE), MAE per emotion, overall four-class accuracy, and four-class accuracy per emotion. These metrics are compared against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics, especially when the number of annotated data points in the fine-tuning step is small. Furthermore, we quantify the behavior of the self-supervised model and its convergence as the amount of annotated data increases. This work characterizes the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small and that the effect is most pronounced for emotions that are easier to classify, such as happy, sad, and angry. This work further demonstrates that self-supervised learning still improves performance when applied to the embedded feature representations rather than the traditional approach of pre-training on the raw input space.
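The abstract describes a masked-timestep pre-training objective over an engineered acoustic feature space, followed by supervised fine-tuning on a small annotated sample. The following is a minimal PyTorch sketch of that general recipe, not the authors' implementation: the backbone architecture, masking ratio, sequence length, and head sizes are illustrative assumptions. Only the 74-dimensional feature space, the masked-timestep reconstruction objective, and the MAE-style evaluation are taken from the abstract.

```python
# Minimal sketch (not the authors' code) of masked-timestep pre-training on an
# engineered 74-dimensional acoustic feature space, assuming PyTorch.
import torch
import torch.nn as nn

FEAT_DIM = 74       # per-timestep engineered acoustic features (from the paper)
SEQ_LEN = 50        # assumed sequence length (illustrative)
NUM_EMOTIONS = 6    # CMU-MOSEI-style emotion intensities; illustrative head size

class Encoder(nn.Module):
    """Shared backbone: projects features and contextualizes timesteps."""
    def __init__(self, d_model=128, nhead=4, nlayers=2):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)

    def forward(self, x):                  # x: (batch, seq, FEAT_DIM)
        return self.encoder(self.proj(x))  # (batch, seq, d_model)

class MaskedReconstructor(nn.Module):
    """Pre-training head: reconstruct the 74-dim vectors at masked timesteps."""
    def __init__(self, encoder, d_model=128):
        super().__init__()
        self.encoder = encoder
        self.decode = nn.Linear(d_model, FEAT_DIM)

    def forward(self, x, mask):            # mask: (batch, seq) bool, True = masked
        x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked steps
        return self.decode(self.encoder(x_masked))

def pretrain_step(model, x, optimizer, mask_ratio=0.15):
    mask = torch.rand(x.shape[:2]) < mask_ratio
    recon = model(x, mask)
    loss = ((recon - x) ** 2)[mask].mean()  # MSE over masked timesteps only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class EmotionRegressor(nn.Module):
    """Fine-tuning head: mean-pool the encoder and regress emotion intensities."""
    def __init__(self, encoder, d_model=128):
        super().__init__()
        self.encoder = encoder             # weights carried over from pre-training
        self.head = nn.Linear(d_model, NUM_EMOTIONS)

    def forward(self, x):
        return self.head(self.encoder(x).mean(dim=1))

if __name__ == "__main__":
    enc = Encoder()
    pre = MaskedReconstructor(enc)
    opt = torch.optim.Adam(pre.parameters(), lr=1e-4)
    x = torch.randn(8, SEQ_LEN, FEAT_DIM)  # toy batch of encoded acoustic data
    print("pretrain loss:", pretrain_step(pre, x, opt))

    # Fine-tune on a small annotated sample with an L1 objective, mirroring
    # the paper's MAE evaluation metric.
    model = EmotionRegressor(enc)
    y = torch.rand(8, NUM_EMOTIONS) * 3    # assumed [0, 3] intensity labels
    ft_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss = nn.functional.l1_loss(model(x), y)
    ft_opt.zero_grad()
    loss.backward()
    ft_opt.step()
    print("fine-tune MAE loss:", loss.item())
```

The L1 fine-tuning objective is chosen here simply to mirror the MAE evaluation metric named in the abstract; the actual loss, pooling strategy, and output formulation used by the authors may differ.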
Pages: 195-207
Page count: 13