Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Cited by: 35
Authors
Siriwardhana, Shamane [1]
Reis, Andrew [1]
Weerasekera, Rivindu [1]
Nanayakkara, Suranga [1]
Affiliations
[1] Univ Auckland, Auckland Bioengn Inst, Augmented Human Lab, Auckland, New Zealand
Source
INTERSPEECH 2020
Keywords
speech emotion recognition; self supervised learning; Transformers; BERT; multimodal deep learning
DOI
10.21437/Interspeech.2020-1212
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), we show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results. We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones when using SSL models that have similar architectural properties to BERT.
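
A minimal sketch (an assumption, not the authors' released code) of the shallow-fusion setup the abstract describes: two pretrained "BERT-like" SSL encoders, one per modality, are fine-tuned jointly while a small head classifies the concatenation of their pooled outputs. Hugging Face wav2vec 2.0 stands in here for the paper's speech-side SSL model; the checkpoint names, mean-pooling, hidden sizes, and the 4-class emotion head are illustrative choices rather than details taken from the paper.

import torch
import torch.nn as nn
from transformers import RobertaModel, Wav2Vec2Model

class ShallowFusionSER(nn.Module):
    """Jointly fine-tuned text + speech SSL encoders with simple (concatenation) fusion."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Modality-specific pretrained SSL encoders; both receive gradients during fine-tuning.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.speech_encoder.config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, speech_values):
        # Mean-pool each modality's hidden states, then concatenate: the "simple" fusion.
        text_vec = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state.mean(dim=1)
        speech_vec = self.speech_encoder(speech_values).last_hidden_state.mean(dim=1)
        return self.classifier(torch.cat([text_vec, speech_vec], dim=-1))

# A single optimizer over all parameters is what makes the fine-tuning "joint":
# model = ShallowFusionSER()
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)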
Pages: 3755-3759
Number of pages: 5