Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Cited by: 35
Authors
Siriwardhana, Shamane [1 ]
Reis, Andrew [1 ]
Weerasekera, Rivindu [1 ]
Nanayakkara, Suranga [1 ]
Affiliations
[1] Univ Auckland, Auckland Bioengn Inst, Augmented Human Lab, Auckland, New Zealand
Keywords
speech emotion recognition; self supervised learning; Transformers; BERT; multimodal deep learning;
DOI
10.21437/Interspeech.2020-1212
CLC (Chinese Library Classification)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104 ; 100213 ;
Abstract
Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self-Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), we show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results. We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones when using SSL models that have similar architectural properties to BERT.
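The "simple fusion mechanism" the abstract contrasts with more complex alternatives is, in spirit, concatenating pooled utterance-level embeddings from the speech and text SSL encoders and feeding the result to a shallow classifier. A minimal NumPy sketch of that idea follows; the embedding size, class count, and random vectors are illustrative stand-ins, not the paper's actual models or weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pooled utterance-level embeddings from two pretrained
# SSL encoders (one speech, one text); 768 is a typical BERT-like size.
speech_emb = rng.standard_normal(768)
text_emb = rng.standard_normal(768)

# "Simple" fusion: concatenate the two modality embeddings ...
fused = np.concatenate([speech_emb, text_emb])  # shape (1536,)

# ... then apply a single linear layer plus softmax over emotion classes
# (4 classes as in the common IEMOCAP setup; weights here are random).
num_classes = 4
W = rng.standard_normal((num_classes, fused.shape[0])) * 0.01
b = np.zeros(num_classes)
logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)  # (4,)
```

In joint fine-tuning, gradients from this classifier would flow back into both pretrained encoders rather than treating their embeddings as frozen features.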
Pages: 3755-3759
Page count: 5