Speech Emotion Recognition via Multi-Level Cross-Modal Distillation

Cited by: 4
Authors:
Li, Ruichen [1 ]
Zhao, Jinming [1 ]
Jin, Qin [1 ]
Affiliations:
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
Source:
INTERSPEECH 2021
Funding:
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords:
speech emotion recognition; cross-modal transfer; pretraining;
DOI:
10.21437/Interspeech.2021-785
Chinese Library Classification:
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes:
100104; 100213;
Abstract:
Speech emotion recognition faces the problem that most of the existing speech corpora are limited in scale and diversity due to the high annotation cost and label ambiguity. In this work, we explore the task of learning robust speech emotion representations based on large unlabeled speech data. Under a simple assumption that the internal emotional states across different modalities are similar, we propose a method called Multi-level Cross-modal Emotion Distillation (MCED), which trains the speech emotion model without any labeled speech emotion data by transferring emotion knowledge from a pretrained text emotion model. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, show that our proposed MCED can help learn effective speech emotion representations which generalize well on downstream speech emotion recognition tasks.
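The abstract's core idea, training the speech model to mimic the soft emotion predictions of a pretrained text teacher, can be illustrated with a standard knowledge-distillation loss. This is a generic, minimal sketch and not the paper's actual multi-level MCED formulation; the function names and the temperature value are illustrative assumptions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened emotion distributions.

    The teacher (text emotion model) supplies soft targets; the student
    (speech emotion model) is trained to match them, so no speech
    emotion labels are needed.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1)
    return float(kl.mean())

# Toy example: 4 emotion classes, one utterance.
teacher = np.array([[2.0, 0.1, -1.0, 0.3]])  # text model logits
student = np.array([[0.5, 0.2, -0.2, 0.1]])  # speech model logits
loss = distillation_loss(student, teacher)   # positive; 0 iff distributions match
```

In the actual method this matching is applied at multiple representation levels rather than only on the final class distribution, which is what "multi-level" refers to in the title.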
Pages: 4488-4492
Page count: 5
Related Papers
50 records in total
  • [31] Speech Emotion Recognition With Early Visual Cross-modal Enhancement Using Spiking Neural Networks
    Mansouri-Benssassi, Esma
    Ye, Juan
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [32] Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Li, Yuan-Fang
    Wu, Yiling
    Li, Minglei
    Zhu, Chuanbo
    [J]. KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [33] Hierarchical Cross-Modal Interaction and Fusion Network Enhanced with Self-Distillation for Emotion Recognition in Conversations
    Wei, Puling
    Yang, Juan
    Xiao, Yali
    [J]. ELECTRONICS, 2024, 13 (13)
  • [34] Cross-modal knowledge distillation for continuous sign language recognition
    Gao, Liqing
    Shi, Peng
    Hu, Lianyu
    Feng, Jichao
    Zhu, Lei
    Wan, Liang
    Feng, Wei
    [J]. NEURAL NETWORKS, 2024, 179
  • [35] Progressive Cross-modal Knowledge Distillation for Human Action Recognition
    Ni, Jianyuan
    Ngu, Anne H. H.
    Yan, Yan
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5903 - 5912
  • [36] DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition
    Wang, Sijie
    She, Rui
    Kang, Qiyu
    Jian, Xingchao
    Zhao, Kai
    Song, Yang
    Tay, Wee Peng
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 9, 2024, : 10377 - 10385
  • [37] Speech Emotion Recognition based on Multi-Level Residual Convolutional Neural Networks
    Zheng, Kai
    Xia, ZhiGuang
    Zhang, Yi
    Xu, Xuan
    Fu, Yaqin
    [J]. ENGINEERING LETTERS, 2020, 28 (02) : 559 - 565
  • [38] Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition
    Hong, Soyeon
    Kang, Hyeoungguk
    Cho, Hyunsouk
    [J]. IEEE ACCESS, 2024, 12 : 14324 - 14333
  • [39] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    [J]. MATHEMATICS, 2022, 10 (18)
  • [40] Efficient multi-level cross-modal fusion and detection network for infrared and visible image
    Gao, Hongwei
    Wang, Yutong
    Sun, Jian
    Jiang, Yueqiu
    Gai, Yonggang
    Yu, Jiahui
    [J]. ALEXANDRIA ENGINEERING JOURNAL, 2024, 108 : 306 - 318