Speech Emotion Recognition via Multi-Level Cross-Modal Distillation

Cited by: 4
Authors
Li, Ruichen [1 ]
Zhao, Jinming [1 ]
Jin, Qin [1 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
Source
INTERSPEECH 2021
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
speech emotion recognition; cross-modal transfer; pretraining;
DOI
10.21437/Interspeech.2021-785
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Code
100104; 100213
Abstract
Speech emotion recognition faces the problem that most of the existing speech corpora are limited in scale and diversity due to the high annotation cost and label ambiguity. In this work, we explore the task of learning robust speech emotion representations based on large unlabeled speech data. Under a simple assumption that the internal emotional states across different modalities are similar, we propose a method called Multi-level Cross-modal Emotion Distillation (MCED), which trains the speech emotion model without any labeled speech emotion data by transferring emotion knowledge from a pretrained text emotion model. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, show that our proposed MCED can help learn effective speech emotion representations which generalize well on downstream speech emotion recognition tasks.
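The abstract describes training a speech student to match soft emotion predictions produced by a frozen, pretrained text teacher on paired utterances. The paper's multi-level MCED objective is not reproduced here; purely as an illustration, the sketch below shows the generic temperature-scaled distillation loss such a teacher-student setup typically minimizes (all function and variable names are hypothetical, and the actual MCED losses differ).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened emotion distributions.

    In a cross-modal setup, `teacher_logits` would come from a frozen
    text emotion model and `student_logits` from the speech model being
    trained on the same (unlabeled) utterances.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

# Identical logits give zero loss; mismatched logits give a positive loss.
loss_same = distillation_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
loss_diff = distillation_loss([[3.0, 0.0, 0.0]], [[0.0, 0.0, 3.0]])
```

A higher temperature softens the teacher's distribution, exposing relative similarities among emotion classes rather than only the argmax label, which is the usual motivation for distilling soft targets instead of hard pseudo-labels.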
Pages: 4488-4492 (5 pages)
Related Papers (50 total)
  • [21] Cross-modal collaborative representation and multi-level supervision for crowd counting
    Li, Shufang
    Hu, Zhengping
    Zhao, Mengyao
    Bi, Shuai
    Sun, Zhe
    SIGNAL IMAGE AND VIDEO PROCESSING, 2023, 17 (03): 601-608
  • [22] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval
    Dong, Jianfeng
    Long, Zhongzi
    Mao, Xiaofeng
    Lin, Changting
    He, Yuan
    Ji, Shouling
    NEUROCOMPUTING, 2021, 440: 207-219
  • [24] Semantic enhancement and multi-level alignment network for cross-modal retrieval
    Chen, Jia
    Zhang, Hong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024
  • [25] Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
    Liang, Jingjun
    Li, Ruichen
    Jin, Qin
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 2852-2861
  • [26] Low-Order Multi-Level Features for Speech Emotion Recognition
    Tamulevicius, Gintautas
    Liogiene, Tatjana
    BALTIC JOURNAL OF MODERN COMPUTING, 2015, 3 (04): 234-247
  • [27] Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation
    Cho, Won Ik
    Kwak, Donghyun
    Yoon, Ji Won
    Kim, Nam Soo
    INTERSPEECH 2020, 2020: 896-900
  • [28] Image Emotion Recognition via Fusion Multi-Level Representations
    Zhang, Hao
    Li, Haipeng
    Peng, Guoqin
    Liu, Yan'an
    Xu, Dan
    Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (10): 1566-1576
  • [29] Multi-level cross-modal contrastive learning for review-aware recommendation
    Wei, Yibiao
    Xu, Yang
    Zhu, Lei
    Ma, Jingwei
    Peng, Chengmei
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 247
  • [30] Speech2Video: Cross-Modal Distillation for Speech to Video Generation
    Si, Shijing
    Wang, Jianzong
    Qu, Xiaoyang
    Cheng, Ning
    Wei, Wenqi
    Zhu, Xinghua
    Xiao, Jing
    INTERSPEECH 2021, 2021: 1629-1633